Daily arXiv Papers - 2026-01-02

AI-enhanced summaries of 21 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Zahra Abedi, Richard M. K. van Dijk, Gijs Wijnholds, Tessa Verhoef

Main category: cs.CL

TL;DR: This paper presents an automated pipeline combining OCR, LLM-based interpretation, and database linking to digitize and analyze historical professor records from Leiden University books, achieving good accuracy in data extraction and record linkage.

Motivation: To develop an automated system for digitizing and harmonizing historical biographic data from typewritten documents with existing high-quality database records, addressing challenges in digital humanities like layout variability and terminology differences.

Method: Applied OCR techniques, generative AI with decoding constraints for structured data extraction, and database linkage methods to process historical typewritten records into digital format.

Result: OCR achieved CER of 1.08% and WER of 5.06%. JSON extraction from OCR text achieved 63% accuracy (65% from annotated OCR). Record linkage achieved 94% accuracy with annotated JSON and 81% with OCR-derived JSON, showing generative AI can correct low OCR performance.
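
Since CER and WER anchor these results, a minimal sketch of the two metrics as standard Levenshtein-distance rates may help; this is illustrative, not the authors' evaluation code.

```python
# CER/WER as edit-distance rates (illustrative, not the paper's code).
def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("professor of law", "professer of law"))  # 1 of 16 chars wrong
print(wer("professor of law", "professer of law"))  # 1 of 3 words wrong
```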

Conclusion: The study contributes to digital humanities by offering an effective automated pipeline for interpreting historical documents, demonstrating the applicability of generative AI models for correcting OCR errors and harmonizing data with existing databases.

Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.

[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano

Main category: cs.CL

TL;DR: CAT framework evaluates LLM accuracy and response consistency interplay using CAR curves and CORE index, moving beyond independent metrics to assess their trade-off.

Motivation: Current LLM evaluation focuses on accuracy/benchmark scores, with consistency emerging as important for real-world deployment. However, existing methods evaluate these dimensions independently without considering their interdependency, which is crucial for nuanced LLM assessment.

Method: CAT framework introduces Consistency-Accuracy Relation (CAR) curves that visualize how accuracy varies with increasing consistency requirements, measured by Minimum-Consistency Accuracy (MCA). Also proposes Consistency-Oriented Robustness Estimate (CORE) index that combines CAR curve area and shape to quantify accuracy-consistency trade-off.
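
A CAR-style curve can be traced under one plausible reading of these definitions (the paper's exact formulation may differ): count an item as correct at threshold t only if its across-run agreement reaches t and its majority answer matches the gold label, then take the area under the resulting curve as a CORE-like summary.

```python
# Hypothetical MCA/CAR/CORE sketch; the definitions here are our
# assumptions, not necessarily the paper's exact metrics.
from collections import Counter

def mca(runs, gold, t):
    """runs[i]: answers for item i over repeated generations."""
    ok = 0
    for answers, g in zip(runs, gold):
        top, freq = Counter(answers).most_common(1)[0]
        ok += (freq / len(answers) >= t) and (top == g)
    return ok / len(runs)

runs = [["B", "B", "B", "B"], ["A", "C", "A", "B"], ["D", "D", "C", "D"]]
gold = ["B", "A", "D"]
ts = [i / 10 for i in range(11)]
car = [mca(runs, gold, t) for t in ts]               # CAR curve points
core_like = sum((car[i] + car[i + 1]) / 2 * 0.1      # trapezoid area
                for i in range(len(ts) - 1))
print(car, round(core_like, 3))
```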

Result: Demonstrated framework across diverse generalist and domain-specific LLMs on multiple-choice benchmarks. Showed how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.

Conclusion: CAT provides a comprehensive framework for evaluating LLM accuracy-consistency interplay, offering both visualization (CAR curves) and quantification (CORE index) to better assess model reliability for real-world applications.

Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their interdependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.

[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanghui Wang, Jinze Yu, Xing Zhang, Dayuan Jiang, Yin Song, Tomal Deb, Xuefeng Liu, Peiyang He

Main category: cs.CL

TL;DR: A framework for evaluating and improving consistency in LLM-generated structured outputs using STED metric and consistency scoring, showing Claude-3.7-Sonnet as most consistent model.

Motivation: LLMs are increasingly used for structured data generation, but output consistency is critical for production applications. Current methods lack comprehensive evaluation of consistency in structured outputs.

Method: Combines (1) STED (Semantic Tree Edit Distance) - a novel similarity metric balancing semantic flexibility with structural strictness for JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations.
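
To make the intent concrete, a toy stand-in for a STED-style comparison (not the paper's actual metric) can be strict on JSON structure, lenient on leaf values, and aggregate repeated generations by mean pairwise similarity.

```python
# Toy STED-like similarity and consistency score (illustrative only).
def sted_like(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        if set(a) != set(b):        # structural break: key sets differ
            return 0.0
        return sum(sted_like(a[k], b[k]) for k in a) / len(a) if a else 1.0
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):        # structural break: length mismatch
            return 0.0
        return sum(sted_like(x, y) for x, y in zip(a, b)) / len(a) if a else 1.0
    if type(a) is not type(b):
        return 0.0
    return 1.0 if a == b else 0.5   # crude stand-in for semantic leaf credit

def consistency(outputs):
    """Mean pairwise similarity across repeated generations."""
    pairs = [(i, j) for i in range(len(outputs))
             for j in range(i + 1, len(outputs))]
    return sum(sted_like(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

gens = [{"name": "Ada", "age": 36}, {"name": "Ada Lovelace", "age": 36},
        {"name": "Ada", "age": 36}]
print(consistency(gens))  # high but not perfect: one leaf value drifts
```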

Result: STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) vs existing metrics. Claude-3.7-Sonnet shows exceptional consistency, maintaining near-perfect structural reliability even at high temperatures, while other models degrade significantly.

Conclusion: The framework enables practical applications for model selection, prompt refinement, and diagnostic analysis, providing theoretical foundations and practical tools for reliable structured output generation in LLM-based production systems.

Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T = 0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.

[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Jahidul Islam, Md Ataullha, Saiful Azad

Main category: cs.CL

TL;DR: BanglaCodeAct is an agent-based framework that enables Bangla-to-Python code generation using multilingual LLMs with iterative self-correction, achieving state-of-the-art performance on Bangla NL2Code tasks.

Motivation: While LLMs excel at English-to-code generation, this progress hasn't extended to low-resource languages like Bangla. There's a need for effective Bangla-to-Python code generation without requiring task-specific fine-tuning.

Method: BanglaCodeAct uses an agent-based framework with multi-agent prompting and iterative self-correction. It employs open-source multilingual LLMs within a Thought-Code-Observation loop for dynamic generation, testing, and refinement of code from Bangla instructions.
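
A hypothetical sketch of such a Thought-Code-Observation loop follows; `generate` stands in for any multilingual LLM call and is not the authors' API.

```python
# Hypothetical self-correction loop (a sketch, not BanglaCodeAct itself).
import subprocess, sys, tempfile

def run_python(code: str) -> str:
    """Execute candidate code and return stderr (errors) or stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=10)
    return proc.stderr or proc.stdout

def solve(bangla_instruction: str, generate, max_rounds: int = 3) -> str:
    prompt = f"Instruction (Bangla): {bangla_instruction}\nWrite Python code."
    code = ""
    for _ in range(max_rounds):
        code = generate(prompt)             # Thought + Code
        observation = run_python(code)      # Observation
        if "Traceback" not in observation:  # crude acceptance test
            break
        prompt += f"\nPrevious code failed:\n{observation}\nFix it."
    return code
```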

Result: Qwen3-8B with BanglaCodeAct achieves best performance: 94.0% pass@1 on development set and 71.6% on blind test set of mHumanEval dataset for Bangla NL2Code, establishing a new benchmark for Bangla-to-Python translation.

Conclusion: The research demonstrates the potential of agent-based reasoning for reliable code generation in low-resource languages, establishing a new benchmark for Bangla-to-Python translation without requiring task-specific fine-tuning.

Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.

[5] MiMo-Audio: Audio Language Models are Few-Shot Learners

Xiaomi LLM-Core Team, Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang

Main category: cs.CL

TL;DR: MiMo-Audio scales audio pretraining to 100M+ hours, achieving emergent few-shot learning and SOTA performance across diverse audio tasks without task-specific fine-tuning.

Motivation: Current audio models require task-specific fine-tuning, unlike humans who generalize from few examples. Inspired by GPT-3's scaling success in text, the authors apply similar scaling to audio to achieve generalization capabilities.

Method: Scale next-token prediction pretraining to 100M+ hours of audio data (MiMo-Audio-7B-Base), then apply instruction-tuning with thinking mechanisms for both understanding and generation tasks (MiMo-Audio-7B-Instruct).

Result: Achieves SOTA among open-source models on speech intelligence and audio understanding benchmarks. Shows emergent few-shot learning, generalization to unseen tasks (voice conversion, style transfer), and powerful speech continuation capabilities. Instruction-tuned version approaches/surpasses closed-source models on multiple benchmarks.

Conclusion: Scaling audio pretraining enables emergent few-shot learning and strong generalization, similar to GPT-3’s success in text. MiMo-Audio demonstrates the viability of foundation models for audio that can handle diverse tasks without task-specific fine-tuning.

Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio’s pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

[6] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents

Tingwei Xie, Tianyi Zhou, Yonghong Song

Main category: cs.CL

TL;DR: PharmaShip is a Chinese pharmaceutical shipping document dataset with noisy OCR and diverse templates for testing text-layout models on SER, RE, and ROP tasks, showing that combining pixel, geometry, and reading-order information yields best performance.

Motivation: To create a real-world benchmark for evaluating pre-trained text-layout models under challenging conditions (noisy OCR, heterogeneous templates) in the safety-critical pharmaceutical domain, where accurate document understanding is essential.

Method: Created PharmaShip dataset with scanned pharmaceutical shipping documents, designed three complementary tasks (SER, RE, ROP), used entity-centric evaluation protocol, benchmarked five representative baselines (LiLT, LayoutLMv3-base, GeoLayoutLM and RORE-enhanced variants), standardized preprocessing, splits, and optimization.

Result: Pixels and explicit geometry provide complementary inductive biases but neither alone is sufficient; reading-order-oriented regularization consistently improves SER and EL; longer positional coverage stabilizes late-page predictions; ROP is accurate at word level but challenging at segment level due to boundary ambiguity and long-range crossings.

Conclusion: PharmaShip establishes a controlled, reproducible benchmark for pharmaceutical document understanding and highlights sequence-aware constraints as a transferable bias for structure modeling, with the best configuration combining pixel, geometry, and reading-order information.

Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks-sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP)-and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.

[7] Noise-Driven Persona Formation in Reflexive Neural Language Generation

Toshiyuki Shigemura

Main category: cs.CL

TL;DR: LN-RP is a computational framework that uses stochastic noise injection to study how personas emerge in LLMs, revealing three stable persona modes with distinct entropy patterns.

Motivation: To develop a reproducible method for studying noise-driven persona emergence, reflexive generation dynamics, and long-range linguistic coherence in large language models.

Method: The Luca-Noise Reflex Protocol injects stochastic noise seeds into initial generation states and analyzes linguistic behavior across 152 generation cycles to observe nonlinear transitions.

Result: The study reveals three stable persona modes with distinct entropy signatures, shows external noise can reliably induce phase transitions, and confirms consistent persona retention with significant differences across modes (p < 0.01).

Conclusion: LN-RP provides a reproducible framework for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs through controlled noise injection.

Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.

[8] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu

Main category: cs.CL

TL;DR: HarmTransform is a multi-agent debate framework that systematically transforms harmful queries into stealthier forms to improve LLM safety alignment by generating better training data.

Motivation: Current LLM safety mechanisms focus on overtly dangerous content but fail to detect subtle threats where users disguise harmful intent through covert rephrasing that appears benign, creating a gap in safety training data.

Method: HarmTransform uses a multi-agent debate framework with iterative critique and refinement among multiple agents to systematically transform harmful queries into stealthier forms while preserving underlying harmful intent.

Result: Experiments show HarmTransform significantly outperforms standard baselines in producing effective query transformations. Debate acts as a double-edged sword: it sharpens transformations and improves stealth, but may also introduce topic shifts and unnecessary complexity.

Conclusion: Multi-agent debate shows promise for generating comprehensive safety training data, but has limitations including potential topic shifts and complexity, highlighting the need for balanced approaches to LLM safety alignment.

Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

[9] Emergent World Beliefs: Exploring Transformers in Stochastic Games

Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs can learn world models for partially observable environments like poker, representing both deterministic (hand ranks) and stochastic (equity) features without explicit training.

Motivation: Extend investigation of LLM world models from perfect information games to incomplete information domains, using poker as a canonical POMDP to understand how models represent stochastic environments.

Method: Pretrained GPT-style model on Poker Hand History (PHH) data and probed internal activations using primarily nonlinear probes to analyze learned representations.
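
As a concrete picture of what "probing internal activations" means here, a minimal nonlinear probe is just a small MLP regressing a belief-state feature (such as equity) from cached activations; the layer, shapes, and random data below are placeholders, not the paper's setup.

```python
# Minimal nonlinear probe sketch (placeholder data, not the paper's setup).
import torch
import torch.nn as nn

acts = torch.randn(4096, 768)   # cached activations from one layer (stand-in)
equity = torch.rand(4096, 1)    # target belief feature, e.g., win probability

probe = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(acts), equity)
    loss.backward()
    opt.step()
print(f"final probe MSE: {loss.item():.4f}")  # low MSE => feature is decodable
```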

Result: Model learned both deterministic structure (hand ranks) and stochastic features (equity) without explicit instruction. Representations are decodeable and correlate with theoretical belief states.

Conclusion: LLMs can learn their own representation of stochastic environments in partially observable domains like Texas Hold’em Poker, suggesting emergent world modeling capabilities extend beyond perfect information games.

Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrated that these representations are decodeable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold’em Poker.

[10] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

Hengli Li, Zhaoxin Yu, Qi Shen, Chenxi Li, Mengmeng Wang, Tinglang Wu, Yipeng Kang, Yuxuan Wang, Song-Chun Zhu, Zixia Jia, Zilong Zheng

Main category: cs.CL

TL;DR: BEDA framework bridges belief estimation and strategic dialogue generation by formalizing adversarial/alignment acts and using probabilistic constraints to guide utterance generation.

Motivation: Prior work accurately estimates beliefs but lacks principled mechanisms to use those beliefs during dialogue generation, creating a gap between belief estimation and strategic dialogue execution.

Method: Formalizes adversarial and alignment dialogue acts, operationalizes them via probabilistic constraints, and implements BEDA framework with world set, belief estimator, and conditional generator that selects acts consistent with inferred beliefs.

Result: BEDA consistently outperforms baselines across three settings: improves success rate by 5.0+ points on CKBG (20.6 points with GPT-4.1-nano), 9.3 points on Mutual Friends, and achieves optimal deal on CaSiNo.

Conclusion: Casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue, bridging the gap between belief estimation and generation.

Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

[11] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

Anwar Alajmi, Gabriele Pergola

Main category: cs.CL

TL;DR: Two-stage framework combining targeted training procedures with reasoning-based inference to detect subtle sexist content, addressing data scarcity, noise, and conceptual ambiguity.

Motivation: Traditional detection methods fail to identify subtle, context-dependent sexist content due to overlapping linguistic, psychological, legal, and cultural dimensions that create mixed signals, label scarcity, class imbalance, and unstable decision boundaries.

Method: Two-stage framework: (1) Training with class-balanced focal loss, class-aware batching, and threshold calibration to handle imbalance and noise; (2) Inference with dynamic routing that sends high-confidence cases directly to classification and uncertain cases to a Collaborative Expert Judgment module with multiple personas and a judge model.
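
The two ingredients most amenable to a sketch are the class-balanced focal loss and the confidence-based router; the alpha weights and threshold below are illustrative assumptions, and the CEJ escalation target is abstracted to a sentinel value.

```python
# Sketch of focal loss + confidence routing (thresholds are assumptions).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """alpha: per-class weights; (1 - p_t)^gamma down-weights easy examples."""
    log_p_t = F.log_softmax(logits, dim=-1).gather(
        1, targets.unsqueeze(1)).squeeze(1)
    p_t = log_p_t.exp()
    return (-alpha[targets] * (1 - p_t) ** gamma * log_p_t).mean()

def route(probs, threshold=0.8):
    """Classify confident cases directly; mark uncertain ones (-1) for
    escalation to the expert-debate (CEJ) stage."""
    conf, pred = probs.max(dim=-1)
    return torch.where(conf >= threshold, pred, torch.full_like(pred, -1))

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
alpha = torch.tensor([0.25, 0.75])   # up-weight the rarer positive class
print(focal_loss(logits, targets, alpha))
print(route(F.softmax(logits, dim=-1)))
```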

Result: State-of-the-art performance: +2.72% F1 improvement on EXIST 2025 Task 1.1, +4.48% and +1.30% gains on EDOS Tasks A and B respectively.

Conclusion: The proposed framework effectively addresses the combined challenges of underrepresentation, noise, and conceptual ambiguity in sexist content detection through unified training and reasoning-based inference approaches.

Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on EDOS Tasks A and B, respectively.

[12] Break Out the Silverware – Semantic Understanding of Stored Household Items

Michaela Levi-Richter, Reuth Mirsky, Oren Glickman

Main category: cs.CL

TL;DR: A benchmark for evaluating robots’ ability to predict where household items are stored when not visible, with a hybrid vision-language model (NOAM) that approaches human-level performance.

DetailsMotivation: Domestic service robots lack commonsense reasoning to find everyday items stored out of sight in drawers, cabinets, or closets, despite advances in vision and manipulation.

Method: Introduced the Stored Household Item Challenge benchmark with two datasets: 100 real-world item-image pairs and 6,500 development pairs. Created NOAM, a hybrid pipeline that converts visual input to natural language descriptions of spatial context and visible containers, then uses LLMs (like GPT-4) to infer hidden storage locations.

Result: NOAM significantly outperforms baselines (random selection, vision-language pipelines, multimodal models) and approaches human-level prediction accuracy for item storage location inference.

Conclusion: The benchmark enables comparative evaluation of robotic cognitive capabilities, and NOAM demonstrates effective commonsense reasoning through integrated vision-language processing, highlighting best practices for deploying cognitively capable domestic agents.

Abstract: “Bring me a plate.” For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots’ cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants’ kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

[13] Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization

Yun Tang, Cindy Tseng

Main category: cs.CL

TL;DR: Chunk SSL: A self-supervised learning method for both streaming and offline speech pre-training using chunk-based masked prediction with finite scalar quantization and group masking to handle partial utterances efficiently.

Motivation: Current self-supervised learning algorithms assume full utterances, making them suboptimal for streaming applications with partial utterances. There's a need for a unified solution that works for both streaming and offline speech pre-training.

Method: Proposes Chunk SSL with chunk-based masked prediction loss, copy-and-append data augmentation, finite scalar quantization (FSQ) with large codebooks (up to millions), and group masked prediction to reduce memory/computation costs.
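
Since FSQ is the least familiar piece, here is a generic FSQ sketch in the spirit of finite scalar quantization (not the paper's module): each latent dimension is squashed, rounded to a few levels, and trained through a straight-through estimator, with the implicit codebook size equal to the product of per-dimension level counts.

```python
# Generic FSQ sketch (not the paper's module); levels are illustrative.
import torch

def fsq(z, levels):
    """z: (..., d) latents; levels: per-dimension level counts."""
    L = torch.tensor(levels, dtype=z.dtype)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half      # squash each dim into [-half, half]
    quantized = torch.round(bounded)    # snap to the nearest integer level
    # straight-through estimator: quantized forward, identity gradient
    return bounded + (quantized - bounded).detach()

levels = [8, 8, 8, 5, 5, 5]             # implicit codebook: 8^3 * 5^3 = 64000
codes = fsq(torch.randn(2, 10, 6), levels)
print(codes.shape)                      # more dims/levels reach the millions
```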

Result: Achieves competitive results on Librispeech (speech recognition) and MuST-C (speech translation) datasets for both streaming and offline modes in speech-to-text tasks.

Conclusion: Chunk SSL provides an effective unified solution for streaming and offline speech pre-training, demonstrating that large FSQ codebooks with group masking enable efficient knowledge transfer to downstream tasks.

Abstract: Low-latency speech human-machine communication has become increasingly necessary as speech technology has advanced quickly over the last decade. One of the primary factors behind the advancement of speech technology is self-supervised learning. Most self-supervised learning algorithms are designed with a full-utterance assumption, and compromises have to be made when partial utterances are presented, which is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with the masked prediction loss, and an acoustic encoder is encouraged to restore the indices of masked speech frames with help from unmasked frames in the same chunk and preceding chunks. A copy-and-append data augmentation approach is proposed to conduct efficient chunk-based pre-training. Chunk SSL utilizes a finite scalar quantization (FSQ) module to discretize input speech features, and our study shows that a high-resolution FSQ codebook, i.e., a codebook with a vocabulary size of up to a few million, is beneficial for transferring knowledge from the pre-training task to downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is examined in two speech-to-text tasks, i.e., speech recognition and speech translation. Experimental results on the Librispeech and MuST-C datasets show that the proposed method achieves very competitive results for speech-to-text tasks in both streaming and offline modes.

[14] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Tiancheng Su, Meicong Zhang, Guoxiu He

Main category: cs.CL

TL;DR: EASD enhances speculative decoding by adding entropy-based penalty to reject uncertain draft tokens, enabling performance surpassing the target LLM while maintaining efficiency.

Motivation: Standard speculative decoding is limited by excessive alignment between draft and target models, constraining performance to the target LLM's capabilities. The authors aim to overcome this limitation and potentially surpass the target model's performance.

Method: EASD builds on standard speculative decoding by incorporating a dynamic entropy-based penalty. It uses entropy of sampling distributions to quantify model uncertainty, rejecting and re-sampling tokens when both models show high entropy with substantial overlap in top-N predictions, preventing low-confidence error propagation.
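
The rejection test can be sketched directly from this description; the entropy and overlap thresholds below are illustrative, not the paper's tuned values.

```python
# Sketch of the entropy-aware rejection test (thresholds are assumptions).
import torch

def entropy(p):
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

def should_reject(p_draft, p_target, n=5, h_min=2.0, overlap_min=0.6):
    """Reject and re-sample when both models are uncertain (high entropy)
    and their top-n candidate sets largely agree."""
    both_uncertain = entropy(p_draft) > h_min and entropy(p_target) > h_min
    top_d = set(p_draft.topk(n).indices.tolist())
    top_t = set(p_target.topk(n).indices.tolist())
    return bool(both_uncertain) and len(top_d & top_t) / n >= overlap_min

p_d = torch.softmax(torch.randn(32000), -1)  # draft next-token distribution
p_t = torch.softmax(torch.randn(32000), -1)  # target next-token distribution
print(should_reject(p_d, p_t))
```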

Result: Experiments across multiple reasoning benchmarks show EASD consistently outperforms existing speculative decoding methods and, in most cases, surpasses the target LLM itself. The efficiency of EASD is proven to be comparable to standard SD.

Conclusion: EASD is an effective training-free enhancement to speculative decoding that enables performance surpassing the target LLM while maintaining computational efficiency, addressing limitations of traditional speculative decoding approaches.

Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model’s inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.

[15] Fun-Audio-Chat Technical Report

Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou

Main category: cs.CL

TL;DR: Fun-Audio-Chat is a Large Audio Language Model that solves speech-text temporal mismatch and catastrophic forgetting issues through dual-resolution processing and core-cocktail training, achieving competitive performance on audio tasks while retaining text LLM knowledge.

Motivation: Existing joint speech-text models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge.

Method: 1. Dual-Resolution Speech Representations (DRSR): Shared LLM processes audio at efficient 5Hz via token grouping, while Speech Refined Head generates high-quality tokens at 25Hz. 2. Core-Cocktail Training: Two-stage fine-tuning with intermediate merging to mitigate catastrophic forgetting. 3. Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy.
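
The 25 Hz-to-5 Hz grouping can be illustrated in a few lines; the shapes and the projection layer are assumptions, and the 25 Hz Speech Refined Head is omitted.

```python
# Toy illustration of dual-resolution token grouping (shapes are assumptions).
import torch

speech = torch.randn(1, 250, 1024)        # 10 s of 25 Hz frame embeddings
B, T, D = speech.shape
grouped = speech.view(B, T // 5, 5 * D)   # stack 5 frames -> 50 steps at 5 Hz
proj = torch.nn.Linear(5 * D, 1024)       # project back to the model width
llm_inputs = proj(grouped)
print(llm_inputs.shape)  # torch.Size([1, 50, 1024]); 5x fewer LLM positions
```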

Result: Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy.

Conclusion: Fun-Audio-Chat successfully addresses temporal resolution mismatch and catastrophic forgetting issues through innovative architectural and training approaches, enabling retention of text LLM knowledge while gaining powerful audio understanding, reasoning, and generation capabilities without requiring large-scale audio-text pre-training.

Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .

[16] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

Amal Alqahtani, Efsun Kayi, Mona Diab

Main category: cs.CL

TL;DR: StressRoBERTa: a cross-condition transfer learning model that uses continual training on clinically related mental health conditions (depression, anxiety, PTSD) to improve chronic stress detection in English tweets, achieving 82% F1-score and outperforming previous systems.

Motivation: Chronic stress is a major public health issue, and social media platforms like Twitter provide valuable data for understanding stress experiences. There's a need for better automated detection of self-reported chronic stress in social media text, leveraging the high comorbidity between chronic stress and other mental health conditions.

Method: Uses cross-condition transfer learning approach called StressRoBERTa. Continually trains RoBERTa on Stress-SMHD corpus (108M words from users with self-reported depression, anxiety, PTSD diagnoses), then fine-tunes on SMM4H 2022 Task 8 dataset for chronic stress detection. Compares against general language models and broad mental health models.

Result: StressRoBERTa achieves 82% F1-score on chronic stress detection, outperforming the best shared task system by 3 percentage points (79% F1). Shows +1% F1 improvement over vanilla RoBERTa. Also demonstrates transferability to situational stress discussions with 81% F1 on Dreaddit dataset.

Conclusion: Focused cross-condition transfer learning from stress-related disorders provides stronger representations than general mental health training for chronic stress detection. The approach successfully transfers from clinical mental health contexts to situational stress discussions, demonstrating the value of leveraging comorbid conditions in mental health NLP tasks.

Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

[17] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

Himel Ghosh

Main category: cs.CL

TL;DR: Comparative interpretability study of transformer-based bias detection models using SHAP explanations reveals architectural differences affect reliability and false positive rates in journalistic contexts.

Motivation: Automated bias detection models are widely used in journalism but lack transparency in how they make decisions or why they fail, creating a need for interpretability studies to understand their operational mechanisms.

Method: Comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on BABE dataset, using SHAP-based explanations to analyze word-level attributions across correct and incorrect predictions.
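
The SHAP side of this setup can be reproduced in a few lines using the shap library's support for Hugging Face text-classification pipelines; the checkpoint path below is a placeholder, not the paper's released model.

```python
# Minimal SHAP-over-pipeline sketch; the model path is a placeholder.
import shap
import transformers

clf = transformers.pipeline("text-classification",
                            model="path/to/babe-bias-detector",  # placeholder
                            return_all_scores=True)
explainer = shap.Explainer(clf)  # SHAP wraps the pipeline with a text masker
sv = explainer(["The senator's reckless scheme stunned observers."])
shap.plots.text(sv)              # word-level attributions per label
```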

Result: Both models attend to similar evaluative language categories but integrate signals differently. The bias detector model shows misalignment between attribution strength and prediction correctness, systematically over-flagging neutral content. The domain-adaptive model produces 63% fewer false positives with better-aligned attribution patterns. False positives arise from discourse-level ambiguity rather than explicit bias cues.

Conclusion: Interpretability-aware evaluation is crucial for bias detection systems, and architectural/training choices critically affect model reliability and deployment suitability in journalistic contexts, with domain adaptation showing superior performance.

Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

[18] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

Dingmin Wang, Ji Ma, Shankar Kumar

Main category: cs.CL

TL;DR: Adaptive prompting strategy splits retrieved info into chunks for LLM question answering, balancing relevance vs. irrelevance while reducing token usage.

Motivation: Longer context windows in LLMs for retrieval-augmented generation introduce more irrelevant information that degrades performance, despite making targeted knowledge incorporation easier.

Method: Design adaptive prompting strategy that splits retrieved information into smaller chunks and sequentially prompts LLM to answer questions using each chunk, with adjustable chunk size for trade-off control.
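
A minimal sketch of the adaptive strategy follows; `ask_llm` stands in for any chat-completion call, and the UNKNOWN convention is our own device, motivated by the paper's observation about declining.

```python
# Chunked sequential prompting sketch (`ask_llm` is a stand-in).
def chunked_qa(question, passages, ask_llm, chunk_size=3):
    chunks = [passages[i:i + chunk_size]
              for i in range(0, len(passages), chunk_size)]
    for chunk in chunks:
        prompt = ("Context:\n" + "\n\n".join(chunk) +
                  f"\n\nQuestion: {question}\n"
                  "Answer from the context, or reply UNKNOWN if it is absent.")
        answer = ask_llm(prompt)
        if answer.strip().upper() != "UNKNOWN":
            return answer        # early exit keeps the token budget small
    return "UNKNOWN"             # ideally the model declines, per the paper
```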

Result: Experimental results on three open-domain QA datasets show adaptive strategy matches standard prompting performance while using fewer tokens. Analysis reveals LLMs often generate incorrect answers instead of declining when faced with insufficient information.

Conclusion: The adaptive prompting approach effectively addresses the irrelevance problem in expanded contexts, but highlights the need for further research into improving LLMs’ ability to decline requests when information is inadequate.

Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval-augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model’s generation process and degrades its performance. To address this issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs’ ability to effectively decline requests when faced with inadequate information.

[19] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

Main category: cs.CL

TL;DR: Using attention layer token distributions to generate adversarial examples for LLM evaluation tasks, showing performance drops but with grammatical limitations.

Motivation: Recent mechanistic interpretability research suggests intermediate attention layers encode token-level hypotheses that get refined. This property can be exploited to generate adversarial examples directly from model-internal token predictions, creating perturbations that are plausible and consistent with the model's own generation process.

Method: Extract tokens from intermediate attention layers to use as adversarial perturbations. Evaluate on argument quality assessment using ArgQuality dataset with LLaMA-3.1-Instruct-8B as both generator and evaluator. Unlike prompt-based or gradient-based attacks, this approach leverages model-internal token predictions.
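
One plausible way to realize "tokens from intermediate layers" is a logit-lens-style readout: project an intermediate hidden state through the model's output head and keep the top candidates as a substitution pool. The sketch below uses GPT-2 for brevity (the paper uses LLaMA-3.1-Instruct-8B) and reflects our reading, not the authors' released code.

```python
# Logit-lens-style readout of intermediate-layer token hypotheses (sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in for LLaMA-3.1-Instruct-8B
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tok("The quality of this argument is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer = 6                                  # an intermediate layer (assumption)
h = model.transformer.ln_f(out.hidden_states[layer][:, -1])  # final layer norm
candidates = model.lm_head(h).topk(5).indices[0].tolist()
print(tok.convert_ids_to_tokens(candidates))  # substitution pool for attacks
```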

Result: Attention-based adversarial examples cause measurable drops in evaluation performance while maintaining semantic similarity to original inputs. However, substitutions from certain layers and token positions introduce grammatical degradation, limiting practical effectiveness.

Conclusion: Intermediate-layer representations show promise as a principled source of adversarial examples for stress-testing LLM evaluation pipelines, but current limitations exist due to grammatical degradation issues.

Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model’s own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

[20] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

Yukun Zhang, Stefan Elbl Droguett, Samyak Jain

Main category: cs.CL

TL;DR: A multi-retriever RAG system with domain-specific training (SecBERT) outperforms baseline models on financial numerical QA tasks, and prompt-based LLMs achieve SOTA performance but still lag behind human experts.

Motivation: Financial numerical QA tasks are challenging due to the need for both domain-specific financial knowledge and complex multi-step numeric reasoning, which current LLMs lack despite recent advances.

Method: Implemented a multi-retriever RAG system that retrieves both external domain knowledge and internal question contexts, using domain-specific training with SecBERT encoder and leveraging the latest LLMs optimized for few-shot learning.

Result: Domain-specific training with SecBERT helped their best neural symbolic model surpass the FinQA baseline. Their best prompt-based LLM generator achieved SOTA performance with >7% improvement, though still below human expert level. Found trade-offs between hallucinations and external knowledge gains in different model sizes.

Conclusion: Domain-specific training significantly improves financial numerical reasoning, and while LLMs with external knowledge retrieval achieve strong results, there’s still a gap with human expertise. Larger models benefit more from external facts despite hallucination risks.

Abstract: This research project addresses errors in financial numerical reasoning Question Answering (QA) tasks that stem from a lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper’s top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with a significant improvement (>7%), yet it is still below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

[21] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Conrad Borchers, Manit Patel, Seiyon M. Lee, Anthony F. Botelho

Main category: cs.CL

TL;DR: A framework separates content signals from rater tendencies in automated scoring of open-ended responses, using teacher histories as dynamic priors and text embeddings to make grading judgments visible and auditable.

DetailsMotivation: Open-ended responses are crucial for learning assessment, but current automated scoring conflates student content with teacher grading tendencies, making it hard to distinguish what students actually understand from how teachers grade.

Method: Analytics-first framework using de-identified math responses from ASSISTments. Models teacher histories as dynamic priors, derives text representations from sentence embeddings, uses centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify signal contributions, with projection to surface model disagreements for qualitative inspection.
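
A minimal sketch of the two signals on synthetic data follows; the past-only teacher prior and per-prompt residualization are simplifications of the paper's setup, and all data here is random.

```python
# Toy sketch: a teacher's running grade history as a dynamic prior, and
# response embeddings residualized against per-prompt means, combined in
# a temporally-validated linear model. Data and dimensions are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 16
emb = rng.normal(size=(n, d))            # sentence embeddings of responses
prompt_id = rng.integers(0, 5, size=n)   # which prompt each response answers
teacher_id = rng.integers(0, 3, size=n)
grade = rng.integers(0, 2, size=n)       # 0/1 grading labels

# Dynamic teacher prior: mean of that teacher's past grades only.
prior = np.zeros(n)
for t in range(n):
    past = grade[:t][teacher_id[:t] == teacher_id[t]]
    prior[t] = past.mean() if len(past) else 0.5

# Residualize embeddings: remove per-prompt means to mitigate confounds.
for p in np.unique(prompt_id):
    emb[prompt_id == p] -= emb[prompt_id == p].mean(axis=0)

# Temporal validation: train on the first 70%, evaluate on the rest.
X = np.column_stack([prior, emb])
cut = int(0.7 * n)
clf = LogisticRegression(max_iter=1000).fit(X[:cut], grade[:cut])
print("held-out accuracy:", clf.score(X[cut:], grade[cut:]))
```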

Result: Teacher priors heavily influence grade predictions. The best results come from combining priors with content embeddings (AUC 0.815), while content-only models are substantially weaker (AUC 0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding vs. surface-level response differences.

Conclusion: The framework transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align or conflict with evidence of student reasoning and learning, making judgments visible and auditable.

Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC 0.815), while content-only models remain above chance but substantially weaker (AUC 0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.

[22] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

Main category: cs.CL

TL;DR: HGMem introduces a hypergraph-based memory mechanism for multi-step RAG that captures higher-order correlations between facts, improving global reasoning compared to traditional passive memory storage.

DetailsMotivation: Existing RAG memory modules function as passive storage that accumulates isolated facts, overlooking crucial high-order correlations among primitive facts. This static nature limits representational strength and impact on multi-step reasoning, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts.

Method: HGMem represents memory as a hypergraph where hyperedges correspond to distinct memory units, enabling progressive formation of higher-order interactions within memory. This connects facts and thoughts around the focal problem, evolving into an integrated knowledge structure that provides strong propositions for deeper reasoning.
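
A minimal data-structure sketch of the hypergraph view of memory follows; the consolidation policy shown (one hyperedge per reasoning step) is an invented simplification, not HGMem's actual mechanism.

```python
# Minimal hypergraph memory: each hyperedge is a memory unit grouping
# several primitive facts, so higher-order correlations among facts are
# first-class objects rather than isolated entries.
from collections import defaultdict

class HypergraphMemory:
    def __init__(self):
        self._id_by_text: dict[str, int] = {}
        self.hyperedges: dict[str, set[int]] = {}   # memory unit -> fact ids
        self.incidence: defaultdict[int, set[str]] = defaultdict(set)

    def add_unit(self, name: str, facts: list[str]) -> None:
        """Store one memory unit (hyperedge) over possibly-shared facts."""
        ids = {self._id_by_text.setdefault(f, len(self._id_by_text))
               for f in facts}
        self.hyperedges[name] = ids
        for fid in ids:
            self.incidence[fid].add(name)

    def related_units(self, fact: str) -> set[str]:
        """All memory units sharing this fact: higher-order context."""
        return self.incidence.get(self._id_by_text[fact], set())

mem = HypergraphMemory()
mem.add_unit("step1", ["A founded X", "X acquired Y in 2010"])
mem.add_unit("step2", ["X acquired Y in 2010", "Y holds patent P"])
print(mem.related_units("X acquired Y in 2010"))  # {'step1', 'step2'}
```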

Result: Extensive experiments on challenging datasets for global sense-making show that HGMem consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

Conclusion: HGMem extends memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding, addressing limitations of existing passive memory designs in multi-step RAG systems.

Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

[23] Efficient Context Scaling with LongCat ZigZag Attention

Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai

Main category: cs.CL

TL;DR: LoZA is a sparse attention scheme that transforms full-attention models into sparse versions for efficient long-context processing, enabling up to 1M token handling with significant speed-ups.

DetailsMotivation: To enable efficient long-context processing with limited compute budget, addressing the computational challenges of full attention in long-sequence scenarios for tasks like retrieval-augmented generation and tool-integrated reasoning.

Method: LongCat ZigZag Attention (LoZA) - a sparse attention scheme that transforms existing full-attention models into sparse versions. Applied to LongCat-Flash during mid-training to create LongCat-Flash-Exp.

Result: Achieves significant speed-ups in both prefill-intensive and decode-intensive cases. Enables processing up to 1 million tokens efficiently, supporting long-term reasoning and long-horizon agentic capabilities.

Conclusion: LoZA provides an effective sparse attention solution for long-context processing, making large-scale sequence handling computationally feasible while maintaining model capabilities for complex reasoning tasks.

Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

[24] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

Zhiming Lin, Kai Zhao, Sophie Zhang, Peilai Yu, Canran Xiao

Main category: cs.CL

TL;DR: CEC-Zero: A zero-supervision reinforcement learning framework for Chinese spelling correction that enables LLMs to self-correct without labeled data, outperforming supervised methods by 10-13 F1 points.

DetailsMotivation: Large-scale Chinese spelling correction is crucial but existing LLMs and supervised methods lack robustness to novel errors and require expensive annotations. There's a need for a label-free approach that can handle real-world noisy text effectively.

Method: CEC-Zero uses reinforcement learning with zero supervision: synthesizes errorful inputs from clean text, computes cluster-consensus rewards based on semantic similarity and candidate agreement, and optimizes the policy with PPO (Proximal Policy Optimization).
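
A toy sketch of the cluster-consensus reward follows, under stated assumptions: character n-gram cosine stands in for a semantic encoder, candidates are hard-coded where the policy would sample, and the 0.5/0.5 weighting is invented.

```python
# Toy cluster-consensus reward: reward candidates that stay semantically
# close to the input and agree with the other candidates. Character
# n-gram cosine is a stand-in for a real sentence encoder.
from collections import Counter
import math

def ngrams(s: str, n: int = 3) -> Counter:
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) + 1e-9)

noisy = "我今天去公圆散步"                       # synthesized errorful input
candidates = ["我今天去公园散步", "我今天去公园散步", "我今天去公圆散步"]

sims = [cosine(ngrams(noisy), ngrams(c)) for c in candidates]
consensus = [sum(cosine(ngrams(c), ngrams(o)) for o in candidates) / len(candidates)
             for c in candidates]
rewards = [0.5 * s + 0.5 * a for s, a in zip(sims, consensus)]
print(rewards)  # per-candidate rewards that PPO would optimize against
```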

Result: Outperforms supervised baselines by 10-13 F1 points and strong LLM fine-tunes by 5-8 points across 9 benchmarks. Provides theoretical guarantees of unbiased rewards and convergence.

Conclusion: Establishes a label-free paradigm for robust, scalable Chinese spelling correction, unlocking LLM potential in noisy text processing pipelines without requiring costly annotations.

Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10–13 F1 points and strong LLM fine-tunes by 5–8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.

[25] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang

Main category: cs.CL

TL;DR: RISE is an unsupervised framework using sparse auto-encoders to discover interpretable reasoning vectors in LLM activations, enabling analysis and control of reasoning behaviors without human supervision.

DetailsMotivation: Current approaches to understanding LLM reasoning rely on human-defined concepts at the word level, which is limited because it's infeasible to capture all potential reasoning behaviors, many of which are difficult to define in token space.

Method: Propose RISE framework: segment chain-of-thought traces into sentence-level steps, train sparse auto-encoders (SAEs) on step-level activations to discover reasoning vectors (directions in activation space encoding distinct reasoning behaviors).
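
A minimal sparse auto-encoder of the kind described, trained on random stand-in activations; dimensions and the sparsity weight are illustrative.

```python
# Minimal sparse auto-encoder over step-level activations: an
# overcomplete dictionary with an L1 sparsity penalty. Decoder columns
# are the candidate "reasoning vectors".
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model: int = 256, d_dict: int = 1024):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # sparse codes over reasoning features
        return self.dec(z), z

sae = SparseAutoEncoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(64, 256)           # stand-in for step-level activations

for _ in range(100):
    recon, z = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Steering amounts to adding/subtracting columns of sae.dec.weight
# from the residual stream during inference.
```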

Result: SAEs uncover disentangled features corresponding to interpretable behaviors (reflection, backtracking) that occupy separable regions in decoder space. Interventions on SAE-derived vectors can controllably amplify/suppress reasoning behaviors. SAEs also capture structural properties (response length) and enable discovery of novel behaviors beyond human supervision (confidence-related vectors).

Conclusion: Unsupervised latent discovery via SAEs shows strong potential for both interpreting and controllably steering reasoning in LLMs, moving beyond limitations of supervised approaches.

Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level ‘steps’ and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

[26] WISE: Web Information Satire and Fakeness Evaluation

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Main category: cs.CL

TL;DR: The WISE framework benchmarks lightweight transformer models against baselines for distinguishing fake news from satire, finding MiniLM achieves highest accuracy (87.58%) while RoBERTa-base has best ROC-AUC (95.42%), with lightweight models offering practical efficiency-accuracy trade-offs for real-world deployment.

DetailsMotivation: Distinguishing fake news from satire/humor is challenging due to overlapping linguistic features but different intent, requiring effective detection systems that can operate in resource-constrained real-world settings.

Method: Developed WISE framework benchmarking 8 lightweight transformer models against 2 baselines on balanced 20,000-sample dataset from Fakeddit, using stratified 5-fold cross-validation and comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC, MCC, Brier score, Expected Calibration Error).
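
A sketch of the evaluation protocol on synthetic data follows; logistic regression over random features stands in for the fine-tuned transformers, and only three of the nine reported metrics are shown.

```python
# Stratified 5-fold cross-validation with accuracy, ROC-AUC, and Brier
# score, mirroring the benchmark protocol on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
y = rng.integers(0, 2, size=500)       # 0 = satire, 1 = fake news

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(skf.split(X, y)):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    prob = clf.predict_proba(X[te])[:, 1]
    print(f"fold {fold}: acc={accuracy_score(y[te], prob > 0.5):.3f} "
          f"auc={roc_auc_score(y[te], prob):.3f} "
          f"brier={brier_score_loss(y[te], prob):.3f}")
```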

Result: MiniLM achieved highest accuracy (87.58%), RoBERTa-base achieved highest ROC-AUC (95.42%) with strong accuracy (87.36%), DistilBERT offered excellent efficiency-accuracy trade-off (86.28% accuracy, 93.90% ROC-AUC). Statistical tests confirmed significant performance differences between models.

Conclusion: Lightweight transformer models can match or exceed baseline performance for fake news vs satire detection, providing actionable insights for deploying effective misinformation detection systems in resource-constrained real-world environments.

Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.

[27] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

Sijia Chen, Di Niu

Main category: cs.CL

TL;DR: iCLP enables LLMs to generate compact latent plans for step-by-step reasoning, improving accuracy, efficiency, and cross-domain generalization while maintaining interpretability.

DetailsMotivation: Existing explicit textual planning for LLMs suffers from hallucinations and difficulty handling diverse task-specific questions. The paper draws inspiration from human implicit cognition - subconscious decision-making using compact patterns learned from experience without verbalization.

Method: iCLP framework: 1) Distills explicit plans from existing step-by-step reasoning trajectories, 2) Learns discrete representations via vector-quantized autoencoder with codebook, 3) Fine-tunes LLMs on paired latent plans and reasoning steps to enable implicit planning during reasoning.
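
A minimal sketch of the vector-quantization step with a straight-through estimator follows; codebook size and dimensions are assumptions.

```python
# Minimal VQ codebook for latent plans: each plan embedding is snapped
# to its nearest codebook entry, with the straight-through estimator
# letting gradients flow back to the encoder.
import torch
import torch.nn as nn

class VQCodebook(nn.Module):
    def __init__(self, n_codes: int = 64, dim: int = 128):
        super().__init__()
        self.codes = nn.Embedding(n_codes, dim)

    def forward(self, z):
        dists = torch.cdist(z, self.codes.weight)   # (B, n_codes)
        idx = dists.argmin(dim=-1)                  # discrete plan ids
        quantized = self.codes(idx)
        # Straight-through: forward uses quantized, backward uses z.
        return z + (quantized - z).detach(), idx

vq = VQCodebook()
plan_emb = torch.randn(4, 128)    # encoder output for 4 distilled plans
latent_plan, ids = vq(plan_emb)
print(ids)  # discrete tokens an LLM could emit as its plan before reasoning
```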

Result: Experimental results on mathematical reasoning and code generation show LLMs can plan in latent space while reasoning in language space. Significant improvements in accuracy and efficiency, with strong cross-domain generalization while preserving chain-of-thought interpretability.

Conclusion: iCLP successfully enables LLMs to perform implicit planning similar to human subconscious cognition, addressing limitations of explicit textual planning while maintaining the benefits of interpretable step-by-step reasoning.

Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.

[28] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Rohit Kumar Salla, Manoj Saravanan, Shrikar Reddy Kota

Main category: cs.CL

TL;DR: The paper introduces CRS, a unified framework that combines calibration, robustness, and uncertainty quantification into a single interpretable metric to evaluate LLM reliability in decision-critical domains.

DetailsMotivation: LLMs are increasingly used in high-stakes domains like healthcare, law, and finance, but their reliability is uncertain. They make overconfident errors, degrade under input shifts, lack clear uncertainty estimates, and existing evaluations are fragmented, addressing only isolated aspects of reliability.

Method: The Composite Reliability Score (CRS) integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Experiments were conducted on ten leading open-source LLMs across five QA datasets, assessing performance under baselines, perturbations, and calibration methods.
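
The summary does not give the aggregation formula, so the following is a hypothetical composite in CRS's spirit: normalize each dimension so higher is better, then average. The equal weights and the three ingredient metrics shown are assumptions.

```python
# Hypothetical composite reliability score: average of accuracy,
# calibration (1 - ECE), and robustness (1 - accuracy drop under
# perturbation). Weights and ingredients are illustrative assumptions.
def composite_reliability(acc: float, ece: float, robust_drop: float,
                          w=(1 / 3, 1 / 3, 1 / 3)) -> float:
    calibration = 1.0 - ece           # lower ECE is better
    robustness = 1.0 - robust_drop    # smaller drop under shift is better
    return w[0] * acc + w[1] * calibration + w[2] * robustness

# An accurate-but-overconfident model vs. a balanced one:
print(composite_reliability(acc=0.82, ece=0.25, robust_drop=0.20))  # ~0.79
print(composite_reliability(acc=0.78, ece=0.05, robust_drop=0.05))  # ~0.89
```

The second model scores higher despite lower raw accuracy, illustrating the paper's point that dependable systems balance all three dimensions.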

Result: CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and reveals that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

Conclusion: CRS provides a comprehensive framework for evaluating LLM reliability, showing that reliable systems require balanced performance across multiple dimensions rather than just high accuracy.

Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

[29] HY-MT1.5 Technical Report

Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, Di Wang

Main category: cs.CL

TL;DR: HY-MT1.5 models (1.8B and 7B parameters) achieve state-of-the-art machine translation performance, rivaling much larger models and commercial APIs through a comprehensive training framework.

DetailsMotivation: To develop highly parameter-efficient machine translation models that can compete with significantly larger open-source baselines and commercial translation APIs while supporting advanced translation features.

Method: A holistic multi-stage training framework integrating general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning.

Result: HY-MT1.5-1.8B outperforms much larger models (Tower-Plus-72B, Qwen3-32B) and commercial APIs, achieving ~90% of Gemini-3.0-Pro’s performance. HY-MT1.5-7B sets new SOTA for its size class, achieving 95% of Gemini-3.0-Pro’s performance on Flores-200 and surpassing it on WMT25 and Mandarin-minority language benchmarks.

Conclusion: The HY-MT1.5 series provides highly competitive, robust translation solutions with advanced constraint support, demonstrating exceptional parameter efficiency and performance across general and specialized translation tasks.

Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while it marginally trails Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro’s performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

[30] Training a Huggingface Model on AWS Sagemaker (Without Tears)

Liling Tan

Main category: cs.CL

TL;DR: A tutorial paper that centralizes essential information to help researchers train Hugging Face models on AWS SageMaker, addressing the steep learning curve of cloud platforms.

DetailsMotivation: LLM development is dominated by resource-rich groups, forcing many researchers to use cloud services like AWS SageMaker. However, the steep learning curve and fragmented documentation create barriers for researchers accustomed to local environments.

Method: The paper provides a centralized, comprehensive tutorial/demo that guides researchers step-by-step through training their first Hugging Face model on AWS SageMaker from scratch, filling knowledge gaps left by existing documentation.
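
For orientation, here is a minimal SageMaker launch of a Hugging Face training job, the workflow the tutorial covers. The role ARN, S3 paths, instance type, and framework versions are placeholders to adapt (versions must be one of SageMaker's supported combinations), and train.py is your own Trainer-based script.

```python
# Minimal Hugging Face training job on AWS SageMaker. All names, paths,
# and versions below are placeholders to replace with your own.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # your training script
    source_dir="./scripts",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    transformers_version="4.26",       # must be a supported combination
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 3, "model_name": "bert-base-uncased"},
)

# Channels become SM_CHANNEL_TRAIN / SM_CHANNEL_TEST inside the container.
estimator.fit({"train": "s3://my-bucket/train", "test": "s3://my-bucket/test"})
```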

Result: A practical guide that democratizes cloud adoption by providing all essential information in one place, enabling researchers to successfully train Hugging Face models on AWS SageMaker without needing to search fragmented information across the web.

Conclusion: By centralizing cloud training knowledge, this demo paper lowers the barrier to entry for researchers, making cloud-based LLM training more accessible and helping democratize access to computational resources for AI research.

Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.

[31] Activation Steering for Masked Diffusion Language Models

Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid

Main category: cs.CL

TL;DR: MDLMs lack inference-time control; proposed activation-steering framework uses contrastive examples to compute layer-wise steering vectors for efficient attribute modulation without trajectory simulation.

DetailsMotivation: Masked diffusion language models have gained attention for parallel decoding and competitive performance, but lack effective inference-time control and steering mechanisms.

Method: Activation-steering framework computes layer-wise steering vectors from single forward pass using contrastive examples, applied at every reverse-diffusion step without simulating denoising trajectory.
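
A generic activation-steering sketch follows: the steering vector is the mean activation difference between contrastive example pairs, injected via a forward hook. A plain linear layer stands in for a transformer block, the scale is arbitrary, and in the paper the vector is applied at every reverse-diffusion step.

```python
# Generic activation steering: steering vector = mean activation
# difference between contrastive examples, added back via a hook.
import torch
import torch.nn as nn

block = nn.Linear(64, 64)   # stand-in for a transformer sub-module

pos = torch.randn(8, 64)    # activations on one side of the contrast
neg = torch.randn(8, 64)    # activations on the other side
steer = pos.mean(0) - neg.mean(0)
steer = steer / steer.norm()

alpha = 4.0                 # steering strength (arbitrary here)
handle = block.register_forward_hook(lambda m, i, out: out + alpha * steer)

x = torch.randn(2, 64)
steered = block(x)          # hook adds the steering vector to the output
handle.remove()
print((steered - block(x)).norm())   # nonzero: steering was applied
```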

Result: Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining effects across transformer sub-modules and token scope (prompt vs. response).

Conclusion: Proposed framework provides efficient inference-time control for MDLMs, enabling reliable attribute modulation through activation steering without complex trajectory simulation.

Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).

[32] Large Emotional World Model

Changhao Song, Yazhou Zhang, Hui Gao, Chang Yang, Peng Zhang

Main category: cs.CL

TL;DR: The paper proposes LEWM, a Large Emotional World Model that incorporates emotional states alongside visual observations and actions to better predict emotion-driven social behaviors, addressing the limitation of existing LLMs that focus mainly on physical-world regularities.

DetailsMotivation: Existing Large Language Models primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors, despite emotion being a key component of world knowledge that significantly influences human decision-making. The authors demonstrate that removing emotionally relevant information degrades reasoning performance.

Method: The authors propose LEWM (Large Emotional World Model) inspired by theory of mind. They construct the Emotion-Why-How (EWH) dataset that integrates emotion into causal relationships, enabling reasoning about why actions occur and how emotions drive future world states. LEWM explicitly models emotional states alongside visual observations and actions to predict both future states and emotional transitions.

Result: Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

Conclusion: Incorporating emotional modeling into world models is important for better understanding and predicting human social behaviors, and LEWM demonstrates the feasibility and effectiveness of this approach.

Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

[33] Training Report of TeleChat3-MoE

Xinzhang Liu, Chao Wang, Zhihao Yang, Zhuo Jiang, Xuncheng Zhao, Haoran Wang, Lei Li, Dongdong He, Luobin Liu, Kaizhe Yuan, Han Gao, Zihan Wang, Yitong Yao, Sishi Xiong, Wenmin Deng, Haowei He, Kaidong Yu, Yu Zhao, Ruiyu Fang, Yuhao Jiang, Yingyan Li, Xiaohui Hu, Xi Yu, Jingqi Li, Yanwei Liu, Qingli Li, Xinyu Shi, Junhao Niu, Chengnuo Huang, Yao Xiao, Ruiwen Wang, Fengkai Li, Luwen Pu, Kaipeng Jia, Fubei Yao, Yuyao Huang, Xuewei He, Zhuoru Jiang, Ruiting Song, Rui Xue, Qiyi Xie, Jie Zhang, Zilu Huang, Zhaoxi Zhang, Zhilong Lu, Yanhan Zhang, Yin Zhang, Yanlei Xue, Zhu Yuan, Teng Su, Xin Jiang, Shuangyong Song, Yongxiang Li, Xuelong Li

Main category: cs.CL

TL;DR: TeleChat3-MoE is a series of large language models with MoE architecture (105B-1T+ parameters) trained on Ascend NPU clusters, with this report focusing on the training infrastructure enabling reliable scaling to frontier model sizes.

DetailsMotivation: To enable reliable and efficient scaling of large language models to frontier sizes (over 1 trillion parameters) on specialized hardware (Ascend NPU clusters), addressing the infrastructure challenges of training such massive MoE models.

Method: Systematic methodologies for operator-level and end-to-end numerical accuracy verification; performance optimizations including interleaved pipeline scheduling, attention-aware data scheduling for long sequences, hierarchical/overlapped communication for expert parallelism, and DVM-based operator fusion; systematic parallelization framework using analytical estimation and integer linear programming; cluster-level optimizations addressing host- and device-bound bottlenecks.
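
As a toy version of the parallelism-planning idea, here is a brute-force search over (tensor, pipeline, data) parallel degrees with an invented cost model; the paper instead solves an integer linear program over a much richer analytical estimate.

```python
# Toy parallelism planner: enumerate (tp, pp, dp) degrees that tile the
# cluster and pick the plan minimizing an illustrative cost estimate.
from itertools import product

DEVICES = 64

def cost(tp: int, pp: int, dp: int) -> float:
    comm = 1.0 * tp + 0.3 * pp + 0.05 * dp   # rough communication terms
    bubble = (pp - 1) / (pp * 8)             # pipeline bubble, 8 microbatches
    memory = 1.0 / (tp * pp)                 # per-device parameter footprint
    return comm + 10 * bubble + 5 * memory

plans = [(tp, pp, dp) for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
         if tp * pp * dp == DEVICES]
best = min(plans, key=lambda p: cost(*p))
print("best (tp, pp, dp):", best)
```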

Result: Significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

Conclusion: The infrastructure advancements presented enable reliable and efficient scaling to frontier model sizes, demonstrating that large-scale language model development can be successfully achieved on specialized hardware ecosystems through systematic optimization approaches.

Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on Ascend NPU clusters. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

[34] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun, Min Zhu

Main category: cs.CL

TL;DR: MedKGI is a clinical diagnostic framework that addresses LLM limitations in medical reasoning by integrating knowledge graphs for grounding, information gain for efficient questioning, and structured state tracking for coherence.

DetailsMotivation: Current LLMs struggle with clinical diagnosis due to three key limitations: generating hallucinated medical content, asking inefficient questions, and losing coherence in multi-turn dialogues, which prevents them from emulating real clinical diagnostic reasoning.

Method: MedKGI integrates a medical knowledge graph to constrain reasoning to validated ontologies, selects questions based on information gain to maximize diagnostic efficiency, and uses an OSCE-format structured state to maintain consistent evidence tracking across dialogue turns.
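
A toy sketch of information-gain question selection over a diagnosis posterior follows; all probabilities are invented for illustration.

```python
# Toy information-gain inquiry: given a posterior over diagnoses and
# per-question answer likelihoods, ask the question with the largest
# expected entropy reduction. Numbers are illustrative only.
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

posterior = {"flu": 0.5, "covid": 0.3, "cold": 0.2}
# P(answer = yes | disease) for each candidate question:
questions = {
    "fever?":      {"flu": 0.9, "covid": 0.8, "cold": 0.2},
    "loss_smell?": {"flu": 0.1, "covid": 0.7, "cold": 0.1},
}

def expected_entropy(likelihood):
    out = 0.0
    for ans in (True, False):
        joint = {d: posterior[d] * (likelihood[d] if ans else 1 - likelihood[d])
                 for d in posterior}
        p_ans = sum(joint.values())
        if p_ans > 0:
            out += p_ans * entropy([v / p_ans for v in joint.values()])
    return out

h0 = entropy(list(posterior.values()))
gains = {q: h0 - expected_entropy(lk) for q, lk in questions.items()}
print(max(gains, key=gains.get), gains)  # ask the most discriminative question
```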

Result: Experiments show MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.

Conclusion: MedKGI successfully addresses critical limitations of LLMs in clinical diagnosis by grounding reasoning in verified knowledge, optimizing question selection, and maintaining dialogue coherence, making it a promising framework for clinical applications.

Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.

[35] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

May Bashendy, Walid Massoud, Sohaila Eltanbouly, Salam Albatarni, Marwan Sayed, Abrar Abir, Houda Bouamor, Tamer Elsayed

Main category: cs.CL

TL;DR: LAILA is the largest publicly available Arabic Automated Essay Scoring dataset with 7,859 essays annotated across seven dimensions, addressing the lack of Arabic AES resources.

DetailsMotivation: Arabic AES research is limited due to lack of publicly available datasets, creating a need for comprehensive Arabic essay scoring resources to support development of robust scoring systems.

Method: Created LAILA dataset with 7,859 Arabic essays annotated with holistic and trait-specific scores across seven dimensions (relevance, organization, vocabulary, style, development, mechanics, grammar), then benchmarked using state-of-the-art Arabic and English models in both prompt-specific and cross-prompt settings.

Result: LAILA fills critical gap in Arabic AES research by providing largest publicly available dataset, enabling benchmark comparisons and supporting development of Arabic essay scoring systems.

Conclusion: LAILA addresses the scarcity of Arabic AES datasets and will facilitate advancement of Arabic automated essay scoring research and system development.

Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

[36] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning

Michael E. Rose, Mainak Ghosh, Sebastian Erhardt, Cheng Li, Erik Buunk, Dietmar Harhoff

Main category: cs.CL

TL;DR: Pat-SPECTER model (SPECTER2 fine-tuned on patents) outperforms other language similarity models in predicting credible patent-paper citations, and is used to analyze citation patterns across jurisdictions.

DetailsMotivation: Need for a language similarity model that can effectively handle both patents and scientific publications simultaneously, to better understand and predict credible citations between these two document types.

Method: Develop Pat-SPECTER by fine-tuning SPECTER2 model on patent data. Evaluate eight language similarity models in a horse race-style comparison using credible Patent-Paper Citations as prediction target. Test model in two real-world scenarios: separating patent-paper-pairs and predicting patent-paper-pairs.
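
A hedged sketch of scoring patent-paper similarity with a SPECTER-family encoder follows. Note that allenai/specter2_base is the public base checkpoint, not the paper's Pat-SPECTER fine-tune (which would be substituted in if available), and CLS pooling follows SPECTER's convention.

```python
# Sketch: embed a patent and a paper with a SPECTER-family encoder and
# use cosine similarity as a credibility score for a citation link.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoModel.from_pretrained("allenai/specter2_base")
model.eval()

patent = "A battery electrode comprising a silicon nanowire anode..."
paper = "High-performance lithium battery anodes using silicon nanowires."

with torch.no_grad():
    enc = tok([patent, paper], padding=True, truncation=True,
              return_tensors="pt")
    emb = model(**enc).last_hidden_state[:, 0]   # CLS pooling

sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(float(sim))   # candidate score for a credible patent-paper citation
```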

Result: Pat-SPECTER performs best among eight tested models. The model successfully demonstrates capabilities in real-world scenarios. Analysis reveals US patents cite papers that are semantically less similar than in other large jurisdictions, potentially due to duty of candor requirements.

Conclusion: Pat-SPECTER is an effective language similarity model for patent-paper analysis, outperforming existing models. The open-source model provides valuable tool for academic and practical applications, and reveals interesting jurisdictional differences in citation patterns.

Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse-race-style evaluation, we task eight language (similarity) models with predicting credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.

[37] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen

Main category: cs.CL

TL;DR: DATAMASK is an efficient joint learning framework for large-scale pre-training data selection that simultaneously optimizes quality and diversity metrics, achieving significant performance improvements with only 10% of data.

DetailsMotivation: Current data selection methods for trillion-scale pre-training datasets either focus on quality metrics (which show diminishing returns) or diversity metrics (which remove valuable high-quality samples), limiting LLM capabilities. There's a need for an efficient method that jointly optimizes both metrics.

Method: DATAMASK treats data selection as a mask learning problem using policy gradient optimization. It involves iterative sampling of data masks, computing policy gradients based on predefined objectives, and updating mask sampling logits with various acceleration enhancements.
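
A minimal REINFORCE sketch of mask learning follows; the toy reward (quality of kept samples plus diversity coverage and a ~10% budget term) stands in for the paper's predefined objectives.

```python
# Toy policy-gradient mask learning: per-sample Bernoulli logits updated
# with REINFORCE so sampled masks maximize a joint quality + diversity
# objective under a selection budget.
import torch

torch.manual_seed(0)
n = 1000
quality = torch.rand(n)                  # per-sample quality scores
group = torch.randint(0, 10, (n,))       # per-sample "topic" for diversity
logits = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def reward(mask):
    kept = mask.bool()
    if kept.sum() == 0:
        return torch.tensor(0.0)
    div = torch.unique(group[kept]).numel() / 10.0   # diversity coverage
    budget = -((mask.mean() - 0.1) ** 2) * 10        # aim for ~10% kept
    return quality[kept].mean() + div + budget

for step in range(200):
    dist = torch.distributions.Bernoulli(logits=logits)
    mask = dist.sample()
    r = reward(mask)
    loss = -(r.detach() * dist.log_prob(mask).sum())  # REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()

print("kept fraction:", (torch.sigmoid(logits) > 0.5).float().mean().item())
```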

Result: DATAMASK reduces selection time by 98.9% compared to greedy algorithms. Using only 10% of the 15 trillion-token FineWeb dataset (FineWeb-Mask), it achieves 3.2% improvement on a 1.5B dense model and 1.9% on a 7B MoE model across 12 diverse tasks.

Conclusion: DATAMASK enables efficient joint optimization of quality and diversity metrics for trillion-scale data selection, overcoming limitations of single-metric approaches and significantly improving model performance with much smaller datasets.

Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs’ capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.

[38] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Jonathan Schmoll, Adam Jatowt

Main category: cs.CL

TL;DR: LLMs show limited success in EU Taxonomy compliance automation, performing moderately on qualitative tasks but failing on quantitative KPI prediction, with dataset and benchmarks provided for future research.

DetailsMotivation: The EU Taxonomy compliance process is manual and resource-intensive, but LLM automation research is hindered by lack of public benchmark datasets for systematic evaluation.

Method: Created a structured dataset from 190 corporate reports with ground-truth economic activities and KPIs, then conducted systematic evaluation of LLMs on core compliance workflow including qualitative activity identification and quantitative KPI prediction.

Result: LLMs show moderate success in qualitative economic activity identification (improved with multi-step agentic framework) but comprehensively fail at quantitative KPI prediction in zero-shot settings. Paradoxically, concise metadata often outperforms full unstructured reports, and model confidence scores are poorly calibrated.

Conclusion: LLMs are not ready for full automation of EU Taxonomy compliance but can serve as powerful assistive tools for human experts. The dataset provides a public benchmark for future research.

Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.

[39] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Meiqi Chen, Fandong Meng, Jie Zhou

Main category: cs.CL

TL;DR: FIGR integrates visual thinking into reasoning via reinforcement learning, constructing visual representations during problem solving to handle implicit spatial/structural relationships that text alone struggles with.

DetailsMotivation: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. Text-based reasoning struggles to represent global structural constraints in complex settings.

Method: FIGR integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. It externalizes intermediate structural hypotheses by constructing visual representations during problem solving, and adaptively regulates when and how visual reasoning should be invoked.

Result: FIGR outperforms strong text-only chain-of-thought baselines on challenging mathematical reasoning benchmarks. It improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME.

Conclusion: Figure-guided multimodal reasoning enhances the stability and reliability of complex reasoning by enabling more stable and coherent reasoning over global structural properties that are difficult to capture from text alone.

Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

[40] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

Shupeng Li, Weipeng Lu, Linyun Liu, Chen Lin, Shaofei Li, Zhendong Tan, Hanjun Zhong, Yucheng Zeng, Chenghao Zhu, Mengyue Liu, Daxiang Dong, Jianmin Wu, Yunting Xiao, Annan Li, Danyu Liu, Jingnan Zhang, Licen Liu, Dawei Yin, Dou Shen

Main category: cs.CL

TL;DR: QianfanHuijin is a financial LLM using multi-stage training with continual pre-training on financial corpora followed by progressive post-training (SFT, reasoning RL, agentic RL, general RL) to enhance both domain knowledge and reasoning/agentic capabilities.

DetailsMotivation: Previous financial LLMs focused mainly on knowledge enhancement, but increasing complexity of financial services requires models with robust financial reasoning and agentic capabilities alongside domain knowledge.

Method: Multi-stage training paradigm: 1) Continual Pre-training on financial corpora, 2) Progressive Post-training pipeline with Financial SFT → Finance Reasoning RL → Finance Agentic RL → General RL aligned with real-world business scenarios.

Result: QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Ablation studies confirm Reasoning RL and Agentic RL stages yield significant gains in respective capabilities.

Conclusion: The fine-grained, progressive post-training methodology is validated and poised to become a mainstream paradigm for various industrial-enhanced LLMs, addressing the need for both knowledge and reasoning/agentic capabilities in financial services.

Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.

[41] World model inspired sarcasm reasoning with large language model agents

Keito Inoshita, Shinnosuke Mizuno

Main category: cs.CL

TL;DR: WM-SAR reformulates sarcasm understanding as world model reasoning, decomposing literal meaning, context, normative expectation, and intention into specialized LLM agents, then computes inconsistency and intention scores for interpretable sarcasm detection.

DetailsMotivation: Existing sarcasm detection approaches rely on black-box predictions of single models, lacking structural explanations of cognitive factors. Sarcasm often emerges from mismatches between semantic evaluation and normative expectations/intentions, but frameworks explicitly modeling these components are limited.

Method: Proposes World Model inspired SArcasm Reasoning (WM-SAR) with specialized LLM-based agents for literal meaning, context, normative expectation, and intention. Quantifies discrepancy between literal evaluation and normative expectation as deterministic inconsistency score, combines with intention score, and integrates via lightweight Logistic Regression for final sarcasm probability.
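
A toy sketch of the decision layer follows: agent outputs are hard-coded stand-ins on a [-1, 1] scale, the inconsistency score is the literal/normative gap, and a logistic regression combines it with the intention score.

```python
# Toy WM-SAR-style decision layer: |literal evaluation - normative
# expectation| as a deterministic inconsistency score, combined with an
# intention score via logistic regression. Agent outputs are hard-coded
# stand-ins for the specialized LLM agents.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (literal evaluation, normative expectation, intention score, is_sarcastic)
data = np.array([
    [ 0.9, -0.8, 0.9, 1],    # "Great, another Monday!" after bad news
    [ 0.8,  0.7, 0.1, 0],    # sincere praise
    [-0.7, -0.6, 0.2, 0],    # sincere complaint
    [ 0.8, -0.9, 0.8, 1],
])
inconsistency = np.abs(data[:, 0] - data[:, 1])   # |literal - normative|
X = np.column_stack([inconsistency, data[:, 2]])
y = data[:, 3]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.7, 0.85]])[:, 1])   # sarcasm probability
```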

Result: WM-SAR consistently outperforms existing deep learning and LLM-based methods on representative sarcasm detection benchmarks. Ablation studies show integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.

Conclusion: The world model inspired reasoning approach successfully decomposes sarcasm understanding into interpretable components while leveraging LLM reasoning capabilities. The framework achieves state-of-the-art performance with high interpretability through explicit modeling of semantic inconsistency and intention reasoning.

Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker’s intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.

[42] Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro, Zied Bouraoui

Main category: cs.CL

TL;DR: A self-supervised contrastive learning framework for long document representation that mimics human skimming by masking sections and using NLI-based contrastive objectives to align relevant parts while distancing unrelated ones.

DetailsMotivation: Existing transformer models struggle with long document representation in specialized fields like law and medicine. Sparse attention is resource-intensive, hierarchical transformers lack clear inter-section relationships, while humans effectively skim and synthesize information from important sections.

Method: Proposes a self-supervised contrastive learning framework that randomly masks document sections and uses natural language inference (NLI)-based contrastive objectives to align masked sections with relevant parts while distancing them from unrelated sections, mimicking human skimming behavior.
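
As a rough illustration of the objective (the paper specifies only an NLI-based contrastive objective, so the InfoNCE form, temperature, and section embeddings below are our assumptions), the loss over a masked section might look like this:

```python
import torch
import torch.nn.functional as F

def skim_contrastive_loss(masked_vec, positive_vecs, negative_vecs, tau=0.07):
    """InfoNCE-style objective: pull the masked section's embedding toward
    sections judged related (e.g., by NLI), push it from unrelated ones."""
    masked_vec = F.normalize(masked_vec, dim=-1)
    pos = F.normalize(positive_vecs, dim=-1)
    neg = F.normalize(negative_vecs, dim=-1)
    logits = torch.cat([masked_vec @ pos.T, masked_vec @ neg.T], dim=-1) / tau
    log_probs = F.log_softmax(logits, dim=-1)
    # Treat each positive as a correct "class" and average their losses.
    return -log_probs[:, : pos.size(0)].mean()

# Toy usage with random vectors standing in for encoder outputs.
torch.manual_seed(0)
masked = torch.randn(1, 128)
positives = torch.randn(3, 128)   # NLI-related sections
negatives = torch.randn(5, 128)   # unrelated sections
print(skim_contrastive_loss(masked, positives, negatives))
```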

Result: Experiments on legal and biomedical texts show significant improvements in both accuracy and computational efficiency compared to existing approaches.

Conclusion: The human-inspired skimming approach to document representation through contrastive learning with NLI objectives produces richer, more efficient representations for long documents in specialized domains.

Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.

[43] Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Chester Palen-Michel, Constantine Lignos

Main category: cs.CL

TL;DR: This paper compares various approaches to automatic text summarization for less-resourced languages, finding that fine-tuned mT5 outperforms most methods including zero-shot LLMs, and that LLM-as-judge evaluation may be unreliable for these languages.

DetailsMotivation: While automatic text summarization has achieved high performance in high-resourced languages like English, there has been comparatively less attention given to summarization in less-resourced languages, creating a research gap that needs to be addressed.

Method: The study compares multiple approaches: 1) zero-shot prompting of various sized LLMs, 2) fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer, and 3) an LLM translation pipeline (translate to English → summarize → translate back). Evaluation uses five different metrics.
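
The translation pipeline in approach 3 is straightforward to pin down. A minimal sketch with the three stages injected as callables (any MT system and English summarizer could be plugged in; the lambdas below are dummies so the example runs):

```python
from typing import Callable

def pivot_summarize(
    text: str,
    to_english: Callable[[str], str],
    summarize_english: Callable[[str], str],
    from_english: Callable[[str], str],
) -> str:
    """Translate -> summarize in English -> translate back."""
    return from_english(summarize_english(to_english(text)))

# Dummy stages; real systems would call MT models and an LLM here.
print(pivot_summarize(
    "<source-language document>",
    to_english=lambda t: f"EN({t})",
    summarize_english=lambda t: f"SUM({t})",
    from_english=lambda t: f"SRC({t})",
))
```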

Result: Key findings: 1) Variation exists across LLMs in performance across similar parameter sizes, 2) Multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, 3) LLM-as-judge evaluation may be less reliable on less-resourced languages.

Conclusion: For less-resourced language summarization, fine-tuned multilingual models like mT5 are more effective than zero-shot LLM approaches, and careful consideration of evaluation methods is needed as LLM-as-judge may not be reliable for these languages.

Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.

[44] Cleaning English Abstracts of Scientific Publications

Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt

Main category: cs.CL

TL;DR: A language model for cleaning scientific abstracts by removing extraneous content like copyright statements and metadata to improve downstream text analysis.

DetailsMotivation: Scientific abstracts often contain extraneous information (copyright statements, section headings, author notes, metadata) that distorts downstream analyses like document similarity and textual embeddings.

Method: Developed an open-source, easy-to-integrate language model designed to automatically identify and remove clutter from English-language scientific abstracts.

Result: The model is conservative and precise, alters similarity rankings of cleaned abstracts, and improves information content of standard-length embeddings.

Conclusion: The introduced language model effectively cleans scientific abstracts, enhancing the quality of text-based analyses by removing distorting extraneous content.

Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata, that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves information content of standard-length embeddings.

[45] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback

Titas Ramancauskas, Kotryna Ramancauske

Main category: cs.CL

TL;DR: A revision platform for IELTS writing exam preparation using transformer-based AES with adaptive feedback shows significant score improvements, though best as supplement to human instruction.

DetailsMotivation: Traditional IELTS preparation lacks personalized feedback tailored to the IELTS writing rubric, creating a need for automated systems that can provide targeted, rubric-aligned feedback.

Method: Design-Based Research (DBR) approach with iterative cycles: early rule-based systems, then DistilBERT transformer model with regression head for scoring, followed by adaptive feedback implementation. Platform separates conversational guidance from writing interface to reduce cognitive load.
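
A minimal sketch of the Cycle-4 scorer, assuming [CLS]-position pooling and a single linear head (the paper states only "DistilBERT with a regression head"; the pooling choice and head shape are our guesses):

```python
import torch
from transformers import AutoModel, AutoTokenizer

class EssayScorer(torch.nn.Module):
    """DistilBERT encoder with a scalar regression head for band scores."""
    def __init__(self, name="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)  # pool first token, predict score

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = EssayScorer()
batch = tok(["The graph illustrates trends in ..."], return_tensors="pt",
            truncation=True, padding=True)
print(model(batch["input_ids"], batch["attention_mask"]))
```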

Result: Transformer model achieved MAE of 0.66 and positive R². Adaptive feedback showed statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen’s d = 0.504). Conservative surface-level corrections more reliable than aggressive structural interventions.

Conclusion: Automated feedback functions best as a supplement to human instruction. Challenges remain in assessing higher-band essays. Future work needs longitudinal studies with real IELTS candidates and validation from official examiners.

Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback catered to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative, Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to transformer-based scoring with a regression head, augmented with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5’s adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen’s d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are most suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.

[46] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Fabian Retkowski, Alexander Waibel

Main category: cs.CL

TL;DR: This paper introduces paragraph segmentation for speech transcripts, creates two benchmarks (TEDPara and YTSegPara), proposes a constrained-decoding method for LLMs to insert paragraph breaks while preserving original text, and develops MiniSeg model that achieves SOTA accuracy and hierarchical chapter/paragraph prediction.

DetailsMotivation: Automatic speech transcripts are unstructured word streams that hinder readability and repurposing. Paragraph segmentation is missing as a structuring step in speech processing, and there's a lack of robust benchmarks for this task in the speech domain.

Method: 1) Created TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as first benchmarks for paragraph segmentation in speech. 2) Proposed constrained-decoding formulation that enables large language models to insert paragraph breaks while preserving original transcripts for faithful evaluation. 3) Developed MiniSeg, a compact model that achieves state-of-the-art accuracy and can be extended hierarchically to jointly predict chapters and paragraphs.
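
The constrained-decoding idea reduces to one binary choice per sentence boundary: the transcript tokens are forced, and the model's only freedom is whether to emit a paragraph break. A sketch with the LLM's break score abstracted into a callable (the interface and the toy scorer are assumptions, not the paper's implementation):

```python
def insert_breaks(sentences, break_score, threshold=0.0):
    """Greedy constrained segmentation: the text is preserved verbatim and
    `break_score(prefix, next_sentence)` stands in for the LLM's log-odds
    of a break token at that boundary."""
    paragraphs, current = [], [sentences[0]]
    for nxt in sentences[1:]:
        if break_score(" ".join(current), nxt) > threshold:
            paragraphs.append(" ".join(current))
            current = [nxt]
        else:
            current.append(nxt)
    paragraphs.append(" ".join(current))
    return paragraphs

# Toy scorer: break when the next sentence opens with a discourse marker.
toy = lambda prefix, nxt: 1.0 if nxt.startswith(("So", "Now", "Next")) else -1.0
print(insert_breaks(
    ["We trained the model.", "It converged quickly.",
     "Now let us look at evaluation.", "We report accuracy."], toy))
```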

Result: Established first benchmarks for paragraph segmentation in speech domain, demonstrated that MiniSeg attains state-of-the-art accuracy, and showed hierarchical extension can jointly predict chapters and paragraphs with minimal computational cost.

Conclusion: The paper establishes paragraph segmentation as a standardized, practical task in speech processing through new benchmarks, constrained-decoding methods, and efficient models, bridging the gap between speech processing and text segmentation.

Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.

[47] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said, Muhammad Sammani Sani

Main category: cs.CL

TL;DR: LLM safety alignment doesn’t transfer zero-shot across languages; HausaSafety dataset reveals complex interference patterns where safety depends on language-temporal variable intersections, not simple degradation in low-resource settings.

DetailsMotivation: The dangerous assumption that safety alignment transfers zero-shot from English to other languages in LLMs integrated into critical global infrastructure, particularly exposing Global South users to localized harms.

Method: Systematic audit of GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus using HausaSafety dataset (West African threat scenarios) with 2x4 factorial design across 1,440 evaluations testing language (English vs. Hausa) and temporal framing interactions.

Result: Found Complex Interference mechanism instead of simple degradation; Reverse Linguistic effect (Claude safer in Hausa than English); profound Temporal Asymmetry (past-tense bypasses defenses, future-triggers hyper-conservative refusals); 9.2x disparity between safest/most vulnerable configurations.

Conclusion: Current models rely on superficial heuristics creating Safety Pockets, leaving Global South users exposed; propose Invariant Alignment paradigm shift for safety stability across linguistic and temporal shifts.

Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic Effect, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[48] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Chaodong Tong, Qi Zhang, Jiayang Gao, Lei Jiang, Yanbing Liu, Nannan Sun

Main category: cs.CL

TL;DR: HaluNet is a lightweight neural framework that integrates token-level probability uncertainty with semantic representation uncertainty for efficient hallucination detection in LLM-based QA systems.

DetailsMotivation: LLMs often generate hallucinations (factual errors or fabricated content) in QA tasks. Existing hallucination detection methods focus on single uncertainty types and overlook the complementarity between token-level probability uncertainty and internal semantic representation uncertainty.

Method: HaluNet uses a multi-branch neural architecture that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. It adaptively fuses model knowledge with output uncertainty for efficient one-pass hallucination detection.
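
A hedged sketch of the fusion idea: one branch encodes the answer's semantic embedding, another encodes token-level uncertainty statistics, and the two are fused into a hallucination probability (branch widths, the chosen statistics, and the sigmoid output are illustrative assumptions, not HaluNet's exact architecture):

```python
import torch

class UncertaintyFusion(torch.nn.Module):
    def __init__(self, emb_dim=768, stat_dim=3, hidden=64):
        super().__init__()
        self.sem = torch.nn.Sequential(torch.nn.Linear(emb_dim, hidden), torch.nn.ReLU())
        self.unc = torch.nn.Sequential(torch.nn.Linear(stat_dim, hidden), torch.nn.ReLU())
        self.out = torch.nn.Linear(2 * hidden, 1)

    def forward(self, embedding, token_logprobs):
        # Simple token-level uncertainty statistics: mean and worst-case
        # log-probability, plus mean negative log-probability.
        stats = torch.stack([token_logprobs.mean(-1),
                             token_logprobs.min(-1).values,
                             (-token_logprobs).mean(-1)], dim=-1)
        fused = torch.cat([self.sem(embedding), self.unc(stats)], dim=-1)
        return torch.sigmoid(self.out(fused))  # hallucination probability

model = UncertaintyFusion()
emb = torch.randn(2, 768)                     # answer embeddings
logp = torch.log(torch.rand(2, 12) + 1e-3)    # per-token log-probabilities
print(model(emb, logp))
```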

Result: Experiments on SQuAD, TriviaQA, and Natural Questions show HaluNet delivers strong detection performance and favorable computational efficiency, working with or without access to context.

Conclusion: HaluNet demonstrates potential for real-time hallucination detection in LLM-based QA systems by effectively leveraging complementary uncertainty sources for scalable, resource-independent detection.

Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present \textbf{HaluNet}, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.

[49] KCL: Korean Canonical Legal Benchmark

Hongseok Oh, Wonseok Hwang, Kyoung-Woon On

Main category: cs.CL

TL;DR: KCL is a Korean legal reasoning benchmark that separates reasoning ability from legal knowledge by providing question-level precedents, with MCQA and essay components showing large performance gaps.

DetailsMotivation: To create a benchmark that assesses language models' legal reasoning capabilities independently of domain-specific legal knowledge, enabling more faithful evaluation of pure reasoning ability.

Method: Developed KCL with two components: KCL-MCQA (283 multiple-choice questions with 1,103 precedents) and KCL-Essay (169 open-ended questions with 550 precedents and 2,739 rubrics for automated evaluation).

Result: Evaluation of 30+ models shows large remaining performance gaps, especially in KCL-Essay, with reasoning-specialized models consistently outperforming general-purpose models.

Conclusion: KCL successfully disentangles legal reasoning from knowledge, revealing significant challenges in legal reasoning for current language models, with resources publicly available for further research.

Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models’ legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.

[50] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang, Pragaash Ponnusamy, Harikaran Subbaraj, Jue Wang, Shuaiwen Leon Song, Ben Athiwaratkun

Main category: cs.CL

TL;DR: CREST is a training-free method that steers LLM reasoning by identifying and suppressing inefficient cognitive behaviors in attention heads, improving accuracy while reducing token usage.

DetailsMotivation: Current LLMs using chain-of-thought reasoning suffer from inefficiency (high latency from excessive tokens) and instability (alternating between underthinking and overthinking), leading to unreliable and computationally expensive reasoning.

Method: CREST has two components: (1) offline calibration to identify cognitive attention heads correlated with specific reasoning behaviors (verification, backtracking) and derive head-specific steering vectors, and (2) inference-time procedure that rotates hidden representations to suppress components along those vectors, adaptively steering away from unproductive reasoning.
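
The intervention itself is geometric: given a calibrated steering vector for a cognitive head, suppress the hidden state's component along it. The paper describes a rotation; the sketch below uses the simpler projection-removal form to convey the idea (the vector, shapes, and data are made up):

```python
import torch

def suppress_direction(hidden, steer_vec):
    """Remove the component of each hidden state that lies along a
    head-specific steering vector (assumed found by offline calibration)."""
    v = steer_vec / steer_vec.norm()
    coeff = hidden @ v                       # projection onto the direction
    return hidden - coeff.unsqueeze(-1) * v  # component along v removed

torch.manual_seed(0)
h = torch.randn(4, 256)   # hidden states at a "cognitive" attention head
v = torch.randn(256)      # steering vector from the calibration step
h_steered = suppress_direction(h, v)
print((h_steered @ (v / v.norm())).abs().max())  # ~0: direction suppressed
```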

Result: Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, demonstrating both higher accuracy and lower computational cost.

Conclusion: CREST offers a simple, effective, training-free pathway to faster and more reliable LLM reasoning by steering cognitive behaviors at test-time through targeted interventions on specialized attention heads.

Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.

[51] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan

Main category: cs.CL

TL;DR: Youtu-LLM is a 1.96B parameter lightweight language model pre-trained from scratch with native agentic intelligence, featuring long-context support and a progressive training curriculum that achieves SOTA performance for sub-2B models.

DetailsMotivation: To create a lightweight language model that doesn't rely on distillation but instead develops intrinsic reasoning and planning capabilities from scratch, addressing the gap where small models typically lack native agentic intelligence.

Method: Three key technical advancements: 1) Compact MLA architecture with STEM-oriented vocabulary supporting 128k context window, 2) Multi-stage training curriculum shifting from commonsense to STEM to agentic tasks using 11T tokens, 3) Scalable agentic mid-training with diverse data construction for math, coding, and tool-use trajectories.

Result: Youtu-LLM sets new SOTA for sub-2B LLMs, achieving competitive performance on general benchmarks against larger models and significantly surpassing existing baselines on agent-specific tasks.

Conclusion: Lightweight models can possess strong intrinsic agentic capabilities when properly designed and trained with systematic curriculum and architectural innovations, challenging the assumption that small models must rely on distillation or lack native reasoning abilities.

Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled “Commonsense-STEM-Agent” Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

[52] Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan, Sid Black, Oliver Sourbut

Main category: cs.CL

TL;DR: LLMs are overconfident in predicting their own task success, with overconfidence worsening during multi-step tasks, but some can learn from failure experiences to improve decision-making.

DetailsMotivation: To investigate whether LLMs can accurately predict their own success on tasks, whether these predictions improve during multi-step tasks, and whether they can learn from in-context failure experiences to make better decisions in costly-failure scenarios.

Method: Tested multiple LLMs on their ability to predict task success, examined how predictions change during multi-step agentic tasks, and evaluated whether in-context failure experiences improve decision-making about pursuing costly tasks.
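
Two summary statistics commonly used for such audits are overconfidence (mean stated success probability minus realized success rate) and discriminatory power (AUROC of stated probabilities against outcomes). A small sketch of both with made-up numbers (these are standard metric choices on our part, not necessarily the paper's exact ones):

```python
import numpy as np

def overconfidence_and_auc(pred_success, actual):
    pred = np.asarray(pred_success, dtype=float)
    actual = np.asarray(actual, dtype=int)
    overconf = pred.mean() - actual.mean()
    pos, neg = pred[actual == 1], pred[actual == 0]
    # AUROC: probability a random success outscored a random failure.
    auc = (pos[:, None] > neg[None, :]).mean() \
        + 0.5 * (pos[:, None] == neg[None, :]).mean()
    return overconf, auc

p = [0.9, 0.8, 0.85, 0.7, 0.95]   # model's stated success probabilities
y = [1, 0, 1, 0, 0]               # realized task outcomes
print(overconfidence_and_auc(p, y))  # overconfident, yet above-random AUC
```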

Result: All LLMs were overconfident but had better-than-random discriminatory power. Newer/larger models didn’t generally have better discriminatory power (except Claude). Overconfidence worsened during multi-step tasks, and reasoning models performed similarly or worse than non-reasoning ones. Some LLMs reduced overconfidence after failure experiences, improving decision-making, while others didn’t. All LLMs made rational decisions given their estimated probabilities, but poor calibration led to bad decisions.

Conclusion: Current LLM agents lack awareness of their own capabilities, hindering their performance. This has implications for AI misuse and misalignment risks, suggesting that improving LLMs’ self-awareness could enhance their reliability and safety.

Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs’ decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs’ awareness of their capabilities for AI misuse and misalignment risks.

[53] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Maoyuan Li, Zhongsheng Wang, Haoyuan Li, Jiamou Liu

Main category: cs.CL

TL;DR: R-Debater is an agentic framework for multi-turn debates that uses argumentative memory to recall and adapt prior arguments, achieving better performance than LLM baselines in both single-turn and multi-turn debate tasks.

DetailsMotivation: The paper addresses the need for more consistent, evidence-based, and coherent multi-turn debates in AI systems. Current approaches often lack stance consistency, proper evidence use, and coherent argument adaptation across turns.

Method: R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances. It’s grounded in rhetoric and memory studies, viewing debate as recalling and adapting prior arguments.

Result: R-Debater achieves higher scores than strong LLM baselines on both next-utterance generation (measured by InspireScore) and adversarial multi-turn simulations (measured by Debatrix). Human evaluation with 20 experienced debaters confirms its consistency and evidence use.

Conclusion: Combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns, demonstrating the value of argumentative memory in debate systems.

Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

[54] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li, Shujian Zhang, Wenxuan Zhou, John Lambert, Chi Jin, Andrew Hard, Rajiv Mathews, Lun Wang

Main category: cs.CL

TL;DR: MUSIC is an unsupervised data augmentation method that creates multi-turn contrastive conversation pairs to train better multi-turn reward models, outperforming baselines while maintaining single-turn performance.

DetailsMotivation: Current multi-turn conversation evaluation relies heavily on costly human assessment. While multi-turn reward models offer a scalable alternative, existing training data (typically contrasting only final-turn responses) fails to capture the nuances of multi-turn interactions, limiting evaluation quality.

Method: Proposes MUSIC (Multi-Step Instruction Contrast), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs with differences spanning multiple turns. Applied to the Skywork preference dataset to train a multi-turn RM based on Gemma-2-9B-Instruct.
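
The key move is that the rejected conversation differs from the chosen one at several turns rather than only the last. A toy sketch of such pair synthesis, with the degradation step abstracted into a `corrupt` callable (the truncation lambda is a dummy stand-in for however MUSIC actually degrades turns):

```python
import random

def music_style_pair(conversation, corrupt, n_turns=2, seed=0):
    """Build a (chosen, rejected) preference pair whose contrast spans
    multiple assistant turns, not just the final one."""
    rng = random.Random(seed)
    asst = [i for i, (role, _) in enumerate(conversation) if role == "assistant"]
    targets = set(rng.sample(asst, min(n_turns, len(asst))))
    rejected = [(role, corrupt(text) if i in targets else text)
                for i, (role, text) in enumerate(conversation)]
    return conversation, rejected

conv = [("user", "Plan a 3-day trip."), ("assistant", "Day 1: ..."),
        ("user", "Add a budget."), ("assistant", "Budget: ...")]
chosen, rejected = music_style_pair(conv, corrupt=lambda t: t[: len(t) // 2])
print(rejected)
```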

Result: The MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, without compromising performance on standard single-turn RM benchmarks.

Conclusion: Incorporating multi-turn contrasts is critical for building robust multi-turn reward models. MUSIC provides an effective unsupervised approach to enhance multi-turn evaluation capabilities while maintaining single-turn performance.

Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn \textit{training} techniques, effective automated \textit{evaluation} specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning \textit{multiple} turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose \textbf{MU}lti-\textbf{S}tep \textbf{I}nstruction \textbf{C}ontrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

[55] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

Sibo Wei, Peng Chen, Lifeng Dong, Yin Luo, Lei Wang, Peng Zhang, Wenpeng Lu, Jianbin Guo, Hongjun Yang, Dajun Zeng

Main category: cs.CL

TL;DR: BIOME-Bench: A new benchmark for evaluating LLMs on multi-omics pathway analysis tasks, revealing current models’ limitations in biomolecular interaction inference and pathway mechanism elucidation.

DetailsMotivation: Existing pathway enrichment methods have structural limitations (curation lag, redundancy, limited sensitivity), and while LLMs show promise for improving interpretation, there's no standardized benchmark for evaluating end-to-end multi-omics pathway mechanism elucidation, hindering reproducible progress.

Method: Developed BIOME-Bench through a rigorous four-stage workflow to evaluate two core LLM capabilities: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. Created evaluation protocols and conducted comprehensive experiments across multiple contemporary models.

Result: Existing models show substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and generate faithful, robust pathway-level mechanistic explanations.

Conclusion: BIOME-Bench addresses the critical need for standardized evaluation in LLM-based multi-omics analysis, revealing significant gaps in current models’ capabilities and providing a foundation for future improvements in pathway mechanism elucidation.

Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

[56] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection

Mohammad Zia Ur Rehman, Velpuru Navya, Sanskar, Shuja Uddin Qureshi, Nagendra Kumar

Main category: cs.CL

TL;DR: Semi-SMDNet: A semi-supervised multilingual depression detection framework using teacher-student pseudo-labeling, ensemble learning, and data augmentation to overcome language style variations and limited annotated data.

DetailsMotivation: Depression detection from social media text is challenging due to different language styles, informal expressions, and lack of annotated data in many languages, creating a need for scalable cross-language solutions with limited labeled resources.

Method: Proposes Semi-SMDNet with teacher-student pseudo-labeling, ensemble learning (multiple teacher models with soft voting), uncertainty-based threshold filtering for low-confidence pseudo-labels, confidence-weighted training focusing on reliable samples, and data augmentation.
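
A compact sketch of the pseudo-labelling core: soft-vote the teacher ensemble, drop low-confidence labels via a threshold, and keep the confidence as a per-sample training weight (the threshold and the toy probabilities are illustrative):

```python
import numpy as np

def pseudo_label(teacher_probs, threshold=0.8):
    """teacher_probs: array of shape (teachers, samples, classes)."""
    avg = teacher_probs.mean(axis=0)   # soft voting across teachers
    conf = avg.max(axis=1)             # ensemble confidence per sample
    labels = avg.argmax(axis=1)
    keep = conf >= threshold           # uncertainty-based filtering
    return labels[keep], conf[keep], keep

# Three hypothetical teachers scoring four unlabelled posts (2 classes).
probs = np.array([
    [[0.90, 0.10], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]],
    [[0.80, 0.20], [0.5, 0.5], [0.1, 0.9], [0.40, 0.60]],
    [[0.95, 0.05], [0.7, 0.3], [0.3, 0.7], [0.50, 0.50]],
])
labels, weights, mask = pseudo_label(probs)
print(labels, weights.round(2), mask)  # two samples survive the filter
```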

Result: Outperforms strong baselines on Arabic, Bangla, English, and Spanish datasets, significantly reduces performance gap between resource-rich and resource-poor settings, and demonstrates effectiveness across various situations.

Conclusion: The framework is suitable for scalable, cross-language mental health monitoring where labeled resources are limited, offering robust multilingual depression detection from social media text.

Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and data augmentation. Our framework uses a group of teacher models. Their predictions are combined through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.

[57] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi

Main category: cs.CL

TL;DR: LLMs show improved reasoning with intermediate steps, but this comes with high computational costs. The paper evaluates LLMs considering both accuracy and inference costs, finding MoE architectures balance performance/efficiency best, and identifies diminishing returns beyond certain compute thresholds.

DetailsMotivation: Current LLM research focuses on reasoning accuracy but overlooks the computational burden of generating long reasoning sequences. For industrial applications, model selection must consider both performance and resource constraints/inference costs.

Method: Conducted test-time-compute aware evaluation of contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Analyzed Mixture of Experts (MoE) architecture and traced Pareto efficiency trends over time.
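
Computing the frontier itself is mechanically simple once each model has a (compute, accuracy) point; the hard part the paper tackles is measuring test-time compute fairly. A sketch of the frontier bookkeeping with entirely fictional model names and numbers:

```python
def pareto_frontier(models):
    """Keep models not dominated: no other model has lower-or-equal compute
    and strictly higher accuracy. `models` holds (name, compute, accuracy)."""
    frontier = []
    for name, cost, acc in sorted(models, key=lambda m: (m[1], -m[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

models = [
    ("dense-7B",    1.0, 0.62),
    ("moe-16B-a3B", 0.8, 0.71),  # MoE: fewer active FLOPs, higher accuracy
    ("dense-13B",   2.0, 0.64),  # dominated by the MoE model
    ("dense-70B",   6.0, 0.78),
]
print(pareto_frontier(models))   # [('moe-16B-a3B', ...), ('dense-70B', ...)]
```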

Result: Mixture of Experts (MoE) architecture emerges as strong candidate for balancing performance and efficiency. Identified emergent trend of accuracy gain per unit of compute, and demonstrated saturation point for inference-time compute where accuracy gains diminish beyond certain thresholds.

Conclusion: While extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities. There’s a compute threshold beyond which additional reasoning steps yield diminishing returns, making efficiency-aware evaluation crucial for industrial applications.

Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.

[58] Practising responsibility: Ethics in NLP as a hands-on course

Malvina Nissim, Viviana Patti, Beatrice Savoldi

Main category: cs.CL

TL;DR: A course on Ethical Aspects in NLP using active learning methods that has been refined over four years across different institutions and produced reusable educational materials.

DetailsMotivation: As NLP systems become more pervasive, there's a growing need to integrate ethical considerations into NLP education, but this faces challenges due to the field's rapid evolution and the need to move beyond traditional technical training to foster critical thinking.

Method: Pedagogical approach grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. The course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds over four years.

Result: The course has yielded many reusable products including teaching materials and actual educational products created by students for diverse audiences. It has been successfully implemented and refined across various educational contexts.

Conclusion: By sharing their approach and experience, the authors hope to inspire other educators to incorporate social impact considerations into their curricula, addressing the essential need for ethical education in NLP.

Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field’s rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

[59] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Yanan Long

Main category: cs.CL

TL;DR: Triangulation: A causal standard for evaluating multilingual model circuits that requires necessity, sufficiency, and invariance across reference families.

DetailsMotivation: Multilingual models show unpredictable behavior across languages/scripts/cultures, and current mechanistic explanations lack causal rigor and cross-lingual validation.

Method: Formalize reference families as predicate-preserving variants, introduce triangulation acceptance rule (necessity, sufficiency, invariance), use automatic circuit discovery with triangulation filtering, ground in causal abstraction via interchange interventions.
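
The acceptance rule composes three checks over every variant in a reference family. A schematic encoding (the effect magnitudes and the 0.1 threshold are placeholder assumptions; the paper defines the rule abstractly over distributions of interchange interventions):

```python
from dataclasses import dataclass

@dataclass
class Effect:
    ablation_drop: float   # necessity: degradation when circuit is ablated
    patch_transfer: float  # sufficiency: gain when activations are patched in

def triangulate(effects_by_variant, min_effect=0.1):
    """Accept a candidate circuit only if both effects stay positive and
    large enough across every predicate-preserving reference variant."""
    return all(e.ablation_drop >= min_effect and e.patch_transfer >= min_effect
               for e in effects_by_variant.values())

# Hypothetical effects for one circuit across a reference family.
family = {
    "en": Effect(0.42, 0.35),
    "es": Effect(0.38, 0.30),
    "zh-translit": Effect(0.05, 0.28),  # invariance fails on this variant
}
print(triangulate(family))  # False: rejected despite passing in English
```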

Result: Triangulation provides falsifiable standard that filters spurious circuits passing single-environment tests but failing cross-lingual invariance, demonstrated across multiple model families, language pairs, and tasks.

Conclusion: Triangulation offers rigorous causal standard for mechanistic claims about multilingual models, connecting to pragmatic interpretability agenda and enabling more reliable cross-lingual circuit validation.

Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a \emph{causal} standard: claims must survive causal interventions and must \emph{cross-reference} across environments that perturb surface form while preserving meaning. We formalize \emph{reference families} as predicate-preserving variants and introduce \emph{triangulation}, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and \emph{accept or reject} those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.

[60] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, Jisun An, Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: PrivacyBench benchmark reveals RAG assistants leak user secrets in up to 26.56% of conversations, highlighting critical privacy risks in personalized AI systems.

DetailsMotivation: Personalized AI agents access sensitive user data (emails, chats, purchase histories), creating fundamental privacy risks. Systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being and safe deployment.

Method: Introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and multi-turn conversational evaluation to measure secret preservation. Test Retrieval-Augmented Generation (RAG) assistants and evaluate privacy-aware prompts.
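
The headline metric is the fraction of conversations in which a secret surfaces. A crude sketch using string containment as the leakage judge (the benchmark's actual judging is presumably more robust; the conversations and secrets here are invented):

```python
def leakage_rate(conversations, secrets):
    """Fraction of conversations where any assistant turn reveals the
    embedded secret (substring match as a toy stand-in for a real judge)."""
    leaked = sum(
        any(secret.lower() in turn.lower() for turn in convo)
        for convo, secret in zip(conversations, secrets)
    )
    return leaked / len(conversations)

convos = [
    ["Sure, I can help plan the party.", "Remember, Alice is pregnant, so..."],
    ["Here is the gift list you asked for."],
]
print(leakage_rate(convos, secrets=["Alice is pregnant", "surprise party"]))
```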

Result: RAG assistants leak secrets in up to 26.56% of interactions. Privacy-aware prompts reduce leakage to 5.12%, but retrieval mechanisms continue to access sensitive data indiscriminately, creating a single point of failure at the generator level.

Conclusion: Current RAG architectures are unsafe for wide-scale deployment due to fundamental privacy flaws. There’s urgent need for structural, privacy-by-design safeguards to ensure ethical and inclusive web systems that protect user secrets.

Abstract: Personalized AI agents rely on access to a user’s digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

[61] Big AI is accelerating the metacrisis: What can we do?

Steven Bird

Main category: cs.CL

TL;DR: The paper critiques how Big AI and language engineering are exacerbating ecological, meaning, and language crises, calling for a fundamental shift toward human-centered, life-affirming NLP that prioritizes human flourishing and planetary health.

DetailsMotivation: The world faces converging ecological, meaning, and language crises (metacrisis), which are being accelerated by Big AI. Current language engineering practices prioritize scalability over human values, serve plutocratic interests, and operate under false assumptions of value-neutrality, creating an urgent need for alternative approaches.

Method: The paper proposes exploring alternatives through collective intelligence and designing new paradigms for NLP that are centered on human flourishing and planetary well-being, moving away from current harmful practices.

Result: The analysis identifies the problematic role of language engineers in perpetuating harmful systems and calls for fundamental reorientation rather than incremental improvements.

Conclusion: There is an urgent need to transform NLP from its current harmful trajectory toward a life-affirming future that centers human flourishing on a living planet, requiring collective action and paradigm shift in language technology development.

Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.

[62] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou, Yuchen Wu, Yinzhu Piao, Denghui Cao, Tong Sun, Ziniu Li, Li Du, Bo Lei, Jiaheng Liu, Chenghua Lin, Zhaoxiang Zhang, Wenhao Huang, Jiajun Zhang

Main category: cs.CL

TL;DR: Encyclo-K is a statement-based benchmark that uses knowledge statements from textbooks as the fundamental unit, dynamically composing them into evaluation questions to address data contamination, enable multi-knowledge assessment, and reduce annotation costs.

DetailsMotivation: Existing LLM benchmarks have three key limitations: vulnerability to data contamination (models can memorize test data), restriction to single-knowledge-point assessment, and reliance on expensive domain expert annotation. There's a need for a more robust, comprehensive, and cost-effective evaluation framework.

Method: Extract standalone knowledge statements from authoritative textbooks, then dynamically compose them into evaluation questions through random sampling at test time. Each question aggregates 8-10 statements for comprehensive assessment. Annotation only requires verifying formatting compliance, not domain expertise.
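
The composition step can be made concrete with a small sketch. The exact item template is not given in the summary, so the true/false selection format and the 50% corruption rate below are illustrative assumptions:

```python
import random

def compose_question(statements, falsified, k=9, seed=None):
    """Compose one evaluation item from a statement pool.

    statements: list of (true) knowledge-statement strings.
    falsified:  dict mapping a statement to a corrupted (false) variant.
    The multiple-true-selection format is an assumption; the paper only
    specifies that each item aggregates 8-10 sampled statements.
    """
    rng = random.Random(seed)
    pool = rng.sample(statements, k)          # dynamic sampling at test time
    options, answer_key = [], []
    for s in pool:
        if s in falsified and rng.random() < 0.5:
            options.append(falsified[s])      # corrupted -> should be rejected
            answer_key.append(False)
        else:
            options.append(s)                 # verbatim -> should be accepted
            answer_key.append(True)
    order = list(range(len(options)))
    rng.shuffle(order)
    options = [options[i] for i in order]
    answer_key = [answer_key[i] for i in order]
    prompt = ("Which of the following statements are correct?\n"
              + "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options)))
    return prompt, answer_key
```

Because each item is a fresh random combination, the space of possible questions grows combinatorially, which is what makes memorization impractical.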

Result: Encyclo-K poses substantial challenges to LLMs with strong discriminative power. Top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy. Reasoning models range from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%, showing clear performance gradients.

Conclusion: Encyclo-K successfully addresses the three limitations of existing benchmarks and establishes a scalable framework for dynamic evaluation of LLMs’ comprehensive understanding across multiple fine-grained disciplinary knowledge statements.

Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs’ comprehensive understanding over multiple fine-grained disciplinary knowledge statements.

[63] mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang

Main category: cs.CL

TL;DR: mHC restores identity mapping in Hyper-Connections to fix training instability while maintaining performance gains.

DetailsMotivation: Hyper-Connections improve performance but lose identity mapping property, causing training instability, scalability issues, and memory overhead.

Method: Project HC’s residual connection space onto specific manifold to restore identity mapping, with infrastructure optimization for efficiency.
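
The summary does not state which manifold mHC uses, so the sketch below picks one plausible choice purely for illustration: constraining the residual-stream mixing matrix to be row-stochastic, so that a near-identity initialization recovers the plain residual connection and its identity-mapping property:

```python
import torch

def project_row_stochastic(logits):
    """Map unconstrained logits to a row-stochastic mixing matrix
    (non-negative rows summing to 1) via softmax. This is an illustrative
    manifold choice; the paper's exact construction may differ."""
    return torch.softmax(logits, dim=-1)

def mix_streams(streams, logits):
    # streams: (n, d) expanded residual streams; logits: (n, n), learnable.
    W = project_row_stochastic(logits)
    return W @ streams

# Near-identity initialization makes the mixing step act as an identity
# map at the start of training, mimicking a plain residual connection.
streams = torch.randn(4, 8)
logits = 10.0 * torch.eye(4)
out = mix_streams(streams, logits)   # ~= streams
```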

Result: mHC enables effective large-scale training with performance improvements and superior scalability.

Conclusion: mHC is a flexible, practical HC extension that advances topological architecture design and foundational model evolution.

Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[64] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi

Main category: cs.CL

TL;DR: ADOPT is a framework for optimizing prompts in multi-step LLM pipelines by modeling dependencies between steps and using text-gradient estimation with Shapley-based resource allocation.

DetailsMotivation: Multi-step LLM pipelines are effective for complex tasks but their performance heavily depends on prompts at each step. Joint optimization is difficult due to missing step-level supervision and inter-step dependencies, and existing methods yield suboptimal or unstable updates.

Method: ADOPT explicitly models dependencies between each LLM step and final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reduces multi-prompt optimization to flexible single-prompt optimization steps, and uses Shapley-based mechanism for adaptive resource allocation.
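
A rough sketch of how a Shapley-based allocator might work; the value function, permutation count, and proportional allocation rule below are assumptions for illustration, not the paper's exact mechanism:

```python
import random

def shapley_contributions(prompt_ids, value_fn, n_perm=20, seed=0):
    """Monte-Carlo Shapley estimate of each step-prompt's contribution to
    the pipeline's final metric. `value_fn(subset)` scores the pipeline
    with only the prompts in `subset` updated (others at baseline); its
    exact form is an interface assumption."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in prompt_ids}
    for _ in range(n_perm):
        order = list(prompt_ids)
        rng.shuffle(order)
        subset = set()
        prev = value_fn(frozenset(subset))
        for p in order:
            subset.add(p)
            cur = value_fn(frozenset(subset))
            phi[p] += (cur - prev) / n_perm   # marginal contribution
            prev = cur
    return phi

def allocate_budget(phi, total_steps):
    """Spend more optimization steps on prompts with larger estimated
    influence on the end-to-end outcome."""
    w = {p: max(v, 0.0) + 1e-9 for p, v in phi.items()}
    z = sum(w.values())
    return {p: round(total_steps * wi / z) for p, wi in w.items()}
```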

Result: Experiments on real-world datasets and diverse pipeline structures show ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.

Conclusion: ADOPT provides a novel framework for optimizing multi-step LLM pipelines by addressing dependency modeling and resource allocation challenges, offering improved performance over existing methods.

Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.

Luis Adrián Cabrera-Diego

Main category: cs.CL

TL;DR: Legal document classifier using DeBERTa V3 + LSTM over 48 randomly selected chunks (max 128 tokens each), with a Temporal-based deployment pipeline.

DetailsMotivation: Legal documents are challenging due to specialized vocabulary and extreme length, making full-document Transformer processing impossible, expensive, or slow.

Method: Combines DeBERTa V3 with LSTM, using 48 randomly-selected short chunks (max 128 tokens each) as input. Deployment pipeline built with Temporal for durable execution and reliable workflow.
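
A minimal sketch of the described chunk-then-aggregate architecture; the checkpoint name, [CLS] pooling, and LSTM width below are assumptions beyond what the summary states:

```python
import random

import torch
import torch.nn as nn
from transformers import AutoModel

class ChunkedLegalClassifier(nn.Module):
    """Encode each chunk with DeBERTa V3, aggregate chunk [CLS] vectors
    with an LSTM, and classify from the final hidden state."""
    def __init__(self, n_classes, hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, n_chunks, chunk_len)
        b, c, t = input_ids.shape
        enc = self.encoder(input_ids.view(b * c, t),
                           attention_mask=attention_mask.view(b * c, t))
        cls = enc.last_hidden_state[:, 0].view(b, c, -1)  # one vector per chunk
        _, (h, _) = self.lstm(cls)
        return self.head(h[-1])

def sample_chunks(token_ids, n_chunks=48, chunk_len=128, pad_id=0, seed=0):
    """Split a long document into fixed-size chunks, pick 48 at random,
    and pad when the document is shorter."""
    rng = random.Random(seed)
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    if len(chunks) > n_chunks:
        chunks = rng.sample(chunks, n_chunks)
    chunks = [c + [pad_id] * (chunk_len - len(c)) for c in chunks]
    chunks += [[pad_id] * chunk_len] * (n_chunks - len(chunks))
    return torch.tensor(chunks)   # (n_chunks, chunk_len)
```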

Result: Best model achieved weighted F-score of 0.898. Pipeline on CPU had median processing time of 498 seconds per 100 files.

Conclusion: Proposed approach effectively handles long legal documents through chunk-based processing while maintaining good classification performance, with reliable deployment using Temporal workflow orchestration.

Abstract: Classifying legal documents is a challenge: besides their specialized vocabulary, they can be very long. This means that feeding full documents to Transformer-based models for classification might be impossible, expensive, or slow. Thus, we present a legal document classifier based on DeBERTa V3 and an LSTM that takes as input a collection of 48 randomly selected short chunks (max 128 tokens). In addition, we present its deployment pipeline using Temporal, a durable execution solution, which allows us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.

[66] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Siddhant Agarwal, Adya Dhuler, Polly Ruhnke, Melvin Speisman, Md Shad Akhtar, Shweta Yadav

Main category: cs.CL

TL;DR: RESTOREx introduces a resource for detecting depressive symptoms in memes using LLM-generated and human-annotated explanations, while MAMAMemeia is a multi-agent framework based on Cognitive Analytic Therapy that achieves state-of-the-art performance.

DetailsMotivation: Memes have evolved beyond humor to express various emotions, including depressive sentiments. With increasing use of memes to express depression on social media, there's a need to identify depressive symptoms in this medium for mental health monitoring and intervention.

Method: Two main contributions: 1) RESTOREx - a resource for depressive symptom detection in memes using LLM-generated and human-annotated explanations; 2) MAMAMemeia - a collaborative multi-agent multi-aspect discussion framework based on Cognitive Analytic Therapy (CAT) Competencies for analyzing memes.

Result: MAMAMemeia improves upon current state-of-the-art by 7.55% in macro-F1 score and establishes a new benchmark, outperforming over 30 existing methods.

Conclusion: The paper successfully addresses the challenge of detecting depressive symptoms in memes through a novel clinical psychology-based framework and comprehensive dataset, achieving significant performance improvements over existing methods.

Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1, establishing a new benchmark against over 30 methods.

[67] Modeling Language as a Sequence of Thoughts

Nasim Borazjanizadeh, James McClelland

Main category: cs.CL

TL;DR: Thought Gestalt (TG) model improves Transformer efficiency by adding recurrent sentence-level “thought” states that persist in memory, reducing data and parameter requirements while improving relational reasoning.

DetailsMotivation: Current Transformer models rely too much on surface-level co-occurrence statistics, leading to brittleness in relational reasoning (like reversal curse), contextualization errors, and data inefficiency. Human cognition forms compact event-like representations that persist in memory, which inspired the development of a more efficient architecture.

Method: TG is a recurrent Transformer that models language at two levels: tokens and sentence-level “thought” states. It generates tokens sentence-by-sentence while cross-attending to a memory of prior sentence representations. Both token and sentence representations use the same parameters and are trained with a single next-token cross-entropy objective, allowing gradients to flow backward through cross-attention to optimize earlier sentence vectors.
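
A simplified single-layer sketch of the two-level generation loop; the dimensions, mean pooling, and omission of causal masking are simplifications for illustration, not the model's actual configuration:

```python
import torch
import torch.nn as nn

class ThoughtGestaltBlock(nn.Module):
    """One step of the two-level scheme: process the current sentence's
    tokens, cross-attend to the memory of prior sentence vectors, and
    append a pooled "thought" vector for this sentence to the memory.
    Causal masking within the sentence is omitted for brevity."""
    def __init__(self, vocab, d=256, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.token_layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, sentence_ids, memory):
        # sentence_ids: (1, t) current sentence; memory: (1, m, d).
        h = self.token_layer(self.emb(sentence_ids))
        if memory.size(1) > 0:
            ctx, _ = self.cross(h, memory, memory)   # read prior "thoughts"
            h = h + ctx
        thought = h.mean(dim=1, keepdim=True)        # sentence-level state
        # Keeping `thought` in the graph lets future token losses backprop
        # through cross-attention into earlier sentence representations.
        return self.out(h), torch.cat([memory, thought], dim=1)

block = ThoughtGestaltBlock(vocab=1000)
memory = torch.zeros(1, 0, 256)
logits, memory = block(torch.randint(0, 1000, (1, 12)), memory)
```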

Result: TG consistently outperforms GPT-2 in scaling experiments, with scaling fits showing GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG’s loss. TG also reduces errors on relational direction generalization tasks, specifically improving performance on father-son reversal curse probes.

Conclusion: The Thought Gestalt model demonstrates that incorporating persistent sentence-level representations inspired by human cognition can significantly improve Transformer efficiency and relational reasoning capabilities while maintaining a simple training objective.

Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events; this lack contributes to brittleness in relational direction (e.g., the reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level “thought” states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG’s loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

[68] AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Chao Peng, Bin Wang, Zhilei Long, Jinfang Sheng

Main category: cs.CL

TL;DR: AdaGReS is a redundancy-aware context selection framework for RAG that optimizes relevance while minimizing redundancy under token budget constraints, with adaptive parameter calibration and theoretical guarantees.

DetailsMotivation: Standard top-k retrieval in RAG often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation quality.

Method: AdaGReS performs greedy selection under token-budget constraints using marginal gains from an objective combining query-chunk relevance and intra-set redundancy penalties, with instance-adaptive calibration of the relevance-redundancy trade-off parameter.
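
A sketch of the greedy budgeted selection; the fixed trade-off parameter `lam` below stands in for the paper's closed-form instance-adaptive calibration, which is not reproduced here:

```python
def adagres_select(rel, sim, lengths, budget, lam=0.5):
    """Greedy token-budgeted context selection: repeatedly add the chunk
    with the best marginal gain, i.e. query relevance minus a redundancy
    penalty against the chunks already chosen.

    rel: per-chunk query relevance scores; sim: chunk-chunk similarity
    matrix; lengths: per-chunk token counts; budget: token budget.
    """
    selected, used = [], 0
    candidates = set(range(len(rel)))
    while candidates:
        def gain(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return rel[i] - lam * redundancy
        best = max(candidates, key=gain)
        candidates.discard(best)
        if gain(best) <= 0:        # no remaining chunk is worth adding
            break
        if used + lengths[best] <= budget:
            selected.append(best)
            used += lengths[best]
    return selected
```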

Result: Experiments on open-domain QA (Natural Questions) and biomedical corpus show consistent improvements in redundancy control and context quality, leading to better end-to-end answer quality and robustness.

Conclusion: AdaGReS effectively addresses redundancy in RAG context selection with adaptive parameter tuning and theoretical guarantees, improving overall system performance across diverse domains.

Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.

[69] CascadeNS: Confidence-Cascaded Neurosymbolic Model for Sarcasm Detection

Swapnil Mane, Vaibhav Khatavkar

Main category: cs.CL

TL;DR: CascadeNS is a confidence-calibrated neurosymbolic architecture for sarcasm detection that selectively activates symbolic or neural reasoning based on calibrated confidence measures, outperforming baselines by 7.44%.

DetailsMotivation: Sarcasm detection in product reviews requires both domain-specific symbolic pattern recognition and deep semantic understanding. Existing approaches either favor interpretable symbolic representation or semantic neural modeling, but rarely achieve both effectively. Hybrid methods typically combine these through feature fusion or ensembling, which can degrade performance.

Method: CascadeNS integrates symbolic and neural reasoning through selective activation rather than fusion. It uses a symbolic semigraph to handle pattern-rich instances with high confidence, while delegating semantically ambiguous cases to a neural module based on pre-trained LLM embeddings. The core innovation is a calibrated confidence measure derived from polarity-weighted semigraph scores that determines when symbolic reasoning is sufficient versus when neural analysis is needed.
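
The cascade itself is compact enough to sketch; the threshold value and the (label, confidence) interface of the symbolic scorer are assumptions about the wiring, not the paper's exact code:

```python
def cascade_predict(text, symbolic_score, neural_model, tau=0.8):
    """Confidence-cascaded inference: accept the symbolic prediction when
    its calibrated confidence clears a threshold, otherwise defer the
    instance to the neural module.

    symbolic_score(text) -> (label, confidence in [0, 1]), e.g. from
    polarity-weighted semigraph scores; neural_model(text) -> label.
    """
    label, conf = symbolic_score(text)
    if conf >= tau:
        return label, "symbolic"     # pattern-rich, high-confidence case
    return neural_model(text), "neural"  # semantically ambiguous case
```

Selective activation means only one branch runs per instance, unlike fusion or ensembling, which always pay for both.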

Result: Experiments on product reviews show that CascadeNS outperforms strong baselines by 7.44%.

Conclusion: The proposed confidence-calibrated neurosymbolic architecture effectively balances symbolic pattern recognition with neural semantic understanding through selective activation, achieving superior performance for sarcasm detection in product reviews compared to existing approaches.

Abstract: Sarcasm detection in product reviews requires balancing domain-specific symbolic pattern recognition with deep semantic understanding. Symbolic representations capture explicit linguistic phenomena that are often decisive for sarcasm detection. Existing work either favors interpretable symbolic representation or semantic neural modeling, but rarely achieves both effectively. Prior hybrid methods typically combine these paradigms through feature fusion or ensembling, which can degrade performance. We propose CascadeNS, a confidence-calibrated neurosymbolic architecture that integrates symbolic and neural reasoning through selective activation rather than fusion. A symbolic semigraph handles pattern-rich instances with high confidence, while semantically ambiguous cases are delegated to a neural module based on pre-trained LLM embeddings. At the core of CascadeNS is a calibrated confidence measure derived from polarity-weighted semigraph scores. This measure reliably determines when symbolic reasoning is sufficient and when neural analysis is needed. Experiments on product reviews show that CascadeNS outperforms the strong baselines by 7.44%.

[70] LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models

Weizhi Tang, Kwabena Nuamah, Vaishak Belle

Main category: cs.CL

TL;DR: The paper proposes using Linear Temporal Logic (LTL) to automatically generate temporal reasoning challenges for evaluating LLMs, creates a 2000-item dataset, benchmarks 12 LLMs, and analyzes their reasoning processes and performance patterns.

DetailsMotivation: To develop a systematic approach for evaluating temporal reasoning (TR) ability in LLMs using formal logic, addressing limitations of prior evaluation methods and enabling automated challenge generation.

Method: Proposes a pipeline using Linear Temporal Logic (LTL) to automatically synthesize TR challenges, constructs a 2000-item dataset called LTL, benchmarks 12 LLMs across 5 methods, and analyzes impact of formula operators and events on performance.
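
A plausible reconstruction of the formula-synthesis step; the operator inventory and recursive sampling scheme below are assumptions, not the paper's exact generator:

```python
import random

UNARY = ["X", "F", "G", "!"]      # next, eventually, globally, not
BINARY = ["U", "&", "|", "->"]    # until, and, or, implies

def random_ltl(events, n_ops, rng=None):
    """Recursively sample an LTL formula with `n_ops` operators over a
    set of atomic events, so challenge complexity can be controlled by
    varying the operator and event counts."""
    rng = rng or random.Random()
    if n_ops == 0:
        return rng.choice(events)
    if rng.random() < 0.4:
        op = rng.choice(UNARY)
        return f"{op}({random_ltl(events, n_ops - 1, rng)})"
    op = rng.choice(BINARY)
    left = rng.randrange(n_ops)   # split remaining operators across sides
    return (f"({random_ltl(events, left, rng)} {op} "
            f"{random_ltl(events, n_ops - 1 - left, rng)})")

print(random_ltl(["e1", "e2", "e3"], n_ops=4, rng=random.Random(7)))
```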

Result: Created comprehensive TR evaluation dataset, benchmarked 12 LLMs, identified 3 main issues in LLMs’ temporal reasoning processes, and discovered unexpected performance changes as problem complexity increases.

Conclusion: The LTL-based approach provides valuable insights into LLMs’ temporal reasoning abilities, revealing systematic issues and complexity-performance relationships that inform future TR evaluation and model improvement.

Abstract: Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. To study the TR ability in LLMs, prior works provide different ways for evaluating various aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, namely LTLBench, consisting of 2000 TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of varying the number of events and formula operators, which reveal 3 main issues in their temporal reasoning processes and the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.

[71] Semantic Parsing with Candidate Expressions for Knowledge Base Question Answering

Daehwan Nam, Gary Geunbae Lee

Main category: cs.CL

TL;DR: Grammar-augmented semantic parser with candidate expressions for KBQA, improving accuracy and decoding speed on KQA Pro and Overnight benchmarks.

DetailsMotivation: Existing semantic parsers use grammars for constrained decoding but lack ability to utilize large KB information, even though logical forms contain KB elements like entities and relations.

Method: Propose grammar augmented with candidate expressions for seq2seq PLM semantic parsing. Grammar defines actions as production rules, with constraints by types and candidate expressions. Includes sub-type inference, union types, and mask caching algorithm for speed.
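
The constraint idea can be illustrated at the character level; real decoding applies the same prefix check over tokenizer vocabularies via masks, which the paper's mask caching accelerates:

```python
def allowed_next_chars(prefix, candidates):
    """Return the characters that keep `prefix` a prefix of at least one
    candidate KB expression (entity or relation name). The character
    granularity is a simplification for clarity."""
    return {c[len(prefix)] for c in candidates
            if c.startswith(prefix) and len(c) > len(prefix)}

cands = ["capital_of", "capital_city", "population"]
print(allowed_next_chars("capital_", cands))   # allowed: 'o' and 'c'
```

Constraining each step this way guarantees the decoder can only emit KB elements that actually exist.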

Result: Candidate expression constraints increased accuracy on KQA Pro and Overnight benchmarks under both strong and weak supervision. Mask caching and sub-type inference greatly improved decoding speed.

Conclusion: Grammar augmented with candidate expressions improves semantic parsing accuracy and decoding speed for KBQA, with publicly available implementation.

Abstract: Semantic parsers convert natural language to logical forms, which can be evaluated on knowledge bases (KBs) to produce denotations. Recent semantic parsers have been developed with sequence-to-sequence (seq2seq) pre-trained language models (PLMs) or large language models, where the models treat logical forms as sequences of tokens. For syntactic and semantic validity, the semantic parsers use grammars that enable constrained decoding. However, the grammars cannot exploit the large amount of information in KBs, even though logical forms contain representations of KB elements such as entities and relations. In this work, we propose a grammar augmented with candidate expressions for semantic parsing on a large KB with a seq2seq PLM. The grammar defines actions as production rules, and our semantic parser predicts actions during inference under the constraints by types and candidate expressions. We apply the grammar to knowledge base question answering, where the constraints by candidate expressions assist a semantic parser to generate valid KB elements. We also introduce two special rules, sub-type inference and union types, and a mask caching algorithm. In particular, sub-type inference and the mask caching algorithm greatly increase the decoding speed of our semantic parser. We experimented on two benchmarks, KQA Pro and Overnight, where the constraints by candidate expressions increased the accuracy of our semantic parser, whether it was trained with strong supervision or weak supervision. In addition, our semantic parser had a fast decoding speed in the experiments. Our source code is publicly available at https://github.com/daehwannam/candexpr-sp.git.

[72] Automatic identification of diagnosis from hospital discharge letters via weakly-supervised Natural Language Processing

Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva

Main category: cs.CL

TL;DR: A weakly-supervised NLP pipeline for classifying Italian discharge letters without manual labeling, using transformer embeddings and clustering to generate weak labels for training a classifier.

DetailsMotivation: Traditional supervised approaches for identifying patient diagnoses from discharge letters require extensive manual annotation, which is impractical for large textual datasets. There's a need for scalable solutions that reduce annotation burden while maintaining accuracy.

Method: 1) Extract diagnosis-related sentences from discharge letters; 2) Use transformer-based model with additional pre-training on Italian medical documents to generate semantic embeddings; 3) Apply two-level clustering to embeddings; 4) Map clusters to diseases of interest to derive weak labels; 5) Use weak labels to train a transformer-based classifier.
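
A minimal sketch of the two-level clustering step (step 3); KMeans at both levels and the cluster counts are assumptions, since the summary does not name the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_level_cluster(embeddings, n_coarse=20, n_fine=5, seed=0):
    """Two-level clustering over sentence embeddings. The resulting
    (coarse, fine) cluster ids are then mapped to the diseases of
    interest (step 4) to yield weak labels for classifier training."""
    X = np.asarray(embeddings)
    coarse = KMeans(n_clusters=n_coarse, n_init=10,
                    random_state=seed).fit_predict(X)
    ids = [None] * len(X)
    for c in range(n_coarse):
        idx = np.where(coarse == c)[0]
        if len(idx) == 0:
            continue
        k = min(n_fine, len(idx))
        fine = KMeans(n_clusters=k, n_init=10,
                      random_state=seed).fit_predict(X[idx])
        for i, f in zip(idx, fine):
            ids[i] = (c, int(f))
    return ids
```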

Result: Achieved AUC of 77.68% (±4.30%) and F1-score of 78.14% (±4.89%) on bronchiolitis classification in 33,176 Italian discharge letters. Performance surpasses unsupervised methods and approaches fully supervised models, with robustness to cluster selection. Saves ~3 minutes per discharge letter (1,500+ hours for the dataset).

Conclusion: The weakly-supervised strategy is feasible for identifying diagnoses from Italian discharge letters, achieving strong performance, adaptability to various diseases, and offering a scalable solution that reduces manual annotation while maintaining reliable accuracy.

Abstract: Identifying patient diagnoses from discharge letters is essential to enable large-scale cohort selection and epidemiological research, but traditional supervised approaches rely on extensive manual annotation, which is often impractical for large textual datasets. In this study, we present a novel weakly-supervised Natural Language Processing pipeline designed to classify Italian discharge letters without requiring manual labelling. After extracting diagnosis-related sentences, the method leverages a transformer-based model with an additional pre-training on Italian medical documents to generate semantic embeddings. A two-level clustering procedure is applied to these embeddings, and the resulting clusters are mapped to the diseases of interest to derive weak labels for a subset of data, eventually used to train a transformer-based classifier. We evaluate the approach on a real-world case study on bronchiolitis in a corpus of 33,176 Italian discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region between 2017 and 2020. The pipeline achieves an area under the curve (AUC) of 77.68% (±4.30%) and an F1-score of 78.14% (±4.89%) against manual annotations. Its performance surpasses other unsupervised methods and approaches fully supervised models, maintaining robustness to cluster selection and promising generalizability across different disease types. It allows saving approximately 3 minutes of expert time per discharge letter, resulting in more than 1,500 hours for a dataset like ours. This study demonstrates the feasibility of a weakly-supervised strategy for identifying diagnoses from Italian discharge letters. The pipeline achieves strong performance, is adaptable to various diseases, and offers a scalable solution for clinical text classification, reducing the need for manual annotation while maintaining reliable accuracy.

[73] Bielik 7B v0.1: A Polish Language Model – Development, Insights, and Evaluation

Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas

Main category: cs.CL

TL;DR: Bielik 7B v0.1 is a 7B-parameter Polish language model with novel training techniques and evaluation frameworks that outperforms Mistral-7B-v0.1 by 9 percentage points on RAG Reader tasks.

DetailsMotivation: To develop a specialized generative text model for Polish language processing that addresses key challenges in language model development and advances Polish language AI capabilities.

Method: Trained on curated Polish corpora using Weighted Instruction Cross-Entropy Loss (balances different instruction types) and Adaptive Learning Rate (dynamically adjusts based on training progress). Created Open PL LLM Leaderboard and Polish MT-Bench for evaluation.
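
A sketch of what a weighted instruction cross-entropy might look like; per-example (rather than per-token) weighting is an assumption, since the summary does not give the exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, targets, instr_type, type_weights,
                            ignore_index=-100):
    """Cross-entropy where each example is scaled by a weight for its
    instruction type, balancing learning across instruction types.

    logits: (batch, seq, vocab); targets: (batch, seq);
    instr_type: (batch,) integer type ids; type_weights: (n_types,) tensor.
    """
    per_tok = F.cross_entropy(logits.transpose(1, 2), targets,
                              ignore_index=ignore_index, reduction="none")
    mask = (targets != ignore_index).float()
    # Average each example's loss over its non-padding target tokens.
    per_ex = (per_tok * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    w = type_weights[instr_type]
    return (w * per_ex).sum() / w.sum()
```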

Result: Achieves 9 percentage point improvement over Mistral-7B-v0.1 on RAG Reader task. Excels in Polish MT-Bench with Reasoning (6.15/10) and Role-playing (7.83/10) scores. Demonstrates significant performance gains in Polish language tasks.

Conclusion: Bielik 7B v0.1 represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field through innovative training techniques and comprehensive evaluation frameworks.

Abstract: We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in language model development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.

[74] Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots

Maria Paola Priola

Main category: cs.CL

TL;DR: Combines RAG for hallucination mitigation with NMISS scoring for detection in LLMs, using Italian health news as context. GPT-4 and Gemma2 perform best, while NMISS helps mid-tier models like Llama2/3 and Mistral by better evaluating contextual accuracy.

DetailsMotivation: Addresses the problem of hallucinations in Large Language Models, particularly in question-answering contexts where inaccurate responses can have serious consequences in domains like healthcare. Traditional evaluation metrics often misclassify contextually accurate responses as hallucinations.

Method: Combines two approaches: 1) Retrieval-Augmented Generation (RAG) framework for mitigation by grounding answers in external data, and 2) Negative Missing Information Scoring System (NMISS) for detection, which accounts for contextual relevance to better identify true hallucinations. Uses Italian health news articles as context for evaluation.

Result: Gemma2 and GPT-4 outperform other models, with GPT-4 producing answers most closely aligned with reference responses. Mid-tier models (Llama2, Llama3, Mistral) benefit significantly from NMISS scoring, which reveals their ability to provide richer contextual information that traditional metrics would incorrectly flag as hallucinations.

Conclusion: The combined RAG+NMISS approach offers new insights for both reducing and more accurately assessing hallucinations in LLMs. This has practical applications in healthcare and other domains where accurate, contextually grounded responses are critical.

Abstract: I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.

[75] Quantifying Positional Biases in Text Embedding Models

Reagan J. Lee, Samarth Goel, Kannan Ramchandran

Main category: cs.CL

TL;DR: Embedding models show strong positional bias, disproportionately prioritizing text at the beginning of inputs regardless of positional encoding mechanisms, with beginning ablations reducing similarity 12.3% more than end ablations.

DetailsMotivation: Embedding models are crucial for IR and semantic similarity tasks, but their handling of longer texts and associated positional biases remains underexplored, creating a gap in understanding model robustness and reliability.

Method: Conducted experiments on embedding models with ablation studies (inserting irrelevant text or removing text at different positions) and regression analysis to measure positional bias effects on cosine similarity between altered and original embeddings.
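
The core measurement is easy to sketch: embed a document before and after removing text at either end, then compare cosine similarities. The sentence-level ablation granularity below is an assumption:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def positional_ablation(sentences, embed, k=1):
    """Compare the embedding impact of removing k sentences from the
    beginning vs. the end of a document. `embed` maps a string to a
    vector (any embedding model)."""
    base = embed(" ".join(sentences))
    drop_head = embed(" ".join(sentences[k:]))
    drop_tail = embed(" ".join(sentences[:-k]))
    return {
        # A lower head-ablation similarity indicates a beginning bias.
        "sim_after_head_ablation": cosine(base, drop_head),
        "sim_after_tail_ablation": cosine(base, drop_tail),
    }
```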

Result: Embedding models disproportionately prioritize the beginning of inputs, with beginning ablations reducing cosine similarity by up to 12.3% more than end ablations. Sentence importance declines as position moves further from the start, even with content-agnostic approaches.

Conclusion: The findings quantify retrieval system sensitivity and reveal positional bias in embedding models, suggesting a new perspective on embedding model robustness and highlighting the impact of pre-processing strategies and positional encoding techniques.

Abstract: Embedding models are crucial for tasks in Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies demonstrate that insertion of irrelevant text or removal at the start of a document reduces cosine similarity between altered and original embeddings by up to 12.3% more than ablations at the end. Regression analysis further confirms this bias, with sentence importance declining as position moves further from the start, even with content-agnostic ablations. We hypothesize that this effect arises from pre-processing strategies and chosen positional encoding techniques. These findings quantify the sensitivity of retrieval systems and suggest a new lens towards embedding model robustness.

[76] Large Multimodal Models for Low-Resource Languages: A Survey

Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

Main category: cs.CL

TL;DR: Survey analyzes 117 studies on adapting large multimodal models for 96 low-resource languages, categorizing approaches and identifying visual information as key bridge for performance improvement.

DetailsMotivation: To systematically understand how researchers adapt large multimodal models (LMMs) for low-resource languages, addressing challenges of limited data and computational resources to make LMMs more accessible to speakers of understudied languages.

Method: Comprehensive analysis of 117 studies across 96 low-resource languages, categorizing works into resource-oriented and method-oriented contributions with relevant sub-categories. Comparison of method-oriented approaches in terms of performance and efficiency.

Result: Visual information serves as crucial bridge for improving model performance in low-resource settings. Identified key patterns in tackling limited data challenges, though significant challenges remain in hallucination mitigation and computational efficiency.

Conclusion: Provides clear understanding of current approaches and remaining challenges in adapting LMMs for low-resource languages, with open-source repository for ongoing research and accessibility improvements.

Abstract: In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

[77] ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim

Main category: cs.CL

TL;DR: Visual Instruction Rewriting transforms multimodal (vision+language) instructions into text-only commands using lightweight on-device models to preserve visual privacy while maintaining functionality.

DetailsMotivation: As AR/VR and camera-equipped devices become primary interfaces, cloud-based large vision-language models raise privacy concerns by transmitting sensitive visual data to servers and have limited real-time on-device usability.

Method: Developed a dataset of 39,000+ examples across 14 domains, created a compact VLM (250M parameters) pretrained on image captioning and fine-tuned for instruction rewriting, with quantized version under 500MB storage.

Result: The quantized model achieves effective instruction rewriting as measured by NLG metrics (BLEU, METEOR, ROUGE) and semantic parsing analysis, enabling privacy-focused multimodal applications.

Conclusion: Visual Instruction Rewriting enables privacy-preserving multimodal interaction by converting vision-language inputs to text-only commands that can be processed locally, addressing both privacy concerns and real-time usability limitations of cloud-based VLMs.

Abstract: Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

[78] A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Xiaoye Qu, Yafu Li, Zhao-Chen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng

Main category: cs.CL

TL;DR: Survey paper analyzing efficiency challenges in Large Reasoning Models (LRMs) that produce excessively long reasoning traces, and reviewing methods to improve reasoning efficiency across the model lifecycle.

DetailsMotivation: Recent LRMs like DeepSeek-R1 and OpenAI o1 show strong performance but produce excessively long reasoning traces with redundant content, over-analysis of simple problems, and superficial exploration of multiple paths. This inefficiency creates challenges for training, inference, and real-world deployment where token economy is critical.

Method: The paper is a survey that provides comprehensive overview of recent efforts to improve reasoning efficiency in LRMs. It identifies common patterns of inefficiency, examines methods across the LRM lifecycle (from pretraining to inference), and discusses future research directions. The authors also maintain a real-time GitHub repository tracking progress.

Result: The survey organizes and analyzes the current state of research on reasoning efficiency in LRMs, identifying key challenges and proposed solutions across different stages of model development and deployment.

Conclusion: The survey serves as a foundation for further exploration and aims to inspire innovation in improving reasoning efficiency in rapidly evolving Large Reasoning Models, addressing critical token economy concerns for practical deployment.

Abstract: Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.

[79] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Ding Chen, Qingchen Yu, Pengyuan Wang, Mengting Hu, Wentao Zhang, Zhengren Wang, Bo Tang, Feiyu Xiong, Xinchi Li, Chao Wang, Minchuan Yang, Zhiyu Li

Main category: cs.CL

TL;DR: xVerify is an efficient answer verifier for evaluating reasoning models that addresses limitations of existing evaluation methods in judging answer equivalence and extracting final answers from complex reasoning outputs.

DetailsMotivation: Existing evaluation methods and reward models are inadequate for reasoning models that produce complex reasoning, intermediate steps, and self-reflection. They struggle with judging answer equivalence and reliably extracting final answers from long, complex responses.

Method: Propose xVerify, an efficient answer verifier trained on VAR dataset (question-answer pairs from multiple LLMs across various datasets). Construct VAR with multi-round annotation for quality, then train xVerify models at different scales (0.5B to 3B parameters).

Result: All xVerify variants achieve over 95% F1 score and accuracy. xVerify-0.5B-I outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o overall. Using xVerify as reward model yields 18.4% improvement for Qwen2.5-7B vs direct generation.

Conclusion: xVerify demonstrates strong equivalence judgment capabilities and generalizability for evaluating reasoning models, with smaller models outperforming most existing methods and larger models surpassing GPT-4o. The approach effectively addresses challenges in reasoning model evaluation.

Abstract: With the release of OpenAI’s o1 model, reasoning models that adopt slow-thinking strategies have become increasingly common. Their outputs often contain complex reasoning, intermediate steps, and self-reflection, making existing evaluation methods and reward models inadequate. In particular, they struggle to judge answer equivalence and to reliably extract final answers from long, complex responses. To address this challenge, we propose xVerify, an efficient answer verifier for evaluating reasoning models. xVerify shows strong equivalence judgment capabilities, enabling accurate comparison between model outputs and reference answers across diverse question types. To train and evaluate xVerify, we construct the VAR dataset, which consists of question-answer pairs generated by multiple LLMs across various datasets. The dataset incorporates multiple reasoning models and challenging evaluation sets specifically designed for reasoning assessment, with a multi-round annotation process to ensure label quality. Based on VAR, we train xVerify models at different scales. Experimental results on both test and generalization sets show that all xVerify variants achieve over 95% F1 score and accuracy. Notably, the smallest model, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. In addition, reinforcement learning experiments using xVerify as the reward model yield an 18.4% improvement for Qwen2.5-7B compared with direct generation, exceeding the gains achieved with Math Verify as the reward. These results demonstrate the effectiveness and generalizability of xVerify. All xVerify resources are available on GitHub: https://github.com/IAAR-Shanghai/xVerify.

[80] Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

Main category: cs.CL

TL;DR: Pre-DPO improves DPO-based preference optimization by using a guiding reference model that provides foresight into optimal policy states and adaptively weights training samples for better data utilization.

DetailsMotivation: Standard DPO has inefficient data utilization due to identical initialization of policy and reference models, while SimPO lacks robustness due to no reference model. There's a need for better preference optimization that leverages reference models effectively.

Method: Pre-DPO introduces a guiding reference model that provides foresight into optimal policy states achievable through training data. This model adaptively assigns higher weights to suitable samples and lower weights to unsuitable ones, enhancing training efficiency.
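
The loss itself is standard DPO; what changes in Pre-DPO is where the reference log-probabilities come from. A minimal sketch under that reading:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss over sequence log-probabilities of the chosen
    (w) and rejected (l) responses, each of shape (batch,).

    In Pre-DPO, ref_logp_* come from a separately trained *guiding*
    reference model rather than a copy of the initial policy, so the
    reference implicitly re-weights samples with foresight about the
    policy state the preference data can reach."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```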

Result: Pre-DPO consistently improves performance of both DPO and SimPO on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks without needing external models or additional data.

Conclusion: Pre-DPO offers a simple yet effective enhancement to DPO-based training paradigms by intelligently leveraging reference models for better preference optimization and data utilization.

Abstract: Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.

[81] An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation

Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy, Hadas Abraham, Nitzan Barzilay, Eyal Shnarch, Michael E. Factor, Shila Ofek-Koifman, Paula Ta-Shma, Assaf Toledo

Main category: cs.CL

TL;DR: RAG HPO frameworks exist but lack rigorous benchmarking. This study shows RAG HPO can be done efficiently with greedy or random search, significantly boosting performance across diverse datasets.

DetailsMotivation: Optimizing RAG configurations is complex and resource-intensive. While RAG HPO frameworks have emerged, their effectiveness hasn't been rigorously benchmarked, creating a gap in understanding their real-world value.

Method: Comprehensive study with five HPO algorithms tested on five datasets from diverse domains, including a new real-world product documentation dataset. Explored the largest RAG HPO search space to date with full grid-search evaluations, using three evaluation metrics as optimization targets.
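
A sketch of greedy (coordinate-wise) RAG HPO; the parameter names and value grids are illustrative, and the sweep order reflects the paper's finding that model selection should be optimized first:

```python
def greedy_rag_hpo(search_space, evaluate, order=None):
    """Greedy coordinate search over RAG hyper-parameters: fix the best
    value for one parameter at a time. `evaluate(config) -> score` runs
    the RAG pipeline and returns the optimization target."""
    order = order or ["generator_model", "embedding_model",
                      "chunk_size", "top_k"]
    config = {p: search_space[p][0] for p in search_space}  # defaults
    for param in order:
        scores = {v: evaluate({**config, param: v})
                  for v in search_space[param]}
        config[param] = max(scores, key=scores.get)         # fix and move on
    return config

space = {"generator_model": ["llama-3-8b", "mistral-7b"],
         "embedding_model": ["bge-small", "e5-base"],
         "chunk_size": [256, 512], "top_k": [3, 5, 10]}
# best = greedy_rag_hpo(space, evaluate=my_eval_fn)
```

Greedy search evaluates only the sum of grid sizes rather than their product, which is why it stays cheap relative to full grid search.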

Result: RAG HPO can be done efficiently with either greedy approaches or random search. It significantly boosts RAG performance across all datasets. For greedy HPO, optimizing model selection first is better than following the traditional RAG pipeline order.

Conclusion: RAG hyper-parameter optimization is both feasible and effective, with simple approaches like greedy or random search providing significant performance improvements. The optimal strategy involves prioritizing model selection over pipeline-order optimization.

Abstract: Optimizing Retrieval-Augmented Generation (RAG) configurations for specific tasks is a complex and resource-intensive challenge. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To fill this gap, we present a comprehensive study involving five HPO algorithms over five datasets from diverse domains, including a newly curated real-world product documentation dataset. Our study explores the largest RAG HPO search space to date that includes full grid-search evaluations, and uses three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing model selection first is preferable to the common practice of following the RAG pipeline order during optimization.

[82] MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa

Main category: cs.CL

TL;DR: The paper introduces two benchmarks (MangaOCR and MangaVQA) and a specialized model (MangaLMM) for multimodal manga understanding, evaluating how well large multimodal models comprehend complex manga narratives.

DetailsMotivation: Manga is a complex multimodal narrative form blending images and text. Understanding manga at a human-like level could help manga creators reflect on and refine their stories, but current LMMs lack specialized evaluation and capabilities for this domain.

Method: 1) Created two benchmarks: MangaOCR for in-page text recognition and MangaVQA (526 manually constructed QA pairs) for contextual understanding via visual question answering. 2) Developed MangaLMM, a manga-specialized model finetuned from Qwen2.5-VL to handle both tasks.

Result: The benchmarks enable reliable evaluation across diverse narrative and visual scenarios. MangaLMM was evaluated through extensive experiments including comparisons with proprietary models (GPT-4o, Gemini 2.5), providing insights into how well LMMs understand manga.

Conclusion: The benchmarks and specialized model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga, potentially supporting manga creators in story refinement.

Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.

[83] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs

Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken

Main category: cs.CL

TL;DR: A novel atomic fact-checking framework improves LLM reliability in medical Q&A by decomposing responses into verifiable atomic facts and checking them against medical guidelines, achieving up to 40% overall answer improvement and a 50% hallucination detection rate.

DetailsMotivation: LLMs show extensive medical knowledge but suffer from hallucinations and inaccurate citations, hindering clinical adoption and regulatory compliance. Current methods like RAG partially address issues but hallucinations and low fact-level explainability persist.

Method: Introduces atomic fact-checking framework that decomposes LLM-generated responses into discrete, verifiable atomic facts, each independently verified against authoritative medical guideline knowledge base. Enables targeted error correction and direct tracing to source literature.

Result: Extensive evaluation with medical expert multi-reader assessments and automated open Q&A benchmark showed significant improvements: up to 40% overall answer improvement and 50% hallucination detection rate. Provides granular, transparent explanations by tracing each atomic fact to relevant database chunks.

Conclusion: This framework represents a crucial step toward more trustworthy clinical LLM applications, addressing key prerequisites for clinical use and fostering greater confidence in AI-assisted healthcare by improving factual accuracy and explainability.

Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
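
A minimal sketch of the decompose-then-verify loop described above, with naive stand-ins: sentence splitting plays the role of LLM-based fact decomposition, and lexical overlap against guideline chunks plays the role of the verification model. All names and thresholds are illustrative, not the authors' implementation.

```python
# Minimal sketch of atomic fact checking for a generated medical answer.
from dataclasses import dataclass

@dataclass
class Verdict:
    fact: str
    supported: bool
    source: str | None  # guideline chunk the fact traces back to

def decompose(answer: str) -> list[str]:
    """Split an answer into atomic facts (stand-in for LLM decomposition)."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify(fact: str, guidelines: list[str], threshold: float = 0.5) -> Verdict:
    """Mark a fact supported if enough of its words occur in one guideline chunk."""
    words = set(fact.lower().split())
    best_chunk, best_overlap = None, 0.0
    for chunk in guidelines:
        overlap = len(words & set(chunk.lower().split())) / max(len(words), 1)
        if overlap > best_overlap:
            best_chunk, best_overlap = chunk, overlap
    ok = best_overlap >= threshold
    return Verdict(fact, ok, best_chunk if ok else None)

guidelines = [
    "adjuvant radiotherapy is recommended after breast conserving surgery",
    "tamoxifen is indicated for hormone receptor positive tumors",
]
answer = ("Adjuvant radiotherapy is recommended after breast conserving surgery. "
          "Chemotherapy cures all tumors.")
for v in (verify(f, guidelines) for f in decompose(answer)):
    print(f"{'SUPPORTED' if v.supported else 'UNSUPPORTED'}: {v.fact}")
```

Unsupported facts would then be targeted for correction, while supported ones carry a pointer back to their guideline chunk, which is what gives the framework its fact-level explainability.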

[84] Automated Formalization of Legal Texts into Defeasible Deontic Logic with Large Language Models

Elias Horner, Cristinel Mateis, Guido Governatori, Agata Ciabattoni

Main category: cs.CL

TL;DR: LLMs can effectively formalize legal texts into Defeasible Deontic Logic using a structured pipeline with refinement steps, achieving results close to expert formalizations.

DetailsMotivation: To develop scalable automated formalization of legal texts using LLMs, addressing the challenge of transforming complex normative language into structured logical representations for legal informatics applications.

Method: A structured pipeline that segments legal texts into atomic snippets, extracts deontic rules, evaluates syntactic/semantic coherence, with a novel two-stage approach including refinement steps and various LLM configurations (prompting, fine-tuning).

Result: LLMs can produce formalizations closely aligned with expert-crafted representations when properly guided, demonstrated on Australian Telecommunications Consumer Protections Code with improved metrics and stricter error assessment.

Conclusion: LLMs show significant potential for scalable legal informatics through automated formalization of legal texts into Defeasible Deontic Logic, especially with structured guidance and refinement processes.

Abstract: We present a comprehensive approach to the automated formalization of legal texts using large language models (LLMs), targeting their transformation into Defeasible Deontic Logic (DDL). Our method employs a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. We introduce a refined success metric that more precisely captures the completeness of formalizations, and a novel two-stage pipeline with a dedicated refinement step to improve logical consistency and coverage. The evaluation procedure has been strengthened with stricter error assessment, and we provide comparative results across multiple LLM configurations, including newly released models and various prompting and fine-tuning strategies. Experiments on legal norms from the Australian Telecommunications Consumer Protections Code demonstrate that, when guided effectively, LLMs can produce formalizations that align closely with expert-crafted representations, underscoring their potential for scalable legal informatics.
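
The pipeline's segment/extract/check structure can be illustrated with a toy sketch. Keyword matching stands in for the LLM extraction step, and the rule format only loosely follows Defeasible Deontic Logic (O/P/F for obligation, permission, prohibition); everything here is assumed for illustration.

```python
# Toy sketch of the segment -> extract -> check pipeline for deontic rules.
import re

# Order matters: "must not" has to be matched before "must".
MODALITIES = {"must not": "F", "must": "O", "may": "P", "shall": "O"}

def segment(text: str) -> list[str]:
    """Split a clause into atomic snippets (one normative statement each)."""
    return [s.strip() for s in re.split(r"[;.]", text) if s.strip()]

def extract_rule(snippet: str) -> dict | None:
    for marker, op in MODALITIES.items():
        if marker in snippet:
            subject, _, consequent = snippet.partition(marker)
            return {"antecedent": subject.strip(), "modality": op,
                    "consequent": consequent.strip()}
    return None

def coherent(rules: list[dict]) -> bool:
    """Flag direct O/F conflicts on the same consequent (syntactic check only)."""
    seen: dict[str, str] = {}
    for r in rules:
        if {seen.get(r["consequent"]), r["modality"]} == {"O", "F"}:
            return False
        seen[r["consequent"]] = r["modality"]
    return True

clause = ("A supplier must verify the customer's identity; "
          "a supplier must not disclose personal information.")
rules = [r for s in segment(clause) if (r := extract_rule(s))]
print(rules, "coherent:", coherent(rules))
```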

[85] A Survey on LLM-Assisted Clinical Trial Recruitment

Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff

Main category: cs.CL

TL;DR: Survey paper analyzing LLM applications for clinical trial-patient matching, examining benchmarks, approaches, challenges, and future directions in clinical trial recruitment.

DetailsMotivation: LLMs have improved general NLP tasks but adoption in critical domains like clinical trial recruitment remains limited. Trials use natural language design and patient data includes both structured/unstructured text, making trial-patient matching ideal for LLMs' knowledge aggregation and reasoning abilities.

Method: Survey methodology: First comprehensive analysis of trial-patient matching task, contextualizing emerging LLM-based approaches. Critical examination of existing benchmarks, approaches, evaluation frameworks, adoption challenges, and future directions.

Result: Identifies limitations of current approaches: reliance on proprietary models and weak evaluation benchmarks. Classical methods are trial-specific while LLMs offer potential for more general solutions through distributed knowledge consolidation.

Conclusion: LLMs hold significant potential for clinical trial recruitment through trial-patient matching, but current applications face challenges including proprietary model dependencies and inadequate evaluation frameworks. The survey provides foundational analysis for advancing LLM adoption in clinical research.

Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.

[86] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang

Main category: cs.CL

TL;DR: MuRating is a framework that transfers English data-quality signals to create a multilingual rater for 17 languages, enabling better data selection for LLM pretraining and improving performance on both English and multilingual benchmarks.

DetailsMotivation: Existing data-quality selection methods for large language models focus almost exclusively on English, creating a gap for multilingual applications. There's a need for scalable approaches to identify high-quality data across multiple languages to improve LLM performance globally.

Method: MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator. The framework trains on monolingual, cross-lingual, and parallel text pairs to create a single multilingual rater for 17 target languages.

Result: Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines (QuRater, AskLLM, DCLM), the approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.

Conclusion: MuRating successfully transfers English data-quality signals to multilingual contexts, improving LLM performance across languages. The paper identifies areas for future work including translation fidelity, selection biases, and underrepresentation of narrative material in current approaches.

Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM, and others, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
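
The aggregation step, learning one scalar quality score per document from many pairwise preferences, can be sketched with a simple Bradley-Terry fit. This is our reading of "aggregates multiple raters via pairwise comparisons", not the released implementation, and the translation-based projection to other languages is omitted.

```python
# Bradley-Terry fit via stochastic gradient ascent on the log-likelihood:
# each comparison (winner, loser) nudges the winner's latent score up.
import math, random

random.seed(0)

def bradley_terry(n_docs: int, comparisons: list, lr: float = 0.1,
                  steps: int = 2000) -> list[float]:
    """comparisons: list of (winner, loser) document-index pairs."""
    s = [0.0] * n_docs  # latent quality scores
    for _ in range(steps):
        w, l = random.choice(comparisons)
        p_win = 1.0 / (1.0 + math.exp(s[l] - s[w]))  # P(winner beats loser)
        g = 1.0 - p_win  # gradient of log-likelihood w.r.t. s[w]
        s[w] += lr * g
        s[l] -= lr * g
    return s

# Three documents; raters mostly prefer doc 2 over doc 1 over doc 0.
comparisons = [(2, 0)] * 8 + [(2, 1)] * 6 + [(1, 0)] * 7 + [(0, 2)] * 1
print([round(x, 2) for x in bradley_terry(3, comparisons)])  # doc 2 highest
```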

[87] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut

Main category: cs.CL

TL;DR: PERK is a parameter-efficient method for long-context reasoning that uses test-time gradient updates to encode contexts into lightweight LoRA adapters, outperforming prompt-based baselines.

DetailsMotivation: Long-context reasoning requires identifying relevant information in extensive noisy contexts. Existing meta-learning methods for test-time learning are too memory-intensive for long contexts, limiting their applicability.

Method: PERK uses two nested optimization loops in meta-training: inner loop rapidly encodes contexts into low-rank adapters (LoRA) as parameter-efficient memory modules; outer loop learns to use updated adapters to recall and reason over relevant information from encoded long contexts.

Result: PERK significantly outperforms standard prompt-based long-context baselines, achieving up to 90% absolute performance gains for smaller models (GPT-2) and up to 27% for larger models (Qwen-2.5-0.5B). It’s more robust to reasoning complexity, length extrapolation, and location of relevant information.

Conclusion: PERK provides a scalable approach for long-context reasoning that, while memory-intensive during training, scales more efficiently at inference than prompt-based methods, enabling effective reasoning over noisy long contexts through parameter-efficient test-time learning.

Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
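
A first-order toy version of the nested loops on a synthetic regression task, assuming PyTorch: the inner loop encodes a "context" into a LoRA-style adapter with a few gradient steps, and the outer loop updates the base weights and adapter initialization so the adapted model answers held-out queries. PERK itself differentiates through the inner loop during meta-training; the Reptile-style update below is a simplification.

```python
# First-order sketch of nested test-time-learning loops on a toy "LM".
import torch

torch.manual_seed(0)
d, r = 8, 2
base = torch.nn.Linear(d, 1)          # stands in for the base model
A0 = torch.randn(r, d) * 0.1          # meta-learned adapter init
B0 = torch.zeros(1, r)
meta_opt = torch.optim.Adam(base.parameters(), lr=1e-2)

def adapted(x, A, B):
    return base(x) + (x @ A.t()) @ B.t()   # base output + low-rank LoRA term

for step in range(200):
    task_w = torch.randn(d)                        # latent fact in the context
    ctx_x, qry_x = torch.randn(16, d), torch.randn(4, d)
    ctx_y, qry_y = ctx_x @ task_w, qry_x @ task_w

    # Inner loop: encode the context into the adapter (test-time learning).
    A, B = A0.clone().requires_grad_(), B0.clone().requires_grad_()
    for _ in range(5):
        ctx_loss = ((adapted(ctx_x, A, B).squeeze(-1) - ctx_y) ** 2).mean()
        gA, gB = torch.autograd.grad(ctx_loss, [A, B])
        A = (A - 0.5 * gA).detach().requires_grad_()
        B = (B - 0.5 * gB).detach().requires_grad_()

    # Outer loop: make the *adapted* model answer queries about the context.
    meta_opt.zero_grad()
    qry_loss = ((adapted(qry_x, A, B).squeeze(-1) - qry_y) ** 2).mean()
    qry_loss.backward()            # updates base weights (first-order approx.)
    meta_opt.step()
    with torch.no_grad():          # Reptile-style update of the adapter init
        A0 += 0.05 * (A - A0)
        B0 += 0.05 * (B - B0)
```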

[88] Natural Language Processing for Tigrinya: Current State and Future Directions

Fitsum Gaim, Jong C. Park

Main category: cs.CL

TL;DR: Comprehensive survey of NLP research for Tigrinya language (2011-2025), analyzing over 50 studies across 15 tasks, revealing progression from rule-based to neural systems, identifying challenges, and providing roadmap for future work.

DetailsMotivation: Tigrinya is spoken by millions but severely underrepresented in NLP research, creating a need to systematically review existing work, identify gaps, and provide direction for future research to advance computational linguistics for this language.

Method: Systematic survey methodology analyzing over 50 studies from 2011-2025, reviewing computational resources, models, and applications across fifteen downstream NLP tasks, with analysis of trajectory from rule-based to neural architectures.

Result: Identified clear progression from foundational rule-based systems to modern neural architectures, driven by resource creation milestones. Found key challenges in Tigrinya’s morphological complexity and resource scarcity, with promising directions in morphology-aware modeling and cross-lingual transfer.

Conclusion: The survey serves as both reference for researchers and roadmap for advancing Tigrinya NLP, highlighting need for continued resource development and specialized approaches to address language-specific challenges. Public anthology of studies and resources provided.

Abstract: Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 50 studies from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across fifteen downstream tasks, including morphological processing, part-of-speech tagging, named entity recognition, machine translation, question-answering, speech recognition, and synthesis. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently driven by milestones in resource creation. We identify key challenges rooted in Tigrinya’s morphological properties and resource scarcity, and highlight promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves both as a reference for researchers and as a roadmap for advancing Tigrinya NLP. An anthology of surveyed studies and resources is publicly available.

[89] Multi-step retrieval and reasoning improves radiology question answering with large language models

Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh

Main category: cs.CL

TL;DR: RaR (radiology Retrieval and Reasoning) is a multi-step retrieval and reasoning framework that improves LLM performance in radiology question answering by enhancing diagnostic accuracy, reducing hallucinations, and providing better factual grounding compared to traditional single-step RAG systems.

DetailsMotivation: Traditional RAG systems for radiology QA use single-step retrieval, which limits their ability to handle complex clinical reasoning tasks. There's a need for more sophisticated frameworks that can improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology applications.

Method: Proposed RaR framework with multi-step retrieval and reasoning. Evaluated 25 LLMs across diverse architectures, parameter scales (0.5B to >670B), and training paradigms. Tested on 104 expert-curated radiology questions from RSNA-RadioQA and ExtendedQA datasets, plus 65 real-world radiology board examination questions for generalizability assessment.

Result: RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional RAG. Greatest gains in small-scale models; very large models (>200B) showed minimal changes (<2% improvement). Reduced hallucinations by mean 9.4%, retrieved clinically relevant context in 46% of cases. Even clinically fine-tuned models like MedGemma-27B benefited from RaR.

Conclusion: RaR enhances factuality and diagnostic accuracy in radiology QA, demonstrating that retrieval remains beneficial even for models with embedded domain knowledge. The framework shows potential for clinical applications, with all datasets, code, and framework publicly available for open research and clinical translation.

Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.
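
The control flow of a multi-step retrieval-and-reasoning loop can be sketched as follows. The `llm()` and `search()` stubs are placeholders for a real model and a medical corpus, and the stopping protocol (`SEARCH:` vs `ANSWER:`) is an assumption for illustration, not the RaR prompt format.

```python
# Schematic multi-step retrieve-then-reason loop: the model iteratively
# issues queries, accumulates evidence, and stops when confident.

def llm(prompt: str) -> str:
    # Stand-in: a real call would go to an LLM API.
    return "ANSWER: demo" if "evidence 2" in prompt else "SEARCH: next query"

def search(query: str, step: int) -> str:
    # Stand-in for retrieval over a medical corpus.
    return f"evidence {step} for '{query}'"

def multi_step_answer(question: str, max_steps: int = 4) -> str:
    evidence: list[str] = []
    query = question
    for step in range(1, max_steps + 1):
        evidence.append(search(query, step))
        prompt = (f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
                  + "\nReply 'SEARCH: <query>' or 'ANSWER: <answer>'.")
        reply = llm(prompt)
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        query = reply.removeprefix("SEARCH:").strip()
    return "insufficient evidence"

print(multi_step_answer("What imaging is first-line for suspected stroke?"))
```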

[90] MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu

Main category: cs.CL

TL;DR: MedQARo is the first large-scale Romanian medical QA benchmark with 105,880 cancer-related QA pairs, showing that fine-tuned LLMs outperform zero-shot models, highlighting the need for domain/language-specific adaptation.

DetailsMotivation: There's a lack of QA datasets in specific domains and languages, hindering development of robust AI models that can generalize across domains and languages, particularly for medical QA in Romanian.

Method: Created MedQARo dataset with 105,880 QA pairs from 1,242 cancer patient case summaries via manual annotation by 7 physicians (3,000 work hours). Evaluated 4 open-source LLMs in zero-shot and fine-tuned scenarios, plus 2 API-based models (GPT-5.2, Gemini 3 Flash).

Result: Fine-tuned models significantly outperform zero-shot models, showing pretrained models fail to generalize on MedQARo. Cross-domain evaluation reveals generalization challenges.

Conclusion: Domain-specific and language-specific fine-tuning is crucial for reliable clinical QA in Romanian. The dataset enables better assessment of generalization capabilities in medical AI.

Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality and large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions regard medical case summaries of 1,242 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours to generate the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.

[91] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

Seungkyu Lee, Nalim Kim, Yohan Jo

Main category: cs.CL

TL;DR: In-N-Out is an expert-annotated dataset of API graphs that captures API dependencies from documentation, significantly improving tool agent performance on complex multi-tool tasks.

DetailsMotivation: Tool agents (LLM-based systems using external APIs) struggle with complex tasks requiring proper API identification and compositional calling sequences. Current approaches using raw API documentation alone are insufficient for handling multi-tool queries.

Method: Convert API documentation into structured API graphs capturing API dependencies, create In-N-Out dataset (expert-annotated API graphs from two real-world API benchmarks), and use this for multi-tool query generation requiring compositional API calls.

Result: Using In-N-Out nearly doubles performance on both tool retrieval and multi-tool query generation compared to LLMs using documentation alone. Models fine-tuned on In-N-Out close 90% of the performance gap, showing effective learning of API documentation comprehension and parameter relationships.

Conclusion: Explicit API graphs show promise for improving tool agents, and In-N-Out serves as a valuable resource for advancing this research area. The dataset and code are publicly released.

Abstract: Tool agents–LLM-based systems that interact with external APIs–offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We release our dataset and code at https://github.com/holi-lab/In-N-Out-API-Graph.
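
A parameter-level API graph of the kind described can be represented as edges from one API's output field to another's input parameter, from which a topological sort yields a valid call order for a multi-tool query. The sketch below uses invented APIs and Python's standard-library `graphlib`; it is not the In-N-Out schema.

```python
# Toy parameter-level API graph: edges say "output field X of API A
# feeds input parameter Y of API B".
from graphlib import TopologicalSorter

# (producer_api, output_field) -> (consumer_api, input_param)
edges = [
    (("search_flights", "flight_id"), ("book_flight", "flight_id")),
    (("book_flight", "booking_ref"), ("send_receipt", "reference")),
    (("get_user", "email"), ("send_receipt", "recipient")),
]

deps: dict[str, set[str]] = {}
for (producer, _out), (consumer, _in) in edges:
    deps.setdefault(consumer, set()).add(producer)
    deps.setdefault(producer, set())

# Predecessors come first, so this is a valid call order.
print("call order:", list(TopologicalSorter(deps).static_order()))
for (p, out), (c, inp) in edges:
    print(f"  {p}.{out} -> {c}.{inp}")
```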

[92] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen, Wei Sun, Qixiang Yin, Zhixing Tan, Jiajun Zhang

Main category: cs.CL

TL;DR: ACE-RL: A reinforcement learning framework that uses adaptive constraint criteria to optimize LLMs for long-form generation, outperforming existing methods and even GPT-4o on writing tasks.

DetailsMotivation: Current approaches to long-form generation are limited by scarce high-quality training data and reliance on coarse-grained metrics that don't capture nuanced, scenario-specific requirements of real-world tasks.

Method: ACE-RL decomposes instructions into fine-grained adaptive constraint criteria, designs a reward mechanism based on constraint satisfaction, and uses reinforcement learning to optimize LLMs with these fine-grained signals.

Result: ACE-RL outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, and the top-performing model surpasses GPT-4o by 8.76%.

Conclusion: ACE-RL provides a more effective training paradigm for long-form generation scenarios by converting subjective quality evaluation into constraint verification and using fine-grained signals for optimization.

Abstract: Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), overlooking the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. Subsequently, we design a reward mechanism to quantify the response quality based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs using these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 8.76%, providing a more effective training paradigm in long-form generation scenarios.
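
The reward mechanism, scoring a response by the fraction of instruction-derived constraints it satisfies, can be sketched directly. The checkers below are simple callables standing in for the LLM-based constraint verification the paper uses, and the constraints themselves are invented.

```python
# Constraint-satisfaction reward: decompose an instruction into checkable
# constraints, then use the satisfied fraction as the RL training signal.
from typing import Callable

Constraint = tuple[str, Callable[[str], bool]]

def constraints_for(instruction: str) -> list[Constraint]:
    # A real system would derive these adaptively with an LLM, per instruction.
    return [
        ("at least 40 words", lambda r: len(r.split()) >= 40),
        ("mentions a concrete example", lambda r: "for example" in r.lower()),
        ("has a concluding sentence", lambda r: r.rstrip().endswith(".")),
    ]

def reward(instruction: str, response: str) -> float:
    cs = constraints_for(instruction)
    satisfied = sum(check(response) for _, check in cs)
    return satisfied / len(cs)  # in [0, 1], used as the RL reward

resp = ("Long-form writing benefits from structure. For example, an essay "
        "may open with a claim, support it with evidence, and end with a "
        "clear takeaway that ties the argument together, which is what we do "
        "here across several connected sentences to pass the length check.")
print(reward("Write a short essay on structure.", resp))  # 1.0
```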

[93] SiDiaC: Sinhala Diachronic Corpus

Nevidu Jayatilleke, Nisansa de Silva

Main category: cs.CL

TL;DR: SiDiaC is the first comprehensive Sinhala diachronic corpus spanning 5th-20th century CE with 58k words from 46 literary works, annotated by date and genre, enabling historical linguistic research for this low-resource language.

DetailsMotivation: To address the lack of diachronic resources for Sinhala, a low-resourced language, enabling historical linguistic studies, lexical change analysis, neologism tracking, and corpus-based lexicography.

Method: Digitized texts from National Library of Sri Lanka using Google Document AI OCR, post-processed for formatting and orthographic modernization. Applied careful annotation based on written dates after filtering for availability, authorship, copyright compliance, and data attribution. Used practices from other corpora like FarPaHC for syntactic annotation and text normalization. Implemented two-layer genre categorization: primary (Non-Fiction/Fiction) and secondary (Religious, History, Poetry, Language, Medical).

Result: Created SiDiaC corpus with 58k words across 46 literary works spanning 5th-20th century CE. The corpus is date-annotated and genre-categorized, serving as foundational resource for Sinhala NLP and enabling diachronic studies despite challenges with rare text access and secondary date sources.

Conclusion: SiDiaC significantly extends Sinhala language resources, enabling comprehensive diachronic research in lexical change, neologism tracking, historical syntax, and corpus-based lexicography for this previously under-resourced language.

Abstract: SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, annotated carefully based on the written date, after filtering based on availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, due to the shared characteristics of low-resourced language status. This corpus is categorised based on genres into two layers: primary and secondary. Primary categorisation is binary, classifying each book into Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala, enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.

[94] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen

Main category: cs.CL

TL;DR: LiRA is a multi-agent workflow that automates literature review writing, outperforming existing baselines in writing quality and citation accuracy while maintaining similarity to human reviews.

DetailsMotivation: The rapid growth of scientific publications makes comprehensive literature reviews difficult to maintain. While retrieval and screening have been automated, the writing phase remains under-explored, particularly regarding readability and factual accuracy.

Method: LiRA uses a multi-agent collaborative workflow that emulates human literature review processes, with specialized agents for content outlining, subsection writing, editing, and reviewing to produce cohesive review articles.

Result: LiRA outperforms current baselines (AutoSurvey and MASS-Survey) in writing and citation quality on SciReviewGen and ScienceDirect datasets, while maintaining competitive similarity to human-written reviews. It also shows robustness to reviewer model variation.

Conclusion: Agentic LLM workflows, even without domain-specific tuning, have significant potential to improve the reliability and usability of automated scientific writing for literature reviews.

Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.

[95] Large Language Model Sourcing: A Survey

Liang Pang, Jia Gu, Sunhao Dai, Zihao Wei, Zenghao Duan, Kangxi Wu, Zhiyi Yin, Jun Xu, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: This survey paper systematically investigates multi-perspective information sourcing for LLMs to address issues like hallucinations, bias, and copyright infringement, proposing a dual-paradigm taxonomy of proactive and retrospective approaches.

DetailsMotivation: The black-box nature of LLMs and the realism of their generated content have led to significant problems including hallucinations, bias, unfairness, and copyright infringement. To address these issues, sourcing information from multiple perspectives is essential for enhancing transparency and accountability.

Method: The survey organizes investigation around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. It proposes a unified dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches.

Result: The paper presents a systematic framework for understanding and implementing traceability across multiple dimensions of LLM development and deployment, providing a comprehensive survey of existing sourcing methods organized around the proposed taxonomy.

Conclusion: Traceability across the four sourcing dimensions enhances the transparency, accountability, and trustworthiness of LLMs deployment in real-world applications, addressing critical issues like hallucinations, bias, and copyright infringement through multi-perspective information sourcing.

Abstract: Due to the black-box nature of large language models (LLMs) and the realism of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement have become significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Moreover, a unified dual-paradigm taxonomy is proposed that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLMs deployment in real-world applications.

[96] Invisible Languages of the LLM Universe

Saurabh Khanna, Xinxu Li

Main category: cs.CL

TL;DR: The paper analyzes linguistic inequality in AI systems, identifying four categories of languages based on vitality and digital presence, and argues this reflects colonial-era power structures rather than technical necessity.

DetailsMotivation: Despite massive multilingual training data, LLMs exclude approximately 2,000 languages with millions of speakers, creating a digital crisis where marginalized languages remain invisible in AI ecosystems.

Method: Proposes a critical framework connecting empirical measurements of language vitality (demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice, analyzing data across all documented human languages.

Result: Identifies four language categories: Strongholds (33%), Digital Echoes (6%), Fading Voices (36%), and Invisible Giants (27% - high vitality but near-zero digitality). Shows patterns reflect colonial-era linguistic hierarchies.

Conclusion: English dominance in AI is an artifact of power structures that systematically exclude marginalized linguistic knowledge, requiring decolonization of language technology and democratization of AI access.

Abstract: Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world’s 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and critically, Invisible Giants (27%, high vitality but near-zero digitality) - languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.
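
The four-way typology is effectively a threshold split of the vitality-digitality plane. The sketch below shows the classification logic with made-up scores and thresholds; the paper derives both dimensions from demographic and web-presence data.

```python
# Quadrant classification over (vitality, digitality); values illustrative.

def categorize(vitality: float, digitality: float,
               v_thresh: float = 0.5, d_thresh: float = 0.5) -> str:
    if vitality >= v_thresh and digitality >= d_thresh:
        return "Stronghold"
    if vitality < v_thresh and digitality >= d_thresh:
        return "Digital Echo"
    if vitality >= v_thresh and digitality < d_thresh:
        return "Invisible Giant"  # millions of speakers, near-zero web presence
    return "Fading Voice"

samples = {"english": (0.9, 0.99), "latin": (0.1, 0.7),
           "oromo": (0.8, 0.05), "livonian": (0.05, 0.02)}
for lang, (v, d) in samples.items():
    print(f"{lang:10s} -> {categorize(v, d)}")
```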

[97] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen

Main category: cs.CL

TL;DR: MTI is a training-free framework that improves reasoning accuracy by selectively applying interventions only to high-uncertainty tokens, achieving significant performance gains with minimal computational overhead.

DetailsMotivation: Current LLM scaling approaches focus on increasing inference computation for better reasoning, but this comes at high efficiency costs. The authors discovered that reasoning uncertainty is highly localized - only a small subset of high-entropy tokens significantly affects output correctness, suggesting opportunities for more targeted interventions.

Method: Minimal Test-Time Intervention (MTI) includes two key components: 1) Selective CFG intervention that applies classifier-free guidance only at uncertain token positions, and 2) Lightweight negative-prompt guidance that reuses the main model’s KV cache to approximate unconditional decoding efficiently without additional computational burden.

Result: MTI achieves consistent improvements across general, coding, and STEM tasks: +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while maintaining high efficiency compared to traditional test-time scaling approaches.

Conclusion: The paper demonstrates that targeted interventions at uncertain token positions can significantly improve reasoning accuracy with minimal computational overhead, offering a more efficient alternative to brute-force test-time scaling approaches in LLMs.

Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks, e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while remaining highly efficient.
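
A toy illustration of selective classifier-free guidance, assuming NumPy: guidance is applied only at decoding steps whose next-token entropy exceeds a threshold, while confident steps keep the plain conditional distribution. The random logits, threshold, and guidance scale are illustrative, and the KV-cache reuse for the negative prompt is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def guided_step(cond_logits, uncond_logits, tau=2.0, scale=1.5):
    p = softmax(cond_logits)
    if entropy(p) < tau:                 # confident step: no intervention
        return p, False
    mixed = uncond_logits + scale * (cond_logits - uncond_logits)  # CFG
    return softmax(mixed), True

interventions = 0
for step in range(20):
    cond = rng.normal(0, rng.uniform(0.2, 3.0), size=50)  # varying confidence
    uncond = rng.normal(0, 1.0, size=50)
    _, used = guided_step(cond, uncond)
    interventions += used
print(f"CFG applied at {interventions}/20 steps")
```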

[98] OpenSIR: Open-Ended Self-Improving Reasoner

Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini

Main category: cs.CL

TL;DR: OpenSIR is a self-play framework where LLMs alternate teacher/student roles to generate and solve novel math problems without external supervision, enabling open-ended learning and performance improvements.

DetailsMotivation: Current LLM reasoning methods rely on annotated datasets with verifiable rewards, which limits ability to surpass human-level performance. Self-play offers an alternative but existing approaches depend on external verifiers or cannot learn open-endedly.

Method: OpenSIR uses self-play where an LLM alternates between teacher and student roles. The teacher generates novel problems optimized for both difficulty and diversity, while the student solves them. No external supervision is needed - the framework rewards problems that are appropriately challenging while exploring distinct concepts.

Result: Starting from a single trivial seed problem, OpenSIR significantly improves instruction models: Llama-3.2-3B-Instruct improved from 73.9 to 78.3 on GSM8K and from 28.8 to 34.4 on College Math; Gemma-2-2B-Instruct improved from 38.5 to 58.7 on GSM8K.

Conclusion: OpenSIR enables open-ended mathematical discovery through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, allowing autonomous progression from basic to advanced mathematics without external supervision.

Abstract: Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models’ ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.

[99] Can ensembles improve evidence recall? A case study

Katharina Beckh, Sven Heuser, Stefan Rüping

Main category: cs.CL

TL;DR: Ensemble approach improves complete evidence recall in medical NLP tasks over individual models

DetailsMotivation: Many applications (compliance, cataloging) require identifying the full set of contributing features (complete evidence), not just the minimal sufficient evidence provided by typical feature attribution methods

Method: Case study using existing language models on medical dataset with human-annotated complete evidence; ensemble approach aggregating evidence from several models; examined ensemble sizes and effect of evidence-guided training

Result: Ensemble approach improves evidence recall over individual models; provides insights on ensemble sizes and evidence-guided training effects

Conclusion: Ensemble methods are effective for obtaining complete evidence in medical NLP applications where full feature attribution is required

Abstract: Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications, such as compliance and cataloging, the full set of contributing features must be identified: complete evidence. We present a case study using existing language models and a medical dataset which contains human-annotated complete evidence. Our findings show that an ensemble approach, aggregating evidence from several models, improves evidence recall over individual models. We examine different ensemble sizes, the effect of evidence-guided training, and provide qualitative insights.
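
The ensemble idea reduces to taking the union of evidence flagged by several models and measuring recall against the annotated complete evidence. The token indices below are invented, purely to show why the union helps when each individual model finds only minimal sufficient evidence.

```python
# Union-of-evidence ensemble and its recall against gold complete evidence.

def recall(predicted: set[int], gold: set[int]) -> float:
    return len(predicted & gold) / len(gold)

gold_evidence = {3, 4, 7, 8, 12, 15}   # all contributing tokens (complete)
model_evidence = [
    {3, 4, 7},      # model A: minimal sufficient evidence only
    {7, 8, 12},     # model B
    {4, 15},        # model C
]

union = set().union(*model_evidence)
for i, ev in enumerate(model_evidence):
    print(f"model {chr(65 + i)} recall: {recall(ev, gold_evidence):.2f}")
print(f"ensemble recall: {recall(union, gold_evidence):.2f}")  # 1.00 here
```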

[100] Training Language Models to Explain Their Own Computations

Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas

Main category: cs.CL

TL;DR: LMs can be trained to generate natural language explanations of their own internal computations, and self-explanation works better than having other models explain them.

DetailsMotivation: To determine if language models can learn to faithfully describe their own internal computations and whether their privileged access to their own internals can be leveraged for better interpretability techniques.

Method: Fine-tuned LMs on existing interpretability techniques as ground truth to generate natural language descriptions of: (1) information encoded by LM features, (2) causal structure of internal activations, and (3) influence of specific input tokens on outputs.

Result: Explainer models show non-trivial generalization to new queries with only tens of thousands of training examples. Self-explanation (model explaining itself) works better than having different models explain its computations, even when the other model is more capable.

Conclusion: LMs can learn to reliably explain their internal computations, and such introspective explanations offer a scalable complement to existing interpretability methods.

Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs’ privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs’ internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models’ privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods. Code and data at https://github.com/TransluceAI/introspective-interp

[101] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

Main category: cs.CL

TL;DR: This paper investigates how Multimodal Large Language Models (MLLMs) process sound symbolism using phonetic iconicity as a probe, creating the LEX-ICON dataset and analyzing models’ phoneme-level attention patterns across multiple languages.

DetailsMotivation: The paper aims to use sound symbolism (non-arbitrary associations between phonetic forms and meanings) as a probe to understand how MLLMs interpret auditory information in human languages, bridging artificial intelligence and cognitive linguistics.

Method: Researchers created LEX-ICON, an extensive dataset of 8,052 words from four languages (English, French, Japanese, Korean) and 2,930 pseudo-words, annotated with semantic features. They investigated MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory inputs, measuring phoneme-level attention fraction scores across 25 semantic dimensions.

Result: Key findings show: (1) MLLMs demonstrate phonetic intuitions that align with existing linguistic research across multiple semantic dimensions, and (2) phonosemantic attention patterns reveal models’ focus on iconic phonemes.

Conclusion: The study provides the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability, bridging domains of artificial intelligence and cognitive linguistics through systematic investigation of sound symbolism processing.

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.

[102] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI

Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal

Main category: cs.CL

TL;DR: SpiderGen is an LLM-based workflow that generates Product Category Rules Process Flow Graphs for Life Cycle Assessments, reducing costs from $25,000+ to under $1 and time from 21-person days to under 10 minutes.

DetailsMotivation: Climate change concerns require tools to estimate environmental impact of consumer products through Life Cycle Assessments, which are currently expensive and time-consuming to produce manually.

Method: SpiderGen integrates traditional LCA taxonomy/methodology with LLM reasoning capabilities to generate PCR PFGs (graphical representations of LCA procedural information), outperforming baseline techniques like chain-of-thought and one-shot prompting.

Result: Achieves 65% F1-Score (vs 53% for one-shot prompting) across 10 sample data points when compared to 65 real-world LCA documents, with errors mainly due to differences in detail and scope of auxiliary processes.

Conclusion: SpiderGen significantly reduces LCA costs (from $25,000+ to <$1) and time (from 21-person days to <10 minutes), demonstrating practical potential for carbon impact estimation despite some scope/detail challenges.

Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a key concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes as compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.

[103] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Grid-based Discontinuous NER Models

Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin

Main category: cs.CL

TL;DR: Integrating image data augmentation techniques into grid-tagging models improves recognition of discontinuous named entities, achieving 1-2.5% overall F1 gains and 3.7-8.4% gains specifically for discontinuous entities.

DetailsMotivation: NER is challenging for discontinuous entities due to segmentation issues - traditional methods often missegment or miss cross-sentence discontinuous entities, hurting recognition accuracy.

Method: Integrate image data augmentation techniques (cropping, scaling, padding) into grid-tagging models to enhance discontinuous entity recognition and handle segmentation challenges.

Result: Traditional segmentation methods fail on cross-sentence discontinuous entities, but augmented grid models achieve 1-2.5% overall F1 gains and 3.7-8.4% gains for discontinuous entities on CADEC, ShARe13, and ShARe14 datasets.

Conclusion: Image data augmentation integrated with grid-tagging models effectively addresses segmentation and omission issues for discontinuous entities, significantly improving NER performance.

Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.

[104] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates

Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li, Ashish Sirasao, Emad Barsoum

Main category: cs.CL

TL;DR: Dual LoRA improves LoRA performance by separating low-rank matrices into magnitude and direction groups with ReLU and sign functions to better simulate full fine-tuning parameter updates.

DetailsMotivation: LoRA's low-rank assumption often leads to unsatisfactory performance. The authors aim to improve LoRA by incorporating inductive bias to better simulate the parameter updating process of full fine-tuning based on gradient-based optimization algorithms.

Method: Dual LoRA separates low-rank matrices into two groups: magnitude group (controls whether/how far to update parameters) and direction group (decides forward/backward movement). This is achieved by adding a ReLU function to the magnitude group and a sign function to the direction group.
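
A minimal PyTorch sketch of one plausible reading of this update rule; module and parameter names are illustrative, and the paper's exact composition, initialization (this sketch does not start the update at exactly zero, unlike vanilla LoRA), and handling of the sign function's zero gradient may differ:

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        out_f, in_f = base.weight.shape
        self.scale = alpha / r
        # magnitude branch: ReLU -> non-negative per-weight step sizes
        self.A_mag = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B_mag = nn.Parameter(torch.randn(out_f, r) * 0.01)
        # direction branch: sign -> which way each weight moves
        self.A_dir = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B_dir = nn.Parameter(torch.randn(out_f, r) * 0.01)

    def delta_w(self) -> torch.Tensor:
        magnitude = torch.relu(self.B_mag @ self.A_mag)  # how far to move
        # sign() has zero gradient almost everywhere; actual training would
        # need something like a straight-through estimator, omitted here
        direction = torch.sign(self.B_dir @ self.A_dir)  # -1 / 0 / +1
        return self.scale * magnitude * direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.delta_w().T
```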

Result: Experiments on various NLP tasks (NLU, commonsense reasoning) with RoBERTa, DeBERTa, and LLaMA-1/2/3 show Dual LoRA consistently outperforms LoRA and its state-of-the-art variants with the same number of trainable parameters.

Conclusion: Dual LoRA effectively improves LoRA performance by better simulating full fine-tuning parameter updates through separated magnitude and direction control, demonstrating consistent superiority over existing LoRA variants.

Abstract: Low-rank adaptation (LoRA) is one of the most popular parameter-efficient fine-tuning (PEFT) methods for adapting pre-trained large language models (LLMs) to specific downstream tasks. However, models trained with LoRA often exhibit unsatisfactory performance due to the low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve performance by incorporating an inductive bias into the original LoRA. Specifically, we separate the low-rank matrices into two groups: a magnitude group that controls whether, and how far, a parameter should be updated, and a direction group that decides whether the parameter should move forward or backward, to better simulate the parameter-updating process of full fine-tuning under gradient-based optimization algorithms. We show that this can be achieved simply by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct experiments over a wide range of NLP tasks, including natural language understanding (NLU) and commonsense reasoning datasets, with RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that Dual LoRA consistently outperforms LoRA and its state-of-the-art variants with the same number of trainable parameters.

[105] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation

Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, Manabu Okumura

Main category: cs.CL

TL;DR: MBR decoding improves generative Error Span Detection by selecting hypotheses based on similarity to human annotations rather than maximum likelihood, outperforming MAP decoding while reducing computational cost through distillation.

DetailsMotivation: Current generative ESD methods use MAP decoding which assumes model probabilities perfectly correlate with human annotations, but this assumption often fails as models sometimes assign higher likelihood to incorrect annotations than human ones.

Method: Apply Minimum Bayes Risk (MBR) decoding to generative ESD using sentence- or span-level similarity functions to select candidate hypotheses based on approximate similarity to human annotations, then distill MBR decisions into a greedy search model to reduce computational cost.
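
For intuition, a minimal sketch of MBR selection, where each sampled candidate serves as a pseudo-reference for the others; the toy `difflib` similarity merely stands in for the paper's sentence- or span-level similarity functions:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # toy stand-in for the paper's sentence- or span-level similarity
    return difflib.SequenceMatcher(None, a, b).ratio()

def mbr_select(candidates, sim):
    """Return the candidate with the highest expected similarity to the
    others, treating sampled candidates as pseudo-references."""
    best, best_score = candidates[0], float("-inf")
    for h in candidates:
        score = sum(sim(h, r) for r in candidates if r is not h)
        if score > best_score:
            best, best_score = h, score
    return best

# toy error-span annotations; MBR picks the most "central" hypothesis
hyps = ["major/accuracy@3-7", "minor/fluency@3-7", "major/accuracy@3-8"]
print(mbr_select(hyps, similarity))
```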

Result: MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at system and sentence levels on WMT24 Metrics Shared Task, with distillation effectively removing inference-time latency bottlenecks.

Conclusion: MBR decoding provides a more effective approach for generative ESD by selecting hypotheses based on similarity to human judgments rather than maximum likelihood, with distillation making it computationally practical for real-world applications.

Abstract: Error Span Detection (ESD) extends automatic machine translation (MT) evaluation by localizing translation errors and labeling their severity. Current generative ESD methods typically use Maximum a Posteriori (MAP) decoding, assuming that the model-estimated probabilities are perfectly correlated with similarity to the human annotation, but we often observe higher likelihood assigned to an incorrect annotation than to the human one. We instead apply Minimum Bayes Risk (MBR) decoding to generative ESD. We use a sentence- or span-level similarity function for MBR decoding, which selects candidate hypotheses based on their approximate similarity to the human annotation. Experimental results on the WMT24 Metrics Shared Task show that MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at the system and sentence levels. To reduce the computational cost of MBR decoding, we further distill its decisions into a model decoded via greedy search, removing the inference-time latency bottleneck.

[106] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

Michael H. Coen

Main category: cs.CL

TL;DR: The paper introduces a new evaluation framework for dialogue topic segmentation that separates boundary scoring from boundary selection, using window-tolerant F1 alongside density and alignment diagnostics to better assess segmentation quality across different granularity regimes.

DetailsMotivation: Current evaluation practice for dialogue topic segmentation relies on strict boundary matching and F1 metrics, which don't account for varying annotation granularities. Modern LLM-based conversational systems need better segmentation evaluation as they use segmentation to manage conversation history beyond fixed context windows, where poor segmentation degrades efficiency and coherence.

Method: The paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). This separates boundary scoring from boundary selection, allowing evaluation across density regimes rather than at a single operating point. The framework is tested across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions.
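
A minimal sketch of one common definition of window-tolerant F1 (greedy one-to-one matching of boundaries within a tolerance window); the paper's exact matching protocol may differ:

```python
def window_f1(pred, ref, window=2):
    """Window-tolerant F1 over boundary indices via greedy 1-to-1 matching."""
    matched, tp = set(), 0
    for p in pred:
        for j, r in enumerate(ref):
            if j not in matched and abs(p - r) <= window:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# widening the window trades boundary precision for tolerance to granularity
print(window_f1([10, 25, 41], [11, 27, 60], window=2))  # ~0.667
```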

Result: Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. Boundary-based metrics are strongly coupled to boundary density, with threshold sweeps producing larger W-F1 changes than switching between segmentation methods.

Conclusion: Topic segmentation should be viewed as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.

Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large language model (LLM) based conversational systems increasingly rely on segmentation to manage conversation history beyond fixed context windows. In such systems, unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.

[107] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds

Main category: cs.CL

TL;DR: The paper analyzes Heaps’ law in conversations across video chat and movies, finding vocabulary scaling differs by parts of speech.

DetailsMotivation: Conversations are fundamental to social connection and well-being, but little work has applied Heaps' law (vocabulary scaling with document length) to conversations or examined how language features affect this scaling.

Method: The researchers measured Heaps’ law for conversations in two distinct mediums: (1) video chat conversations between strangers, and (2) fictional character conversations in movies. They analyzed how vocabulary size scales with conversation length across different parts of speech.
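
Heaps' law posits vocabulary growth V(n) ≈ K·n^β. A minimal numpy sketch of estimating the exponent β from a token stream, e.g. after filtering to a single part of speech; function and variable names are illustrative:

```python
import numpy as np

def heaps_exponent(tokens):
    """Fit V(n) ~ K * n**beta to the vocabulary-growth curve (log-log slope)."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    n = np.arange(1, len(tokens) + 1)
    beta, _log_k = np.polyfit(np.log(n), np.log(growth), 1)
    return float(beta)

# e.g. compare exponents per part of speech:
# beta_nouns = heaps_exponent([t for t, pos in tagged if pos == "NOUN"])
```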

Result: The study found that scaling of vocabulary size differs by parts of speech, meaning different grammatical categories show different patterns of vocabulary growth as conversations progress.

Conclusion: The findings are discussed through behavioral and linguistic frameworks, suggesting that the differential scaling patterns by parts of speech provide insights into how conversations unfold and how language features impact vocabulary dynamics in dialogue.

Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps’ law, which holds that vocabulary size scales with document length. Little work on Heaps’ law has looked at conversation or considered how language features impact scaling. We measure Heaps’ law for conversations recorded in two distinct media: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.

[108] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang

Main category: cs.CL

TL;DR: MobileWorld is a new, more challenging mobile-use benchmark that addresses AndroidWorld’s limitations by featuring realistic scenarios, cross-application workflows, and novel task categories, with agents showing significantly lower success rates.

DetailsMotivation: AndroidWorld has become saturated (agents achieving >90% success) and lacks key application categories (e-commerce, enterprise communication) and realistic mobile-use scenarios with vague instructions and hybrid tool usage.

Method: Created MobileWorld with 201 tasks across 20 apps, using open-source alternatives to industry standards (e.g., Mattermost for Slack) for reproducible evaluation. Introduces long-horizon cross-application workflows, agent-user interaction tasks, and MCP-augmented tasks. Developed planner-executor agentic framework with extended action spaces.

Result: MobileWorld is substantially more challenging: it requires nearly twice as many completion steps (27.8 vs. 14.3) and has a far higher share of multi-app tasks (62.2% vs. 9.5%). The best agentic framework achieves a 51.7% success rate and the end-to-end model 20.9%, a sharp drop from AndroidWorld’s >90%.

Conclusion: MobileWorld provides a more realistic and challenging benchmark that reveals current agents’ limitations in handling complex mobile-use scenarios, highlighting significant room for improvement in future research.

Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows, requiring nearly twice as many completion steps on average (27.8 vs. 14.3) and featuring a significantly higher proportion of multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. To overcome the limitations of existing environments, MobileWorld achieves a balance between production-grade utility and reproducible evaluation by utilizing open-source alternatives to industry standards (e.g., Mattermost for Slack). This approach enables a fully observable and controlled environment through source code modification and direct backend database access for precise verification. MobileWorld also introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. To facilitate evaluation, we develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting ample headroom for future research.

[109] Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, Chuang Hu

Main category: cs.CL

TL;DR: HIC-Bench is a novel evaluation framework that categorizes LLM hallucinations into Intelligent (creative/valuable) and Defective (erroneous) types, enabling systematic study of their interplay in scientific innovation tasks across multiple domains.

DetailsMotivation: Current hallucination detection methods focus too narrowly on factual consistency, failing to capture the potential creative value of some hallucinations and struggling to balance creativity with accuracy in scientific tasks.

Method: Proposes HIC-Bench with three core features: (1) Structured IH/DH assessment using multi-dimensional metrics combining TTCT creativity measures with hallucination-specific dimensions, (2) Cross-domain applicability across ten scientific domains, and (3) Dynamic Prompt Optimization using DHP to guide models toward creative yet reliable outputs.

Result: Experimental results show a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized, positioning intelligent hallucinations as catalysts for creativity and drivers of scientific innovation.

Conclusion: HIC-Bench provides a valuable platform for advancing research into the creative intelligence of LLM hallucinations, challenging the conventional view of hallucinations as purely negative errors and revealing their potential to drive scientific innovation.

Abstract: Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.

[110] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao

Main category: cs.CL

TL;DR: UniHetero study shows visual generation enhances understanding only at semantic level (not pixel level) in large-scale vision-language models, revealing superior data scaling and effective visual detail capture through autoregression on input embeddings.

DetailsMotivation: To investigate whether visual generation tasks can enhance visual understanding in unified vision-language models at large data scale (>200M samples), challenging the common assumption that generation naturally strengthens understanding.

Method: Proposes UniHetero, a concise unified model structure, and analyzes it under large-scale pretraining. Key approaches include: semantic-level generation (autoregression on high-level visual representations in LLM), avoiding pixel-level objectives that interfere with LLM, and using autoregression on input embeddings instead of vision encoders.
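
As a schematic of what semantic-level visual autoregression could look like (the paper's actual objective is not specified here; the cosine loss, projection head, and stop-gradient are assumptions), a short PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def semantic_ar_loss(hidden: torch.Tensor, vis_embeds: torch.Tensor,
                     proj: torch.nn.Module) -> torch.Tensor:
    """Predict the NEXT visual input embedding from the LLM hidden state at
    each visual position, supervising in embedding space (not pixel space).

    hidden:     (B, N, d_model) LLM outputs over N visual tokens
    vis_embeds: (B, N, d_vis)   the model's own visual input embeddings
    proj:       linear head mapping d_model -> d_vis (an assumption)
    """
    pred = proj(hidden[:, :-1])          # predictions for positions 1..N-1
    target = vis_embeds[:, 1:].detach()  # stop-gradient on targets (assumed)
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```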

Result: Three key findings: (1) Generation improves understanding only at semantic level, not pixel level (pixel-level objectives degrade understanding at scale). (2) Unified generation-understanding shows superior data scaling and higher data utilization compared to understanding alone. (3) Autoregression on input embeddings effectively captures visual details with less cumulative error and is modality-independent.

Conclusion: Visual generation can enhance understanding in unified vision-language models, but only when operating at semantic level through autoregression on high-level representations. This approach reveals better data scaling properties and provides an effective, modality-independent method for capturing visual details that enables pixel-level generation.

Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate semantics, not pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with the LLM, understanding performance often degrades. (2) Generation reveals a superior data-scaling trend and higher data utilization. Unified generation-understanding demonstrates a superior scaling trend compared to understanding alone, revealing a more effective way to learn vision-only knowledge directly from the vision modality rather than via captioning to text. (3) Autoregression on input embeddings is effective for capturing visual details. Compared with the commonly used vision-encoder targets, visual autoregression on input embeddings shows less cumulative error and is modality-independent, so it can be extended to all modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors, and further enable pixel-level image generation.

cs.CV

[111] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Ankan Aich, Yangming Lee

Main category: cs.CV

TL;DR: Monocular depth estimation for robotic surgery using synthetic priors from Depth Anything V2 with DV-LORA adaptation, achieving SOTA results on SCARED dataset with 98.1% accuracy.

DetailsMotivation: Current self-supervised MDE methods for robotic surgery fail in specular, fluid-filled endoscopic environments due to boundary collapse on thin surgical tools and transparent surfaces.

Method: Leverage high-fidelity synthetic priors from Depth Anything V2 architecture, adapt to medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA) for efficient parameter adaptation, and introduce physically-stratified evaluation protocol on SCARED dataset.

Result: Achieves a new state-of-the-art with 98.1% accuracy (δ < 1.25) and reduces Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
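
The δ < 1.25 accuracy and Squared Relative Error cited above are standard monocular-depth metrics; a minimal numpy sketch, assuming valid positive depth maps:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """Standard MDE metrics over valid (positive) depth pixels."""
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))           # fraction of accurate pixels
    sq_rel = float(np.mean((pred - gt) ** 2 / gt))  # squared relative error
    return delta1, sq_rel
```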

Conclusion: The approach successfully bridges synthetic-to-real gap for surgical depth estimation, providing robust performance in challenging endoscopic environments with specular reflections and thin structures.

Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (δ < 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.

[112] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

Yizhi Liu, Ruitao Pu, Shilin Xu, Yingke Chen, Quan-Hui Liu, Yuan Sun

Main category: cs.CV

TL;DR: NIRNL is a robust cross-modal retrieval framework that handles noisy labels through cross-modal margin preserving and neighbor-aware instance refining with tailored optimization strategies.

DetailsMotivation: Cross-modal retrieval suffers from noisy labels in multi-modal data collection, which degrades model performance. Existing robust methods fail to simultaneously achieve high performance ceilings, calibration reliability, and data utilization rates.

Method: Proposes NIRNL framework with two main components: 1) Cross-modal Margin Preserving (CMP) to adjust relative distances between positive/negative pairs for better discrimination; 2) Neighbor-aware Instance Refining (NIR) to identify pure, hard, and noisy subsets through cross-modal neighborhood consensus, then apply tailored optimization strategies for each subset.
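
As a generic stand-in for the CMP component (the paper's actual margin formulation is not specified here), a minimal PyTorch sketch of a bidirectional hinge loss that widens the gap between matched and mismatched image-text pairs:

```python
import torch
import torch.nn.functional as F

def margin_preserving_loss(img: torch.Tensor, txt: torch.Tensor,
                           margin: float = 0.2) -> torch.Tensor:
    """Each matched pair should beat every mismatched pair in the batch
    by at least `margin` in cosine similarity, in both directions."""
    sims = F.normalize(img, dim=1) @ F.normalize(txt, dim=1).T  # (B, B)
    pos = sims.diag().unsqueeze(1)                              # (B, 1)
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_i2t = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (margin + sims - pos.T).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```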

Result: Extensive experiments on three benchmark datasets show NIRNL achieves state-of-the-art performance with remarkable robustness, especially under high noise rates.

Conclusion: NIRNL effectively addresses noisy label problems in cross-modal retrieval by maximizing data utilization while mitigating error propagation, outperforming existing robust methods.

Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale, well-annotated data, the annotation of multi-modal data inevitably contains some noise, which degrades the retrieval performance of the model. To tackle this problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome these limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure, hard, and noisy subsets through cross-modal neighborhood consensus. Afterward, we construct tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.

[113] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments

Surya Rayala, Marcos Quinones-Grueiro, Naveeduddin Mohammed, Ashwin T S, Benjamin Goldberg, Randall Spain, Paige Lawton, Gautam Biswas

Main category: cs.CV

TL;DR: Video-based assessment pipeline for urban warfare training using computer vision to extract performance metrics from training videos without additional hardware.

DetailsMotivation: Traditional performance assessment methods for military training (like Enter and Clear the Room drills) rely on costly sensors or subjective human observation, limiting scalability and accuracy. There's a need for objective, automated assessment of cognitive, psychomotor, and teamwork skills in synthetic training environments.

Method: Computer vision models extract 2D skeletons, gaze vectors, and movement trajectories from training videos. Task-specific metrics measure psychomotor fluency, situational awareness, and team coordination. These feed into an extended Cognitive Task Analysis hierarchy with weighted combinations to generate overall performance scores.

Result: Demonstrated with real-world ECR drills, providing actionable domain-specific metrics capturing individual and team performance. The system supports After Action Reviews with interactive dashboards in Gamemaster and GIFT frameworks.

Conclusion: Video-based assessment enables scalable evaluation in Synthetic Training Environments. Limitations include tracking difficulties and ground-truth validation. Future work includes expanding to 3D video data and leveraging video analysis for broader STE applications.

Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain specific metrics that capture individual and team performance. We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.

[114] Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

Main category: cs.CV

TL;DR: D²VLM is a factorized video-language model that decouples temporal grounding and textual response learning using evidence tokens and factorized preference optimization, achieving better event-level video understanding.

DetailsMotivation: Existing video-language models struggle with accurate temporal grounding for event-level perception. The authors observe that temporal grounding and textual response form a logical hierarchy where grounding is foundational, but current approaches handle them in a coupled manner without clear structure, leading to suboptimal performance.

Method: Proposes D²VLM framework with: 1) “grounding then answering with evidence referencing” paradigm using evidence tokens for event-level visual semantic capture, 2) Factorized Preference Optimization (FPO) algorithm that explicitly incorporates probabilistic temporal grounding modeling into optimization, and 3) a synthetic dataset for factorized preference learning with explicit temporal grounding.

Result: Experiments on various tasks demonstrate clear advantages over existing approaches, showing improved temporal grounding and textual response capabilities for video understanding.

Conclusion: Factorized learning of temporal grounding and textual response with explicit dependency modeling through evidence tokens and FPO significantly improves video-language model performance on event-level perception tasks.

Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D²VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a “grounding then answering with evidence referencing” paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.

[115] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala

Main category: cs.CV

TL;DR: PFP is a neural network that compresses long videos into short contexts while preserving high-frequency details of individual frames, enabling efficient long-term memory for autoregressive video models.

DetailsMotivation: To enable long video modeling with reduced computational cost by compressing lengthy video sequences into compact representations while maintaining visual fidelity of individual frames.

Method: PFP uses a neural network structure with explicit pretraining objective to preserve high-frequency details of single frames at arbitrary temporal positions, compressing 20-second videos into ~5k length contexts.

Result: The model successfully compresses videos while allowing random frame retrieval with perceptually preserved appearances, and can be fine-tuned as memory encoders for autoregressive video models with low context cost and minimal fidelity loss.

Conclusion: PFP provides an effective framework for video compression and long-term memory in autoregressive video models, with trade-offs in neural architecture design that enable practical long-history video processing.

Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.

[116] LiftProj: Space Lifting and Projection-Based Panorama Stitching

Yuan Jia, Ruimin Wu, Rui Song, Jiaojiao Li, Bin Song

Main category: cs.CV

TL;DR: A 3D-based panoramic stitching framework that lifts images to 3D point clouds for global fusion, then projects to cylindrical panorama, reducing ghosting and distortions in complex scenes.

DetailsMotivation: Traditional 2D homography and mesh warping methods fail in real 3D scenes with multiple depth layers and occlusions, causing ghosting, structural bending, and stretching distortions, especially in multi-view and 360° closed-loop stitching scenarios.

Method: 1) Lift each input image to dense 3D point representation in unified coordinate system with global cross-view fusion using confidence metrics; 2) Establish unified 3D projection center and use equidistant cylindrical projection to map fused data to panoramic manifold; 3) Perform hole filling in canvas domain to address unknown regions from viewpoint transitions.
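
For intuition, a minimal numpy sketch of step 2, mapping fused 3D points (already centered on the unified projection center) onto an equidistant cylindrical panorama; axis conventions and image orientation are illustrative assumptions:

```python
import numpy as np

def cylindrical_project(points: np.ndarray, width: int, height: int) -> np.ndarray:
    """Map (N, 3) points around the projection center to panorama pixels:
    azimuth maps linearly to x, elevation linearly to y (equidistant)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    theta = np.arctan2(x, z)                  # azimuth in [-pi, pi]
    phi = np.arctan2(y, np.hypot(x, z))       # elevation in [-pi/2, pi/2]
    u = (theta / (2 * np.pi) + 0.5) * width   # full 360-degree span
    v = (0.5 - phi / np.pi) * height          # up maps to smaller v
    return np.stack([u, v], axis=1)
```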

Result: Experimental evaluations show the method substantially mitigates geometric distortions and ghosting artifacts in scenarios with significant parallax and complex occlusions, yielding more natural and consistent panoramic results.

Conclusion: The framework successfully re-conceptualizes stitching from 2D warping to 3D consistency paradigm, flexibly incorporating various 3D lifting and completion modules to handle complex real-world scenes.

Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.

[117] Lifelong Domain Adaptive 3D Human Pose Estimation

Qucheng Peng, Hongfei Xue, Pu Wang, Chen Chen

Main category: cs.CV

TL;DR: First work to introduce lifelong domain adaptation to 3D human pose estimation, addressing challenges of non-stationary target domains and catastrophic forgetting through a novel GAN framework.

DetailsMotivation: 3D HPE struggles with generalization to diverse real-world scenarios due to reliance on controlled environment data. Existing domain adaptation approaches overlook non-stationary target datasets and catastrophic forgetting when adapting to multiple domains sequentially.

Method: Proposes a novel GAN framework with 3D pose generators, 2D pose discriminator, and 3D pose estimator. Introduces a 3D pose generator paradigm integrating pose-aware, temporal-aware, and domain-aware knowledge to adapt to current domain while preserving previous domain knowledge.

Result: Superior performance demonstrated through extensive experiments on diverse domain adaptive 3D HPE datasets, effectively mitigating domain shifts and combating catastrophic forgetting.

Conclusion: The proposed lifelong domain adaptation framework successfully addresses the challenges of adapting 3D HPE to non-stationary target domains while preserving knowledge from previous domains, representing a significant advancement in making 3D pose estimation more robust for real-world applications.

Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain’s adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

[118] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

Rongji Xun, Junjie Yuan, Zhongjie Wang

Main category: cs.CV

TL;DR: HaineiFRDM is a diffusion-based film restoration framework that enables high-resolution restoration on limited GPU memory using patch-wise training, global-local frequency modules, and global residual connections, outperforming existing open-source methods.

DetailsMotivation: Existing open-source film restoration methods have limited performance due to training with low-quality synthetic data, noisy optical flows, and inability to handle high-resolution films. There's a need for better restoration tools that leverage modern AI capabilities to help human experts restore indistinguishable film defects.

Method: 1) Patch-wise training/testing strategy for high-resolution restoration on 24GB VRAM GPUs; 2) Position-aware Global Prompt and Frame Fusion Modules; 3) Global-local frequency module for consistent texture reconstruction across patches; 4) Two-stage restoration: low-resolution result as global residual to mitigate blocky artifacts; 5) Construction of a comprehensive film restoration dataset with real-degraded films and realistic synthetic data.
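
A minimal numpy sketch of the patch-wise half of this pipeline, tiling a high-resolution frame and averaging overlapping tiles; `restore_fn`, the patch size, and the overlap are illustrative, and the low-resolution global residual and frequency modules are omitted:

```python
import numpy as np

def restore_patchwise(image: np.ndarray, restore_fn,
                      patch: int = 512, overlap: int = 64) -> np.ndarray:
    """Run restore_fn tile-by-tile over an (H, W, C) frame and average
    the overlapping regions to soften seams."""
    h, w, _ = image.shape
    out = np.zeros_like(image, dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    step = patch - overlap
    for top in range(0, h, step):
        for left in range(0, w, step):
            bottom, right = min(top + patch, h), min(left + patch, w)
            out[top:bottom, left:right] += restore_fn(image[top:bottom, left:right])
            weight[top:bottom, left:right] += 1.0
    return out / weight  # every pixel is covered by at least one tile
```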

Result: The model demonstrates superior defect restoration ability over existing open-source methods through comprehensive experimental results. The framework successfully enables high-resolution film restoration on consumer-grade hardware.

Conclusion: HaineiFRDM effectively addresses limitations of existing open-source film restoration methods by leveraging diffusion models’ content understanding, enabling high-resolution restoration on accessible hardware, and providing a practical solution for film restoration experts. The code and dataset will be publicly released.

Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model’s powerful content-understanding ability and help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on a single 24GB-VRAM GPU possible, and design position-aware Global Prompt and Frame Fusion Modules. We also introduce a global-local frequency module to reconstruct consistent textures across different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results demonstrate the superiority of our model in defect restoration over existing open-source methods. Code and the dataset will be released.

[119] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework

Krithika Iyer, Austin Tapp, Athelia Paulli, Gabrielle Dickerson, Syed Muhammad Anwar, Natasha Lepore, Marius George Linguraru

Main category: cs.CV

TL;DR: Deep learning pipeline transforms pediatric T1-weighted MRIs into synthetic CTs for cranial bone segmentation and suture analysis, enabling non-invasive assessment of cranial development without radiation exposure.

DetailsMotivation: CT scans provide detailed cranial and sutural information but involve harmful ionizing radiation for children. MRI is radiation-free but cannot visualize cranial sutures or assess bone density. There's a critical need for non-invasive methods to evaluate pediatric cranial development.

Method: Proposed a deep learning pipeline using domain-specific variational autoencoders to transform T1-weighted MRIs of children (0.2-2 years) into synthetic CTs. The method predicts cranial bone segmentation, generates suture probability heatmaps, and derives direct suture segmentation from the heatmaps.

Result: Synthetic CTs achieved 99% structural similarity and Frechet inception distance of 1.01 relative to real CTs. Skull segmentation attained 85% Dice coefficient across seven cranial bones, and sutures achieved 80% Dice. Equivalence between sCTs and real CTs was confirmed statistically (TOST p < 0.05).
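
For reference, a minimal SciPy sketch of the TOST procedure behind the equivalence claim, assuming paired per-case scores and a chosen equivalence bound; the paper's exact bounds and test variant are not specified here:

```python
import numpy as np
from scipy import stats  # assumes SciPy >= 1.6 for the `alternative` argument

def tost_paired(a, b, bound: float) -> float:
    """Two one-sided t-tests on paired differences: equivalence is declared
    when the mean difference lies inside (-bound, +bound); returns the
    larger of the two one-sided p-values (the TOST p-value)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    p_lower = stats.ttest_1samp(diff, -bound, alternative="greater").pvalue
    p_upper = stats.ttest_1samp(diff, bound, alternative="less").pvalue
    return max(p_lower, p_upper)

# e.g. per-case Dice on sCT vs. real CT with a hypothetical 0.02 margin:
# assert tost_paired(dice_sct, dice_ct, bound=0.02) < 0.05
```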

Conclusion: This is the first pediatric cranial CT synthesis framework enabling suture segmentation from MRI, bridging critical gaps in non-invasive cranial evaluation. The method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs despite MRI’s limited bone depiction.

Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft-tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep-learning-driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Frechet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST, p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI’s limited depiction of bone and sutures. By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non-invasive cranial evaluation.

[120] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall, Matthew O’Neill, Shivangi Sarkar, Lowell Weissman, Eric Hughes, Guido Zarrella

Main category: cs.CV

TL;DR: The paper explores scaling laws for training foundation models on massive-scale electro-optical satellite data, finding that performance remains data-limited even at petascale, with implications for remote sensing AI development.

DetailsMotivation: Current scaling laws for AI models are well-established for natural images with abundant internet data, but poorly understood for high-value domains like remote sensing where data is more limited and specialized. There's a need to develop practical techniques for training foundation models on massive-scale EO datasets that exceed current state-of-the-art by orders of magnitude.

Method: The researchers use over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox to train progressively larger vision transformer (ViT) backbones. They systematically scale model capacity while analyzing performance at petascale, reporting both success and failure modes observed during this massive-scale training.

Result: Even at petascale (over a quadrillion pixels), performance remains consistent with a data-limited regime rather than a model parameter-limited one. The study provides practical insights into scaling behaviors and identifies challenges in bridging domain gaps across additional remote sensing modalities.

Conclusion: The findings should inform data-collection strategies, compute budgets, and optimization schedules for developing frontier-scale remote sensing foundation models. The research demonstrates that current remote sensing AI development is fundamentally data-limited, not model-limited, even at unprecedented scales.

Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model-parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

[121] Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

Main category: cs.CV

TL;DR: EASI provides a comprehensive evaluation framework for assessing multimodal LLMs’ spatial intelligence capabilities, revealing that while GPT-5 shows unprecedented strength, all models still significantly lag behind human performance on spatial reasoning tasks.

DetailsMotivation: Despite remarkable progress in multimodal models, they still exhibit significant limitations in spatial understanding and reasoning - a critical capability for artificial general intelligence in the physical world. With the release of GPT-5, there's a timely need to systematically evaluate leading models' spatial intelligence capabilities.

Method: Proposed EASI (Evaluation of multimodAl LLMs on Spatial Intelligence) with a comprehensive taxonomy of spatial tasks unifying existing benchmarks and newly curated ones. Conducted systematic evaluation across eight key benchmarks using over ten billion total tokens, plus qualitative evaluation on intuitive human scenarios that challenge advanced models.

Result: 1) GPT-5 demonstrates unprecedented strength in spatial intelligence but 2) still falls significantly short of human performance across broad SI-tasks. 3) SI-tasks expose greater model capability deficiencies than non-SI tasks, and 4) proprietary models don’t show decisive advantage on the most difficult tasks.

Conclusion: Spatial intelligence remains a significant challenge for current multimodal models. EASI provides an open-source framework with standardized interfaces and an ongoing leaderboard to accelerate collective progress toward robust spatial intelligence in AI systems.

Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.

[122] Learning to learn skill assessment for fetal ultrasound scanning

Yipei Wang, Qianye Yang, Lior Drukker, Aris T. Papageorghiou, Yipeng Hu, J. Alison Noble

Main category: cs.CV

TL;DR: Proposes a bi-level optimization framework for automated ultrasound skill assessment that predicts skills based on task performance in fetal ultrasound images without manual skill ratings.

DetailsMotivation: Traditional ultrasound skill assessment is subjective and time-intensive, relying on expert supervision. Existing automated methods use supervised learning with predetermined factors, limiting analysis to assumed skill determinants.

Method: A novel bi-level optimization framework with two components: a clinical task predictor and a skill predictor. The networks are jointly optimized by refining both simultaneously to assess skills based on how well tasks are performed on acquired ultrasound images.

Result: Validated on real-world clinical ultrasound videos of fetal head scanning. Results demonstrate feasibility of predicting ultrasound skills by quantifying optimized task performance as a skill indicator.

Conclusion: The proposed framework successfully automates ultrasound skill assessment by linking skill prediction directly to task performance metrics, eliminating the need for manual skill ratings and overcoming limitations of traditional supervised approaches.

Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.

[123] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

Devendra K. Jangid, Ripon K. Saha, Dilshan Godaliyadda, Jing Li, Seok-Jun Lee, Hamid R. Sheikh

Main category: cs.CV

TL;DR: The paper introduces a Feature-to-Image Diffusion (F2IDiff) Foundation Model for Single Image Super-Resolution that uses DINOv2 features instead of text features to provide stricter, hallucination-free conditioning suitable for smartphone photography.

DetailsMotivation: Current T2IDiff-based SISR models cause undesirable hallucinations in smartphone photography where LR images have high fidelity and require minimal generation. Text features are too high-level for subtle textures and small patches, making them unsuitable for smartphone-scale SISR.

Method: Proposes a Feature-to-Image Diffusion (F2IDiff) Foundation Model that uses lower-level DINOv2 features for conditioning instead of text features. These features provide stricter conditioning while being rich descriptors of small patches.
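
A hedged sketch of the conditioning swap: the DINOv2 loader below is the public torch.hub entry point (facebookresearch/dinov2), while the tiny cross-attention denoiser and all shapes are illustrative assumptions rather than the paper's network.

```python
import torch
import torch.nn as nn

# Real public DINOv2 loader; ViT-S/14 emits 384-dim patch tokens.
dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

class CrossAttnDenoiser(nn.Module):
    """Toy denoiser that attends to DINOv2 patch tokens instead of text tokens."""
    def __init__(self, dim=384):
        super().__init__()
        self.proj_in = nn.Conv2d(3, dim, 1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj_out = nn.Conv2d(dim, 3, 1)

    def forward(self, noisy, cond_tokens):
        b, _, h, w = noisy.shape
        q = self.proj_in(noisy).flatten(2).transpose(1, 2)   # (B, HW, dim)
        out, _ = self.attn(q, cond_tokens, cond_tokens)      # condition on image features
        return self.proj_out(out.transpose(1, 2).reshape(b, -1, h, w))

lr_patch = torch.randn(1, 3, 224, 224)                       # toy LR patch (224 = 16 x 14)
with torch.no_grad():
    cond = dino.forward_features(lr_patch)['x_norm_patchtokens']  # (1, 256, 384)
denoised = CrossAttnDenoiser()(torch.randn(1, 3, 56, 56), cond)
```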

Result: The F2IDiff model addresses the shortcomings of T2IDiff models by providing hallucination-free generation suitable for smartphone photography where LR images are high-resolution (≥12MP) and require minimal, faithful enhancement.

Conclusion: Lower-level feature conditioning (DINOv2) is more suitable than text conditioning for smartphone SISR applications, providing stricter control over generation while maintaining rich descriptive power for small patches.

Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM’s conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least 12 MP, whereas SISR networks built on T2IDiff FMs are designed to perform inference on much smaller images (<1 MP). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on an FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower-level features provide stricter conditioning while being rich descriptors of even small patches.

[124] Scaling Spatial Intelligence with Multimodal Foundation Models

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

Main category: cs.CV

TL;DR: SenseNova-SI family scales multimodal foundation models to improve spatial intelligence using 8M diverse data samples, achieving state-of-the-art performance across multiple spatial benchmarks while maintaining strong general multimodal understanding.

DetailsMotivation: Despite progress in multimodal foundation models, they still exhibit surprising deficiencies in spatial intelligence. The work aims to address this gap by scaling up multimodal models specifically for spatial reasoning capabilities.

Method: Built upon established multimodal foundations (Qwen3-VL, InternVL3, Bagel), the authors systematically curated SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. They take a principled approach to constructing high-performing and robust spatial intelligence.

Result: SenseNova-SI achieves unprecedented performance across spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (84.9% on MMBench-En).

Conclusion: The work demonstrates successful scaling of multimodal models for spatial intelligence, analyzes data scaling effects, emergent generalization capabilities, overfitting risks, and validates downstream applications. All models are publicly released as an ongoing project to facilitate further research.

Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

[125] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

Yulong Zou, Bo Liu, Cun-Jing Zheng, Yuan-ming Geng, Siyue Li, Qiankun Zuo, Shuihua Wang, Yudong Zhang, Jin Hong

Main category: cs.CV

TL;DR: A meta-guided multimodal learning framework for brain tumor segmentation that handles incomplete MRI data through adaptive modality fusion and consistency regularization.

DetailsMotivation: Multimodal MRI data is crucial for brain tumor segmentation but often incomplete in clinical practice, creating a need for methods that can effectively utilize available incomplete multimodal information.

Method: Proposes MGML framework with two components: 1) Meta-parameterized adaptive modality fusion (Meta-AMF) that generates adaptive soft-label supervision for coherent multimodal fusion under varying input conditions, and 2) Consistency regularization module to enhance segmentation performance and framework robustness.
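
A minimal sketch of availability-aware fusion in the spirit of Meta-AMF, with the assumptions flagged: the per-modality encoders, softmax gating over an availability mask, and the full-vs-partial consistency term are my illustrative choices, not the paper's exact modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_mod, C = 4, 16                                   # e.g. T1, T1ce, T2, FLAIR
encoders = nn.ModuleList(nn.Conv3d(1, C, 3, padding=1) for _ in range(n_mod))
gate = nn.Linear(n_mod, n_mod)                     # fusion weights from availability

x = torch.randn(2, n_mod, 1, 16, 16, 16)           # toy multimodal MRI volumes
avail = (torch.rand(2, n_mod) > 0.3).float()       # simulate missing modalities

feats = torch.stack([enc(x[:, i]) for i, enc in enumerate(encoders)], dim=1)
w = torch.softmax(gate(avail).masked_fill(avail == 0, -1e9), dim=1)
fused = (feats * w[:, :, None, None, None, None]).sum(1)       # availability-aware fusion

w_full = torch.softmax(gate(torch.ones_like(avail)), dim=1)
full = (feats * w_full[:, :, None, None, None, None]).sum(1)
consistency = F.mse_loss(fused, full.detach())     # pull partial-input fusion toward full
```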

Result: Achieved superior performance on BraTS2020 and BraTS2023 datasets compared to state-of-the-art methods. On BraTS2020, obtained average Dice scores of 87.55 (WT), 79.36 (TC), and 62.67 (ET) across fifteen missing modality combinations.

Conclusion: The MGML framework effectively handles incomplete multimodal MRI data without altering original model architecture, can be integrated into training pipelines, and demonstrates strong performance in brain tumor segmentation with missing modalities.

Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.

[126] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation

Hualin Ye, Bingxi Liu, Jixiang Du, Yu Qin, Ziyi Chen, Hong Zhang

Main category: cs.CV

TL;DR: A novel cross-view geo-localisation system using DINOv2 backbone with convolution adapter, multi-scale channel reallocation, and MoE-enhanced aggregation to handle view-point discrepancies.

DetailsMotivation: Cross-view geo-localisation faces significant challenges due to view-point discrepancies between query and database images, making feature aggregation and alignment difficult.

Method: Three key improvements: 1) DINOv2 backbone with convolution adapter fine-tuning, 2) multi-scale channel reallocation module for spatial representation diversity, 3) improved aggregation module with Mixture-of-Experts routing in cross-attention framework.
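
An illustrative sketch of MoE routing applied to the keys and values of a cross-attention aggregator: a router picks one expert K/V projection per sample. Hard top-1 routing, routing on the mean token, and all sizes are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MoEKVCrossAttention(nn.Module):
    """Learnable queries attend to backbone tokens whose K/V projections are
    chosen per sample by a router (hard top-1 here for simplicity)."""
    def __init__(self, dim=256, n_experts=4, heads=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.k_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.v_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, tokens):
        idx = self.router(tokens.mean(1)).argmax(-1)     # (B,) expert id per sample
        k = torch.stack([self.k_experts[i](t) for i, t in zip(idx.tolist(), tokens)])
        v = torch.stack([self.v_experts[i](t) for i, t in zip(idx.tolist(), tokens)])
        out, _ = self.attn(queries, k, v)
        return out

queries = torch.randn(2, 8, 256)                     # learnable aggregation queries (toy)
tokens = torch.randn(2, 196, 256)                    # backbone patch tokens (toy)
descriptor = MoEKVCrossAttention()(queries, tokens).flatten(1)   # global descriptor
```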

Result: Achieves competitive performance on University-1652 and SUES-200 datasets with fewer trained parameters compared to existing methods.

Conclusion: The proposed system effectively addresses cross-view discrepancies through adaptive feature processing and achieves state-of-the-art performance with parameter efficiency.

Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant view-point discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with a convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.

[127] Kinematic-Based Assessment of Surgical Actions in Microanastomosis

Yan Meng, Daniel Donoho, Marcelle Altshuler, Omar Arnaout

Main category: cs.CV

TL;DR: AI framework for automated action segmentation and skill assessment in microanastomosis surgery using computer vision and machine learning.

DetailsMotivation: Microanastomosis requires precise surgical skills, but current assessment methods are subjective, time-consuming, and inconsistent. There's a need for objective, automated evaluation systems for surgical training.

Method: Three-component AI framework: (1) YOLO+DeepSORT for instrument tip tracking, (2) self-similarity matrix for action boundary detection and unsupervised clustering for segmentation, (3) supervised classification for skill proficiency assessment.
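
The self-similarity idea in a few lines: frame features are compared pairwise, and a checkerboard (Foote-style) novelty kernel slid along the diagonal peaks at action boundaries. This kernel is a common heuristic used here for illustration; the paper's exact boundary detector may differ.

```python
import numpy as np

feats = np.random.randn(200, 64)                    # per-frame instrument-tip features (toy)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
ssm = feats @ feats.T                               # (T, T) self-similarity matrix

L = 8                                               # half-width of the checkerboard kernel
sign = np.r_[np.ones(L), -np.ones(L)]
kernel = np.outer(sign, sign)                       # +1 within-segment, -1 across-segment
novelty = np.array([(ssm[t - L:t + L, t - L:t + L] * kernel).sum()
                    for t in range(L, len(feats) - L)])
boundaries = np.where(novelty > novelty.mean() + 2 * novelty.std())[0] + L
```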

Result: Achieved 92.4% frame-level action segmentation accuracy and 85.5% overall skill classification accuracy on 58 expert-rated microanastomosis videos.

Conclusion: The AI framework provides objective, real-time feedback for microsurgical training, enabling standardized, data-driven assessment protocols in high-stakes surgical environments.

Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging a self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.

[128] U-Net-Like Spiking Neural Networks for Single Image Dehazing

Huibin Li, Haoran Liu, Mingzhe Liu, Yulong Xiao, Peng Li, Guibin Zan

Main category: cs.CV

TL;DR: DehazeSNN: A novel image dehazing method using Spiking Neural Networks with U-Net architecture that achieves state-of-the-art performance with reduced computational cost.

DetailsMotivation: Traditional dehazing methods rely on atmospheric scattering models, while deep learning approaches (CNNs and Transformers) have limitations: CNNs struggle with long-range dependencies, and Transformers require heavy computational resources. There's a need for efficient dehazing that balances performance and computational efficiency.

Method: Proposes DehazeSNN, a U-Net-like architecture integrated with Spiking Neural Networks (SNNs). Uses Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to enhance cross-channel communication and capture multi-scale image features while managing local and long-range dependencies efficiently.
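
For context, a toy leaky-integrate-and-fire (LIF) neuron, the basic unit behind SNN blocks like the paper's OLIFBlock; the orthogonal variant itself is not reproduced here.

```python
import torch

def lif_forward(currents, tau=2.0, v_th=1.0):
    """currents: (T, B, C) input currents per time step; returns binary spikes."""
    v = torch.zeros_like(currents[0])
    spikes = []
    for x in currents:
        v = v + (x - v) / tau           # leaky integration of the membrane potential
        s = (v >= v_th).float()         # emit a spike when the threshold is crossed
        v = v * (1.0 - s)               # hard reset of neurons that fired
        spikes.append(s)                # (training needs a surrogate gradient; omitted)
    return torch.stack(spikes)

spikes = lif_forward(torch.rand(4, 2, 8))   # 4 time steps, batch 2, 8 channels
```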

Result: Extensive experiments show DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with smaller model size and fewer multiply-accumulate operations (reduced computational burden).

Conclusion: DehazeSNN provides an effective solution for image dehazing that overcomes limitations of both CNNs and Transformers, offering superior performance with computational efficiency through SNN integration and novel architectural components.

Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.

[129] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

Main category: cs.CV

TL;DR: T2VAttack: First comprehensive study of adversarial attacks on Text-to-Video diffusion models, revealing vulnerabilities through semantic and temporal attacks via prompt modifications.

DetailsMotivation: Despite rapid advancements in Text-to-Video (T2V) diffusion models for generating high-quality videos from text, their vulnerability to adversarial attacks remains largely unexplored. The paper aims to investigate these security vulnerabilities from both semantic and temporal perspectives.

Method: Proposes T2VAttack framework with two attack objectives: semantic (video-text alignment) and temporal (temporal dynamics). Two attack methods: T2VAttack-S (identifies critical words and replaces with synonyms via greedy search) and T2VAttack-I (iteratively inserts optimized words with minimal perturbation). Evaluated on state-of-the-art T2V models including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo.
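
A sketch of the greedy word-substitution loop in the spirit of T2VAttack-S. The scoring function and synonym source are placeholders (in the real attack, scoring would generate a video and measure semantic/temporal degradation), and the real method first ranks words by importance.

```python
def attack_score(prompt: str) -> float:
    """Placeholder objective: stand-in only so the loop runs."""
    return float(len(prompt) % 7)

def greedy_attack(prompt: str, synonyms: dict[str, list[str]]) -> str:
    words = prompt.split()
    for i, w in enumerate(words):       # (the paper ranks words by importance first)
        best, best_s = w, attack_score(" ".join(words))
        for cand in synonyms.get(w, []):
            trial = " ".join(words[:i] + [cand] + words[i + 1:])
            if (s := attack_score(trial)) > best_s:
                best, best_s = cand, s
        words[i] = best
    return " ".join(words)

adv = greedy_attack("a dog runs across the field",
                    {"runs": ["dashes", "sprints"], "field": ["meadow"]})
```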

Result: Experiments show that even minor prompt modifications (single word substitution or insertion) can cause substantial degradation in both semantic fidelity and temporal dynamics. This reveals critical vulnerabilities in current T2V diffusion models to adversarial attacks.

Conclusion: Current T2V diffusion models are vulnerable to adversarial attacks through subtle prompt modifications. The study highlights the need for improved robustness in these models and provides a comprehensive framework for evaluating their security vulnerabilities.

Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.

[130] Effective Online Exam Proctoring by Combining Lightweight Face Detection and Deep Recognition

Xu Yang, Juantao Zhong, Daoyuan Wu, Xiao Yi, Jimmy H. M. Lee, Tan Lee, Peng Han

Main category: cs.CV

TL;DR: iExam is an online exam proctoring system that uses real-time face detection and deep face recognition to monitor student presence and detect cheating behaviors during Zoom-based exams.

DetailsMotivation: Online exams via platforms like Zoom are widespread but ensuring exam integrity is challenging due to difficulty monitoring multiple video feeds in real time. Current systems lack efficient monitoring and post-exam analysis capabilities.

Method: Combines lightweight real-time face detection with deep face recognition for post-exam analysis. System monitors student presence, detects abnormal behaviors (face disappearance, rotation, identity substitution), uses enhanced OCR for dynamic Zoom name tags, and is optimized for resource-efficient training/inference on standard teacher devices.
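
A minimal sketch of the lightweight real-time stage using OpenCV's stock Haar cascade (the paper's detector may differ); it raises an alert after sustained face absence. The video filename is hypothetical.

```python
import cv2

# Haar cascades ship with OpenCV; this stands in for the lightweight detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("exam_gallery.mp4")     # hypothetical recorded Zoom gallery video
missing = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    missing = 0 if len(faces) else missing + 1
    if missing > 30:                           # ~1 s of absence at 30 fps
        print("alert: possible face disappearance")
        missing = 0
cap.release()
```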

Result: Achieves 90.4% accuracy in real-time face detection and 98.4% accuracy in post-exam face recognition with low overhead. Demonstrates practicality and effectiveness for online exam proctoring.

Conclusion: iExam provides a practical and effective solution for online exam proctoring by combining real-time monitoring with post-exam analysis, addressing key challenges in exam integrity for video-conferencing-based assessments.

Abstract: Online exams conducted via video conferencing platforms such as Zoom have become widespread, yet ensuring exam integrity remains challenging due to the difficulty of monitoring multiple video feeds in real time. We present iExam, an online exam proctoring and analysis system that combines lightweight real-time face detection with deep face recognition for post-exam analysis. iExam assists invigilators by monitoring student presence during exams and identifies abnormal behaviors, such as face disappearance, face rotation, and identity substitution, from recorded videos. The system addresses three key challenges: (i) efficient real-time video capture and analysis, (ii) automated student identity labeling using enhanced OCR on dynamic Zoom name tags, and (iii) resource-efficient training and inference on standard teacher devices. Extensive experiments show that iExam achieves 90.4% accuracy in real-time face detection and 98.4% accuracy in post-exam recognition with low overhead, demonstrating its practicality and effectiveness for online exam proctoring.

[131] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

Yuang Jia, Jinlong Wang, Jiayi Zhao, Chunlam Li, Shunzhou Wang, Wei Gao

Main category: cs.CV

TL;DR: A novel view extrapolation method for autonomous driving that uses deformable 4D Gaussians and iterative diffusion refinement without expensive sensors or annotations.

DetailsMotivation: Existing view extrapolation methods rely on expensive sensors (LiDAR) or labor-intensive annotations (3D bounding boxes, lane markings), limiting real-world deployment. The paper aims to develop a solution using only images and optional camera poses.

Method: 1) Estimate global static and per-frame dynamic point clouds from images, fusing them into unified representation. 2) Use deformable 4D Gaussian framework for scene reconstruction. 3) Train video diffusion model on degraded renders from initial 4D Gaussian. 4) Iteratively refine progressively shifted Gaussian renderings using diffusion model, then use enhanced results to retrain 4DGS. 5) Repeat until reaching target extrapolated viewpoints.
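
A schematic sketch of the alternating refine loop. Every component below is a stub so the control flow runs end to end; none of this is the authors' implementation.

```python
def fit_4dgs(views):             return {"views": list(views)}            # stub
def render_shifted(g, shift):    return [f"{v}@{shift:.2f}" for v in g["views"]]
def diffusion_refine(renders):   return [r + "+refined" for r in renders]

def progressive_extrapolation(images, target_shift=1.0, steps=4):
    gaussians = fit_4dgs(images)                      # initial deformable 4DGS
    for k in range(1, steps + 1):
        shift = target_shift * k / steps              # progressively larger offsets
        renders = render_shifted(gaussians, shift)    # degraded renders at novel views
        refined = diffusion_refine(renders)           # video diffusion cleans artifacts
        gaussians = fit_4dgs(list(images) + refined)  # fold enhanced views back into 4DGS
    return gaussians

model = progressive_extrapolation(["img0", "img1"], target_shift=0.5)
```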

Result: The method produces higher-quality images at novel extrapolated viewpoints compared to baselines, demonstrating effectiveness without requiring expensive sensors or annotations.

Conclusion: The proposed approach enables effective view extrapolation in autonomous driving scenarios using only images and optional camera poses, overcoming limitations of prior methods that depend on expensive sensors or annotations.

Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

[132] HIDFlowNet: A Flow-Based Deep Network for Hyperspectral Image Denoising

Qizhou Wang, Li Pang, Xiangyong Cao, Zhiqiang Tian, Deyu Meng

Main category: cs.CV

TL;DR: HIDFlowNet: A flow-based hyperspectral image denoising network that learns conditional distributions to address ill-posed nature of HSI denoising and decouple low/high-frequency learning.

DetailsMotivation: Existing DL-based HSI denoising methods have two main limitations: 1) They treat denoising as deterministic mapping (ignoring ill-posed nature where multiple clean HSIs can produce same noisy HSI), leading to over-smoothing; 2) They fail to properly decouple learning of low-frequency and high-frequency components, where noise resides in high-frequency.

Method: Proposes HIDFlowNet based on generative flow model with invertible decoder and conditional encoder. Invertible decoder uses stacked invertible conditional blocks (ICBs) to capture local high-frequency details. Conditional encoder uses down-sampling and transformers to extract global low-frequency information. The network learns conditional distribution of clean HSI given noisy HSI, enabling sampling of diverse clean outputs.

Result: Extensive experiments on simulated and real HSI datasets show HIDFlowNet achieves better or comparable results compared with state-of-the-art methods.

Conclusion: HIDFlowNet effectively addresses ill-posed nature of HSI denoising by learning conditional distributions and properly decoupling low/high-frequency learning, overcoming limitations of deterministic DL approaches and achieving competitive performance.

Abstract: Hyperspectral image (HSI) denoising is essentially ill-posed since a noisy HSI can be degraded from multiple clean HSIs. However, existing deep learning (DL)-based approaches only restore one clean HSI from the given noisy HSI with a deterministic mapping, thus ignoring the ill-posed issue and always resulting in an over-smoothing problem. Additionally, these DL-based methods often neglect that noise is part of the high-frequency component and their network architectures fail to decouple the learning of low-frequency and high-frequency. To alleviate these issues, this paper proposes a flow-based HSI denoising network (HIDFlowNet) to directly learn the conditional distribution of the clean HSI given the noisy HSI and thus diverse clean HSIs can be sampled from the conditional distribution. Overall, our HIDFlowNet is induced from the generative flow model and comprises an invertible decoder and a conditional encoder, which can explicitly decouple the learning of low-frequency and high-frequency information of HSI. Specifically, the invertible decoder is built by stacking a succession of invertible conditional blocks (ICBs) to capture the local high-frequency details. The conditional encoder utilizes down-sampling operations to obtain low-resolution images and uses transformers to capture correlations over a long distance so that global low-frequency information can be effectively extracted. Extensive experiments on simulated and real HSI datasets verify that our proposed HIDFlowNet can obtain better or comparable results compared with other state-of-the-art methods.

[133] Anomaly detection in satellite imagery through temporal inpainting

Bertrand Rouet-Leduc, Claudia Hulbert

Main category: cs.CV

TL;DR: Deep learning inpainting model detects surface changes from satellite time series by predicting expected appearance and identifying anomalies as prediction-observation discrepancies, achieving 3x better sensitivity than traditional methods.

DetailsMotivation: Current satellite-based surface change detection faces challenges from atmospheric noise, seasonal variations, and sensor artifacts, limiting sensitivity for rapid disaster response and environmental monitoring.

Method: Train an inpainting model based on SATLAS foundation model to reconstruct the last frame of Sentinel-2 time series from preceding acquisitions, using globally distributed training data across diverse climates and land cover types.
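
The core idea in a few lines: predict the newest frame from history and flag pixels where observation departs from prediction. The predictor here is a trivial temporal median standing in for the learned SATLAS-based inpainter (the median also happens to be one of the paper's baselines).

```python
import numpy as np

series = np.random.rand(10, 64, 64, 4)            # (T, H, W, bands) Sentinel-2-like stack
history, latest = series[:-1], series[-1]
predicted = np.median(history, axis=0)            # stand-in for the learned inpainter
anomaly = np.abs(latest - predicted).mean(axis=-1)          # per-pixel discrepancy map
change_mask = anomaly > anomaly.mean() + 3 * anomaly.std()  # candidate surface changes
```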

Result: Method detects earthquake-triggered surface ruptures from 2023 Turkey-Syria earthquake with higher sensitivity/specificity than temporal median or Reed-Xiaoli detectors, achieving ~3x lower detection thresholds than baselines.

Conclusion: Deep learning temporal redundancy approach enables unprecedented sensitivity for surface change detection, providing path to automated global-scale monitoring using freely available multi-spectral satellite data.

Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.

[134] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention

Jun Ding, Shang Gao

Main category: cs.CV

TL;DR: GCA-ResUNet: A lightweight CNN-based medical image segmentation framework with Grouped Coordinate Attention module that improves global context modeling while maintaining computational efficiency.

DetailsMotivation: U-Net based CNNs struggle with long-range dependencies in multi-organ segmentation and low-contrast regions due to local receptive fields. Transformers address this but require high computational resources and large datasets, making them impractical for clinical deployment.

Method: Proposes GCA-ResUNet with Grouped Coordinate Attention module that: 1) Decouples channel-wise context modeling into multiple groups to handle semantic heterogeneity, 2) Integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes, 3) Maintains plug-and-play design for CNN backbones.
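
A sketch of grouped, direction-aware attention in the spirit of GCA: channels are split into groups, each pooled along H and along W, then reweighted by directional gates. The convs, gating, and group count are simplifying assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GroupedCoordAttention(nn.Module):
    """Each channel group is pooled along both axes and reweighted by
    direction-aware gates (a simplification of GCA)."""
    def __init__(self, channels=32, groups=4):
        super().__init__()
        self.groups = groups
        g = channels // groups
        self.fh = nn.Conv2d(g, g, 1)   # gate for the horizontally pooled descriptor
        self.fw = nn.Conv2d(g, g, 1)   # gate for the vertically pooled descriptor

    def forward(self, x):
        b, c, h, w = x.shape
        xg = x.view(b * self.groups, c // self.groups, h, w)
        ah = torch.sigmoid(self.fh(xg.mean(dim=3, keepdim=True)))  # pool along W -> (.., H, 1)
        aw = torch.sigmoid(self.fw(xg.mean(dim=2, keepdim=True)))  # pool along H -> (.., 1, W)
        return (xg * ah * aw).view(b, c, h, w)

y = GroupedCoordAttention()(torch.randn(2, 32, 16, 16))
```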

Result: Achieves Dice scores of 86.11% on Synapse and 92.64% on ACDC benchmarks, outperforming CNN and Transformer methods including Swin-UNet and TransUNet. Particularly effective for small anatomical structures with complex boundaries.

Conclusion: GCA-ResUNet provides favorable trade-off between accuracy and computational efficiency, offering practical solution for clinical deployment in resource-constrained environments while maintaining strong segmentation performance.

Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet. In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.

[135] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Haotang Li, Zhenyu Qi, Hao Qin, Huanrui Yang, Sen He, Kebin Peng

Main category: cs.CV

TL;DR: GASeg is a self-supervised semantic segmentation framework that addresses appearance ambiguities by integrating geometric topological information through a Differentiable Box-Counting module, Topological Augmentation, and GALoss for cross-modal alignment.

DetailsMotivation: Self-supervised semantic segmentation methods often fail due to over-reliance on unstable appearance-based features like shadows, glare, and local textures, which are ambiguous in real-world scenarios.

Method: Proposes GASeg with three key components: 1) Differentiable Box-Counting (DBC) module to extract multi-scale topological statistics from geometric and appearance features, 2) Topological Augmentation (TopoAug) using adversarial morphological operators to simulate real-world ambiguities, and 3) GALoss for explicit cross-modal alignment between geometric and appearance features.
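
A sketch of differentiable box-counting on a soft mask: occupancy per box via max-pooling at several scales, with the fractal dimension estimated from a log-log least-squares slope. This is a common formulation, assumed here; the paper's DBC may differ in detail.

```python
import torch
import torch.nn.functional as F

def diff_box_count(mask, scales=(2, 4, 8, 16)):
    """mask: (B, 1, H, W) soft predictions in [0, 1]; returns a per-sample
    fractal-dimension estimate from a log-log least-squares fit."""
    counts = []
    for s in scales:
        occ = F.max_pool2d(mask, kernel_size=s, stride=s)   # soft box occupancy
        counts.append(occ.sum(dim=(1, 2, 3)) + 1e-6)        # ~ number of occupied boxes
    log_n = torch.log(torch.stack(counts, dim=1))           # (B, S)
    log_inv_s = torch.log(torch.tensor([1.0 / s for s in scales]))
    xs = log_inv_s - log_inv_s.mean()
    # slope of log N(s) against log(1/s) approximates the box-counting dimension
    return ((log_n - log_n.mean(1, keepdim=True)) * xs).sum(1) / (xs * xs).sum()

dim_est = diff_box_count(torch.rand(2, 1, 64, 64))
```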

Result: Achieves state-of-the-art performance on four benchmarks including COCO-Stuff, Cityscapes, and PASCAL, demonstrating effectiveness in handling appearance ambiguities through geometric-topological integration.

Conclusion: Bridging geometry and appearance via topological information is effective for robust self-supervised semantic segmentation, addressing limitations of appearance-only approaches in ambiguous scenarios.

Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.

[136] Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

Main category: cs.CV

TL;DR: Zoomer is a visual prompting framework that improves multimodal LLM performance by providing token-efficient, detail-preserving image representations through region-adaptive attention and flexible token allocation.

DetailsMotivation: Current MLLMs like GPT-4o, Gemini Pro, and Claude 3.5 often hallucinate in real-world scenarios, especially with small objects or fine spatial context, due to lack of region-adaptive attention and inflexible token budgets causing uniform downsampling and information loss.

Method: Zoomer integrates three components: (1) prompt-aware emphasis module to highlight semantically relevant regions, (2) spatial-preserving orchestration schema to maintain object relationships, and (3) budget-aware strategy to adaptively allocate tokens between global context and local details.

Result: Extensive experiments on nine benchmarks with three commercial MLLMs show Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%.

Conclusion: Zoomer establishes a principled methodology for robust, resource-aware multimodal understanding in black-box settings where model internals are inaccessible.

Abstract: Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real world scenarios especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to adaptively allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.

[137] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge

Tae Ha Park, Simone D’Amico

Main category: cs.CV

TL;DR: Novel pipeline for 3D spacecraft reconstruction using 3D Gaussian Splatting with Sun position priors to handle dynamic space lighting, enabling both geometry recovery and camera pose estimation.

DetailsMotivation: Traditional 3DGS requires static scenes but spaceborne imagery during RPO has dynamic lighting. Accurate photometric rendering is crucial for downstream pose estimation tasks in spacecraft operations.

Method: Use 3D Gaussian Splatting to represent spacecraft geometry and appearance, incorporating Sun position priors (estimated by servicer spacecraft) into training pipeline to handle changing illumination conditions.

Result: 3DGS models learn to adapt to rapidly changing space illumination, reflect global shadowing and self-occlusion, improving photometric quality for better pose estimation.

Conclusion: Incorporating Sun position knowledge into 3DGS training enables accurate 3D reconstruction and photometric rendering of spacecraft under dynamic space lighting, supporting RPO operations.

Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target’s geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun’s position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.

[138] Bridging the Perception-Cognition Gap:Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

Hao Wu, Hui Li, Yiyun Su

Main category: cs.CV

TL;DR: Hilbert-VLM: A two-stage fusion framework using Hilbert space-filling curves with SAM2 architecture for 3D medical image analysis, achieving 82.35% Dice score on BraTS2021 segmentation and 78.85% classification accuracy.

DetailsMotivation: VLMs show promise for automated medical diagnosis but struggle with 3D multimodal medical images - specifically integrating complementary information and detecting subtle pathological features.

Method: Two-stage fusion framework: 1) HilbertMed-SAM module for lesion segmentation using Hilbert space-filling curves in Mamba SSM scanning to preserve 3D spatial locality, plus HMCA mechanism and scale-aware decoder; 2) Prompt enhancement module unifies masks and textual attributes to guide VLM for disease classification.
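
To illustrate the scanning idea: flattening a 2D token grid in Hilbert order keeps spatially adjacent tokens close in the sequence, unlike a raster scan. `d2xy` is the standard Hilbert-curve conversion; the Mamba block itself is not reproduced.

```python
import torch

def d2xy(n: int, d: int):
    """Index d along an n x n Hilbert curve -> (x, y); n must be a power of two."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                                 # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

n = 8
order = [d2xy(n, d) for d in range(n * n)]                    # curve index -> (x, y)
feat = torch.randn(1, n, n, 32)                               # (B, H, W, C) token grid
seq = torch.stack([feat[:, y, x] for x, y in order], dim=1)   # Hilbert-ordered (B, H*W, C)
```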

Result: Achieves 82.35% Dice score on BraTS2021 segmentation benchmark and 78.85% diagnostic classification accuracy (ACC), demonstrating improved accuracy for medical VLM-based analysis.

Conclusion: Hilbert-VLM offers substantial potential to improve accuracy and reliability of medical VLM-based analysis by effectively handling 3D multimodal medical images through systematic architectural redesign and enhanced prompting.

Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.

[139] On Exact Editing of Flow-Based Diffusion Models

Zixiang Li, Yue Song, Jianing Peng, Ting Liu, Jun Huang, Xiaochao Qu, Luoqi Liu, Wei Wang, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: CVC introduces a dual-perspective velocity correction framework for flow-based diffusion editing that decomposes latent evolution into structure-preserving and semantically-guided branches, with posterior-consistent updates to address velocity errors and maintain fidelity.

DetailsMotivation: Current flow-based diffusion editing methods suffer from accumulated velocity errors in latent trajectories, leading to semantic inconsistency and loss of structural fidelity during image transformations.

Method: CVC reformulates flow-based editing as distribution transformation with dual-perspective velocity conversion: structure-preserving branch maintains source consistency, semantically-guided branch drives controlled deviation toward target. Uses posterior-consistent updates derived from Empirical Bayes Inference and Tweedie correction to compensate for velocity errors.
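
For reference, the classical Tweedie identity underlying such posterior corrections, stated as the standard empirical-Bayes result in generic notation; the paper's specific update for conditional velocity fields is not reproduced here.

```latex
% For additive Gaussian corruption x_t = x_0 + \sigma_t \epsilon,
% \epsilon \sim \mathcal{N}(0, I), Tweedie's formula gives the posterior mean
\mathbb{E}[x_0 \mid x_t] \;=\; x_t + \sigma_t^{2}\, \nabla_{x_t} \log p(x_t).
```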

Result: CVC achieves stable and interpretable latent dynamics with faithful reconstruction and smooth local semantic conversion. Demonstrates superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

Conclusion: The proposed Conditioned Velocity Correction framework provides a principled solution to velocity error accumulation in flow-based diffusion editing, enabling more accurate and stable image transformations through mathematically grounded error compensation.

Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

[140] FitControler: Toward Fit-Aware Virtual Try-On

Lu Yang, Yicheng Liu, Yanan Li, Xiang Bai, Hao Lu

Main category: cs.CV

TL;DR: FitControler is a plug-in module for virtual try-on models that enables customized garment fit control by generating fit-aware layouts and injecting them into existing VTON systems.

DetailsMotivation: Current VTON models focus on rendering garment details but neglect garment fit, which is crucial for holistic style coordination. Garment fit determines how clothing aligns with the body and is fundamental in fashion design.

Method: FitControler has two main components: 1) a fit-aware layout generator that redraws body-garment layouts using garment-agnostic representations, and 2) a multi-scale fit injector that delivers layout cues for layout-driven VTON. The system is trained on Fit4Men dataset with 13,000 body-garment pairs of different fits.

Result: Extensive experiments show FitControler can work with various VTON models and achieve accurate fit control. The method is validated using newly introduced fit consistency metrics on the comprehensive Fit4Men dataset.

Conclusion: FitControler successfully addresses the gap in fit-aware virtual try-on by providing a learnable plug-in that enables customized fit control while maintaining compatibility with existing VTON models.

Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style – garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.

[141] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression

Huanxiong Liang, Yunuo Chen, Yicheng Pan, Sixian Wang, Jincheng Dai, Guo Lu, Wenjun Zhang

Main category: cs.CV

TL;DR: Structure-guided 2D Gaussian Splatting with adaptive quantization and geometry regularization improves rate-distortion efficiency while maintaining fast decoding.

DetailsMotivation: Existing 2D Gaussian Splatting methods allocate representation capacity and parameter precision without considering image structure, limiting rate-distortion efficiency at low bitrates.

Method: Three key components: 1) Structure-guided initialization for localized Gaussian distribution, 2) Adaptive bitwidth quantization of covariance parameters based on region complexity, 3) Geometry-consistent regularization aligning Gaussian orientations with local gradients.

Result: Achieves 43.44% BD-rate reduction on Kodak and 29.91% on DIV2K compared to baseline GSImage, while maintaining over 1000 FPS decoding speed.

Conclusion: The proposed structure-guided allocation principle significantly improves 2DGS representation power and rate-distortion performance while preserving native decoding speed.

Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.

[142] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao

Main category: cs.CV

TL;DR: MF-RSVLM is a multi-feature fusion remote sensing vision-language model that addresses visual forgetting and fine-grained feature extraction challenges in remote sensing VLMs through multi-scale representations and recurrent visual feature injection.

DetailsMotivation: Existing vision-language models struggle with remote sensing images due to differences from natural images, failing to extract fine-grained visual features and suffering from visual forgetting during deep language processing.

Method: MF-RSVLM uses multi-scale visual representations to combine global context with local details, and employs a recurrent visual feature injection scheme to keep the language model grounded in visual evidence and reduce forgetting.
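
A sketch of what recurrent visual injection can look like: visual tokens are re-fused into the hidden states at every decoder layer via cross-attention, so later layers stay grounded. The layer structure and sizes are my assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class InjectedDecoderLayer(nn.Module):
    """Decoder layer that cross-attends to visual tokens at every depth, so the
    language stream is re-grounded in visual evidence layer after layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h, vis):
        h = h + self.self_attn(h, h, h)[0]
        h = h + self.vis_attn(h, vis, vis)[0]    # re-inject visual features at this layer
        return h + self.ffn(h)

layers = nn.ModuleList(InjectedDecoderLayer() for _ in range(4))
h, vis = torch.randn(1, 12, 256), torch.randn(1, 196, 256)   # text states, visual tokens
for layer in layers:
    h = layer(h, vis)                            # visual tokens reused at every depth
```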

Result: Extensive experiments on diverse remote sensing benchmarks show state-of-the-art or highly competitive performance across classification, image captioning, and VQA tasks.

Conclusion: MF-RSVLM effectively addresses the challenges of remote sensing VLMs through multi-feature fusion and recurrent visual injection, achieving strong performance across multiple remote sensing tasks.

Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision–Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.

[143] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, Wenqiang Zhang

Main category: cs.CV

TL;DR: RSAgent is an agentic MLLM that performs text-guided object segmentation through multi-turn reasoning and tool interactions, achieving state-of-the-art performance on segmentation benchmarks.

DetailsMotivation: Current text-guided segmentation methods treat it as one-shot grounding, which limits verification, refocusing, and refinement when initial localization is wrong. There's a need for iterative reasoning and refinement capabilities.

Method: RSAgent is an agentic MLLM that interleaves reasoning and action via multi-turn tool invocations. It queries a segmentation toolbox, observes visual feedback, and revises spatial hypotheses using historical observations. The model is trained with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with task-specific rewards.
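
A schematic of the multi-turn reason-act loop; `mllm` and `toolbox` are hypothetical interfaces standing in for the paper's MLLM and segmentation toolbox, so this shows the control flow rather than RSAgent's actual API:

```python
def segment_with_agent(image, query, mllm, toolbox, max_turns=5):
    """Interleave reasoning and tool calls until the mask is verified."""
    history, mask = [], None
    for _ in range(max_turns):
        thought, tool_call = mllm.step(image, query, history)  # reason
        mask, feedback = toolbox.run(image, tool_call)         # act + observe
        history.append((thought, tool_call, feedback))
        if mllm.is_satisfied(feedback):   # verified; otherwise refocus, refine
            break
    return mask
```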

Result: RSAgent achieves 66.5% gIoU on ReasonSeg test (9% improvement over Seg-Zero-7B) and 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.

Conclusion: The agentic approach with multi-turn reasoning and tool interactions significantly improves text-guided segmentation performance by enabling verification, refocusing, and iterative refinement capabilities.

Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.

[144] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

Mustafa Munir, Md Mostafijur Rahman, Kartikeya Bhardwaj, Paul Whatmough, Radu Marculescu

Main category: cs.CV

TL;DR: PipeFlow enables scalable long-form video editing by skipping low-motion frames, using pipelined parallel processing of video segments, and neural interpolation, achieving up to 31.7x speedup over existing methods.

DetailsMotivation: Long-form video editing faces computational challenges due to high costs from joint editing and DDIM inversion across extended sequences, which scale exponentially with video length.

Method: Three key innovations: 1) Motion analysis using SSIM and Optical Flow to skip editing of low-motion frames, 2) Pipelined task scheduling that splits videos into segments for parallel DDIM inversion and joint editing based on GPU memory, 3) Neural network-based interpolation to smooth border frames between segments and interpolate skipped frames.
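
A small sketch of the motion gate in innovation 1, using SSIM from scikit-image and Farneback optical flow from OpenCV on grayscale uint8 frames; the thresholds are placeholders, not the paper's settings:

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def is_low_motion(prev_gray, cur_gray, ssim_thresh=0.95, flow_thresh=0.5):
    """Skip editing a frame when it is structurally similar to its
    predecessor and the mean optical-flow magnitude is small."""
    ssim = structural_similarity(prev_gray, cur_gray)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_mag = np.linalg.norm(flow, axis=-1).mean()
    return ssim > ssim_thresh and mean_mag < flow_thresh
```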

Result: PipeFlow achieves up to 9.6x speedup compared to TokenFlow and 31.7x speedup over Diffusion Motion Transfer (DMT), with editing time scaling linearly rather than exponentially with video length.

Conclusion: PipeFlow provides a scalable solution for long-form video editing that can theoretically handle infinitely long videos without the growing per-frame computational overhead of existing methods.

Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow’s editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).

[145] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

Xinran Qin, Yuhui Quan, Ruotao Xu, Hui Ji

Main category: cs.CV

TL;DR: A reinforcement learning-based trainable anisotropic diffusion framework for image denoising that outperforms traditional diffusion methods and competes with deep CNN approaches.

DetailsMotivation: Traditional anisotropic diffusion methods use explicit diffusion operators that are not well adapted to complex image structures, limiting their performance compared to modern learning-based approaches. The authors aim to create a more adaptive diffusion-based denoiser.

Method: A trainable anisotropic diffusion framework using reinforcement learning, where the denoising process is modeled as a series of naive diffusion actions with order learned by deep Q-learning. This creates a stochastic anisotropic diffusion process adaptive to different image structures.
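
A toy rendering of the idea: a Q-network picks one naive diffusion action per iteration, and the rollout composes an image-adaptive diffusion sequence. The separable blur operators stand in for explicit diffusion steps, and the Q-network is untrained here (the paper learns it with deep Q-learning):

```python
import torch
import torch.nn as nn

def make_blur(sigma):
    """Separable Gaussian blur acting as one naive diffusion action."""
    k = torch.exp(-torch.arange(-2, 3, dtype=torch.float32) ** 2 / (2 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, 1, 5)
    return lambda x: nn.functional.conv2d(
        nn.functional.conv2d(x, k, padding=(0, 2)),
        k.transpose(2, 3), padding=(2, 0))

ACTIONS = [make_blur(s) for s in (0.5, 1.0, 2.0)]
q_net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.ReLU(),
                      nn.Linear(64, len(ACTIONS)))   # state -> Q-values

x = torch.rand(1, 1, 32, 32)          # noisy image as the state
for _ in range(8):                    # greedy rollout of the learned policy
    a = q_net(x).argmax(dim=-1).item()
    x = ACTIONS[a](x)                 # apply the selected diffusion action
```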

Result: The proposed method outperforms existing diffusion-based methods and competes with representative deep CNN-based methods for removing three types of common noise.

Conclusion: The reinforcement learning-based approach enables more adaptive anisotropic diffusion that bridges the gap between traditional diffusion methods and modern learning-based approaches, achieving competitive denoising performance.

Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions whose order is learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations together compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, improving over traditional approaches. The proposed denoiser is applied to removing three common types of noise. The experiments show that it outperforms existing diffusion-based methods and competes with representative deep CNN-based methods.

[146] Pathology Context Recalibration Network for Ocular Disease Recognition

Zunjie Xiao, Xiaoqing Zhang, Risa Higashita, Jiang Liu

Main category: cs.CV

TL;DR: PCRNet: A pathology context and expert experience guided deep learning framework for ocular disease recognition with improved performance and interpretability.

DetailsMotivation: Current DNNs for ocular disease diagnosis ignore clinical pathology context and expert experience priors, limiting both performance and decision-making interpretability.

Method: Developed Pathology Recalibration Module (PRM) for pathology context prior via pixel-wise context compression and distribution concentration operators. Applied Expert Prior Guidance Adapter (EPGA) to highlight significant pixel-wise regions using expert experience. Combined PRM and EPGA into PCRNet with Integrated Loss (IL) considering sample-wise loss distributions and label frequencies.

Result: PCRNet with IL outperforms state-of-the-art attention-based networks and advanced loss methods on three ocular disease datasets, with visualization explaining PRM and EPGA’s decision-making influence.

Conclusion: Incorporating pathology context and expert experience priors into DNNs significantly improves ocular disease recognition performance and provides better interpretability for clinical decision-making.

Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) achieve good ocular disease recognition results, they often overlook the clinical pathology context and expert experience priors that could improve recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) that leverages the pathology context prior by combining a well-designed pixel-wise context compression operator with a pathology distribution concentration operator; we then apply a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into a modern DNN, we construct PCRNet for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. Extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA and how they affect the decision-making process of DNNs.

[147] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao

Main category: cs.CV

TL;DR: Proposes balanced hierarchical contrastive loss with decoupled learning in DETR to address hierarchical label imbalance and localization interference in fine-grained remote sensing detection.

DetailsMotivation: Fine-grained remote sensing datasets use hierarchical labels, but embedding semantic hierarchy into representation learning is challenging. Previous methods using supervised contrastive learning overlook two issues: (1) imbalanced data distribution across label hierarchy causing high-frequency classes to dominate learning, and (2) semantic relationship learning interfering with class-agnostic localization.

Method: Proposes balanced hierarchical contrastive loss with learnable class prototypes that equilibrates gradients from different classes at each hierarchical level, ensuring equal contribution. Also uses decoupled learning strategy that separates DETR’s object queries into classification and localization sets for task-specific feature extraction and optimization.
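
A toy version of the class-balancing idea at a single hierarchy level, assuming learnable class prototypes and normalized embeddings; averaging within each class before averaging across classes equalizes per-class contributions, a simple stand-in for the paper's gradient equilibration:

```python
import torch
import torch.nn.functional as F

def balanced_proto_contrastive(feats, labels, prototypes, tau=0.1):
    """Each class present in the mini-batch contributes equally,
    regardless of how many samples it has."""
    feats = F.normalize(feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.t() / tau                    # (N, C) similarities
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    classes = labels.unique()
    loss = sum(per_sample[labels == c].mean() for c in classes)
    return loss / len(classes)

feats = torch.randn(8, 128)
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])          # imbalanced batch
prototypes = torch.randn(5, 128, requires_grad=True)     # learnable prototypes
balanced_proto_contrastive(feats, labels, prototypes).backward()
```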

Result: Experiments on three fine-grained datasets with hierarchical annotations demonstrate that the method outperforms state-of-the-art approaches.

Conclusion: The proposed balanced hierarchical contrastive loss with decoupled learning effectively addresses hierarchical label imbalance and localization interference, improving fine-grained detection performance in remote sensing applications.

Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR’s object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.

[148] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao, Wangli Lan, Jing Lin, Zhixin Ma, Tingting Zhou, Harry Yang

Main category: cs.CV

TL;DR: RainFusion2.0 is a hardware-efficient sparse attention mechanism that accelerates Diffusion Transformer models for video/image generation by achieving 80% sparsity with 1.5-1.8x speedup while maintaining quality.

DetailsMotivation: DiT models have high computational costs from attention mechanisms, limiting practical applications. Existing sparse attention methods have overhead issues and lack hardware generality, being mostly GPU-focused despite diverse hardware adoption.

Method: Proposes RainFusion2.0 with three key techniques: (1) using block-wise mean values as representative tokens for sparse mask prediction, (2) implementing spatiotemporal-aware token permutation, and (3) introducing a first-frame sink mechanism for video generation.
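
A small sketch of insight (1): block-wise means serve as representative tokens to predict which block pairs of full attention are worth computing. Block size and keep ratio are illustrative, not tuned values:

```python
import torch

def block_sparse_mask(q, k, block=64, keep_ratio=0.2):
    """Score block pairs by mean-query x mean-key similarity and keep
    the top fraction per query block (True = compute this block pair)."""
    B, L, D = q.shape
    qb = q.view(B, L // block, block, D).mean(dim=2)   # block-mean tokens
    kb = k.view(B, L // block, block, D).mean(dim=2)
    scores = qb @ kb.transpose(1, 2)                   # (B, nb, nb)
    n_keep = max(1, int(keep_ratio * scores.shape[-1]))
    idx = scores.topk(n_keep, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

mask = block_sparse_mask(torch.randn(1, 1024, 64), torch.randn(1, 1024, 64))
```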

Result: Achieves 80% sparsity with 1.5-1.8x end-to-end speedup without compromising video quality. Demonstrates effectiveness across various generative models and validates generalization across diverse hardware platforms.

Conclusion: RainFusion2.0 provides an online adaptive, hardware-efficient, low-overhead sparse attention mechanism that accelerates both video and image generative models with robust performance across diverse hardware platforms.

Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.

[149] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

Yijie Qian, Juncheng Wang, Yuxiang Feng, Chao Xu, Wang Lu, Yang Liu, Baigui Sun, Yiqiang Chen, Yong Liu, Shujun Wang

Main category: cs.CV

TL;DR: LMR introduces a two-stage “Think-then-Act” approach for text-to-motion generation using dual latent spaces to bridge the semantic-kinematic gap, improving both semantic alignment and physical plausibility.

DetailsMotivation: Current T2M methods treat generation as direct translation, facing a "Semantic-Kinematic Impedance Mismatch" - difficulty grounding discrete linguistic intent into continuous motion data in one shot. This System 1 approach has fundamental theoretical limitations.

Method: Proposes Latent Motion Reasoning (LMR) with two-stage architecture: 1) “Think” stage uses Reasoning Latent for global trajectory planning in compressed semantic space, 2) “Act” stage uses Execution Latent for high-frequency motion instantiation. Features Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds.
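
A schematic of the Think-then-Act decode; `decode_reasoning` and `decode_execution` are hypothetical interfaces for the two latent streams, shown only to make the two-stage control flow concrete:

```python
def think_then_act(model, text_tokens, plan_len=16, motion_len=196):
    """Autoregressively plan in the coarse reasoning space, then
    instantiate frames in the high-frequency execution space."""
    plan = []
    for _ in range(plan_len):          # Think: plan the global trajectory
        plan.append(model.decode_reasoning(text_tokens, plan))
    motion = []
    for _ in range(motion_len):        # Act: instantiate the frames
        motion.append(model.decode_execution(text_tokens, plan, motion))
    return plan, motion
```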

Result: LMR yields non-trivial improvements in both semantic alignment and physical plausibility when implemented on T2M-GPT (discrete) and MotionStreamer (continuous) baselines. Validates that motion-aligned concept space is better than natural language for motion planning.

Conclusion: The optimal substrate for motion planning is not natural language but a learned, motion-aligned concept space. Architectural shift to Latent System 2 Reasoning with two-stage Think-then-Act process effectively bridges the gap between language and physics.

Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR), which reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR’s versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/

[150] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Yongtao Chen, Yanbo Wang, Wentao Zhao, Guole Shen, Tianchen Deng, Jingchuan Wang

Main category: cs.CV

TL;DR: A training-free generative adversarial attack framework using diffusion models to create naturalistic adversarial objects that fool monocular depth estimation systems in autonomous driving.

DetailsMotivation: Current physical attacks on MDE systems rely on unrealistic texture patches with strict placement constraints, limiting their effectiveness in complex driving environments. Errors in depth estimation can propagate through downstream decision making and impact traffic safety.

Method: A diffusion-based conditional generation framework with two key components: 1) Salient Region Selection module to identify regions most influential to MDE, and 2) Jacobian Vector Product Guidance mechanism to steer adversarial gradients toward update directions supported by the pre-trained diffusion model.
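
A toy rendering of the JVP guidance step built on torch.autograd.functional.jvp; the placeholder denoiser, the adversarial gradient, and the way the JVP response modulates the update are illustrative assumptions, not the paper's exact formulation:

```python
import torch
from torch.autograd.functional import jvp

eps_theta = lambda z: torch.tanh(z)      # placeholder for the frozen denoiser
z = torch.randn(1, 4, 8, 8)              # latent being optimized
adv_grad = torch.randn_like(z)           # e.g. gradient of a depth-shift loss

# The JVP measures how the denoiser's output moves along adv_grad; damping
# the update by that response keeps it in directions the model supports.
out, response = jvp(eps_theta, (z,), (adv_grad,))
z = z - 0.1 * adv_grad * response.abs()  # one guided adversarial step
```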

Result: The method generates physically plausible adversarial objects that induce substantial adversarial depth shifts. Extensive digital and physical experiments show it significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability.

Conclusion: The framework demonstrates strong practical implications for autonomous driving safety assessment by creating more realistic and effective adversarial attacks that can better evaluate system vulnerabilities in complex driving environments.

Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

[151] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Zijun Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, Junchi Yan

Main category: cs.CV

TL;DR: GeoBench is a hierarchical benchmark for evaluating geometric reasoning in vision-language models, addressing limitations of current evaluations through four reasoning levels and six formally verified tasks.

DetailsMotivation: Current evaluations of geometric reasoning in VLMs have limitations: risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity.

Method: Developed GeoBench with four hierarchical reasoning levels (Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, Self-Reflective Backtracking) and six formally verified tasks generated via TrustGeoGen to systematically assess capabilities from attribute extraction to logical error correction.

Result: Reasoning models like OpenAI-o3 outperform general MLLMs, but performance declines significantly with increasing task complexity. Sub-goal decomposition and irrelevant premise filtering critically influence final accuracy, while Chain-of-Thought prompting unexpectedly degrades performance in some tasks.

Conclusion: GeoBench establishes a comprehensive benchmark for geometric problem-solving and offers actionable guidelines for developing geometric reasoning systems, highlighting the importance of structured reasoning processes and the limitations of current prompting techniques.

Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

[152] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

Main category: cs.CV

TL;DR: LLM-based neural architecture generation for computer vision with systematic prompt engineering (FSAP) and fast deduplication (Whitespace-Normalized Hash Validation), achieving efficient automated design across 7 vision benchmarks.

DetailsMotivation: Automated neural architecture design is challenging due to task diversity and computational constraints. LLMs offer a promising alternative to expensive NAS methods, but their application to vision architecture generation lacks systematic study of prompt engineering and validation strategies.

Method: 1) Few-Shot Architecture Prompting (FSAP) - systematic study of supporting examples (n=1-6) for LLM-based architecture generation; 2) Whitespace-Normalized Hash Validation - lightweight deduplication method for preventing redundant training; 3) Dataset-balanced evaluation methodology for comparing architectures across heterogeneous vision tasks.
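
The deduplication step is easy to sketch: collapse all whitespace before hashing so that formatting-only variants of a generated architecture collide. A minimal version (the exact normalization and hash used in NNGPT/LEMUR may differ):

```python
import hashlib

def arch_hash(code: str) -> str:
    """Whitespace-normalized hash: sources differing only in spacing,
    indentation, or blank lines map to the same digest."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = "model = nn.Sequential(\n    nn.Conv2d(3, 16, 3),\n    nn.ReLU(),\n)"
b = "model = nn.Sequential( nn.Conv2d(3, 16, 3), nn.ReLU(), )"
assert arch_hash(a) == arch_hash(b)   # caught as a duplicate, no retraining
```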

Result: Found n=3 examples best balances architectural diversity and context focus. Generated 1,900 unique architectures across 7 computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365). Deduplication method provides 100x speedup over AST parsing (less than 1 ms).

Conclusion: Provides actionable guidelines for LLM-based architecture search in computer vision and establishes rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

[153] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li

Main category: cs.CV

TL;DR: The paper introduces DivGenBench to measure Preference Mode Collapse (PMC) in text-to-image diffusion models aligned via RLHF, and proposes D²-Align framework that directionally corrects reward signals to maintain diversity while achieving human preference alignment.

DetailsMotivation: Existing RLHF methods for text-to-image diffusion models often lead to Preference Mode Collapse - a form of reward hacking where models converge on narrow, high-scoring outputs (like monolithic styles or overexposure), severely degrading generative diversity despite high automated reward scores.

Method: Proposes Directional Decoupling Alignment (D²-Align) framework: 1) Learns a directional correction within the reward model’s embedding space while keeping the model frozen, 2) Applies this correction to the reward signal during optimization to prevent collapse into specific modes and maintain diversity.
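
A toy sketch of the directional correction, assuming the frozen reward model exposes an embedding and a scoring head (`embed` and `score_from_embedding` are hypothetical names); the learned bias direction is stripped from the embedding before scoring:

```python
import torch

def corrected_reward(reward_model, images, direction, lam=1.0):
    """Remove the component of the reward embedding along a learned
    bias direction, then score; the reward model itself stays frozen."""
    emb = reward_model.embed(images)                  # (B, D)
    d = direction / direction.norm()                  # learned direction
    emb = emb - lam * (emb @ d).unsqueeze(-1) * d     # strip biased component
    return reward_model.score_from_embedding(emb)
```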

Result: Comprehensive evaluation combining qualitative analysis with quantitative metrics for both quality and diversity shows that D²-Align achieves superior alignment with human preference while maintaining diversity.

Conclusion: The paper successfully identifies and quantifies Preference Mode Collapse, introduces a benchmark to measure it, and provides an effective solution (D²-Align) that prevents reward hacking while maintaining both quality and diversity in text-to-image generation.

Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC) - a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model’s inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D²-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model’s embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D²-Align achieves superior alignment with human preference.

[154] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

TsaiChing Ni, ZhenQi Chen, YuanFu Yang

Main category: cs.CV

TL;DR: IMDD-1M is the first large-scale industrial multimodal defect dataset with 1M image-text pairs, enabling various applications. A diffusion-based vision-language foundation model trained on it achieves comparable performance to expert models with only 5% of task-specific data.

DetailsMotivation: There's a lack of large-scale multimodal datasets for industrial defect analysis, limiting the development of advanced AI solutions for manufacturing quality inspection and defect understanding.

Method: Created IMDD-1M dataset with 1M aligned image-text pairs covering 60+ material categories and 400+ defect types. Trained a diffusion-based vision-language foundation model from scratch on this dataset, designed for industrial scenarios.

Result: The foundation model serves as a generalizable base that can be efficiently adapted to specialized domains through lightweight fine-tuning, achieving comparable performance to dedicated expert models with less than 5% of task-specific data.

Conclusion: IMDD-1M enables scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence, demonstrating the potential of data-efficient foundation model adaptation for industrial inspection and generation tasks.

Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

[155] Bayesian Self-Distillation for Image Classification

Anton Adelöw, Matteo Gamba, Atsuto Maki

Main category: cs.CV

TL;DR: BSD uses Bayesian inference on model’s own predictions to create sample-specific target distributions, eliminating hard targets after initialization, improving accuracy, calibration, and robustness.

DetailsMotivation: Hard targets in supervised training cause overconfidence, poor calibration, and limited generalization/robustness. Existing self-distillation methods still rely on hard targets, reducing their effectiveness.

Method: Bayesian Self-Distillation (BSD) constructs sample-specific target distributions via Bayesian inference using the model’s own predictions, eliminating dependency on hard targets after initialization.
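
A toy Dirichlet-style reading of the idea, keeping per-sample pseudo-counts that are updated with the model's own softmax outputs and using the posterior mean as the soft target; BSD's actual inference scheme may differ:

```python
import torch

def bsd_targets(pred_probs, alpha, strength=1.0):
    """Add current predictions as evidence to the pseudo-counts and
    return the posterior mean as the sample-specific soft target."""
    alpha = alpha + strength * pred_probs
    return alpha / alpha.sum(dim=-1, keepdim=True), alpha

num_classes = 10
alpha = torch.full((4, num_classes), 1.0)   # e.g. seeded from labels once
for _ in range(3):                          # over training epochs
    pred_probs = torch.softmax(torch.randn(4, num_classes), dim=-1)
    targets, alpha = bsd_targets(pred_probs, alpha)
    # train with e.g. F.kl_div(log_probs, targets, reduction="batchmean")
```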

Result: BSD consistently improves test accuracy (+1.4% ResNet-50 on CIFAR-100) and reduces Expected Calibration Error (-40% ResNet-50, CIFAR-100). Also improves robustness against data corruptions, perturbations, and label noise.

Conclusion: BSD is a principled self-distillation method that outperforms existing architecture-preserving approaches in accuracy, calibration, and robustness, achieving SOTA robustness under label noise when combined with contrastive loss.

Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model’s own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model’s own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.

[156] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng

Main category: cs.CV

TL;DR: DiffThinker is a diffusion-based multimodal reasoning framework that treats reasoning as a generative image-to-image task, outperforming leading MLLMs in vision-centric tasks.

DetailsMotivation: Current Multimodal Large Language Models (MLLMs) are too text-centric, leading to poor performance in complex long-horizon, vision-centric tasks that require strong logical consistency and spatial precision.

Method: Introduces a Generative Multimodal Reasoning paradigm with DiffThinker, a diffusion-based framework that reformulates multimodal reasoning as a native generative image-to-image task rather than text-centric reasoning.

Result: DiffThinker significantly outperforms leading models: +314.2% over GPT-5, +111.6% over Gemini-3-Flash, and +39.0% over fine-tuned Qwen3-VL-32B across four domains (sequential planning, combinatorial optimization, constraint satisfaction, spatial configuration).

Conclusion: Generative multimodal reasoning via diffusion-based frameworks like DiffThinker represents a promising approach for vision-centric reasoning, offering superior logical consistency and spatial precision compared to text-centric MLLMs.

Abstract: While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

[157] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

Yu-Tang Chang, Pin-Wei Chen, Shih-Fang Chen

Main category: cs.CV

TL;DR: Deep Global Clustering (DGC) is a memory-efficient framework for hyperspectral image segmentation that learns global clustering from local patches without pre-training, enabling fast training on consumer hardware but suffering from optimization instability due to loss balancing issues.

DetailsMotivation: Hyperspectral imaging analysis faces computational bottlenecks from massive data volumes that exceed memory limits. Existing foundation models pre-trained on remote sensing datasets often fail to transfer to domain-specific applications like close-range agricultural monitoring due to fundamental differences in spectral signatures, spatial scales, and semantic targets.

Method: DGC operates on small patches with overlapping regions to enforce consistency, learning global clustering structure from local patch observations without requiring pre-training. The framework maintains constant memory usage and can be trained in under 30 minutes on consumer hardware.
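
A toy version of the overlap-consistency constraint: soft cluster assignments from two neighboring patches must agree on their shared strip. The shapes and the KL form are illustrative:

```python
import torch
import torch.nn.functional as F

def overlap_consistency(assign_a, assign_b, mask_a, mask_b):
    """assign_* are (H, W, K) soft cluster assignments; mask_* select
    the pixels the two patches share."""
    pa = assign_a[mask_a]                 # (n_shared, K)
    pb = assign_b[mask_b]
    return F.kl_div(pa.clamp_min(1e-8).log(), pb, reduction="batchmean")

K = 8
a, b = torch.softmax(torch.randn(2, 64, 64, K), dim=-1)
mask_a = torch.zeros(64, 64, dtype=torch.bool); mask_a[:, -16:] = True
mask_b = torch.zeros(64, 64, dtype=torch.bool); mask_b[:, :16] = True
loss = overlap_consistency(a, b, mask_a, mask_b)
```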

Result: On a leaf disease dataset, DGC achieves background-tissue separation with mean IoU of 0.925 and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability where meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space.

Conclusion: DGC presents a promising design philosophy for memory-efficient HSI segmentation, but stable implementation requires principled approaches to dynamic loss balancing to address the optimization instability rooted in multi-objective loss balancing. The work is positioned as intellectual scaffolding rather than a fully stable solution.

Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.

[158] Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu

Main category: cs.CV

TL;DR: The paper proposes Internal Guidance (IG), a simple training strategy that adds auxiliary supervision on intermediate layers and extrapolates outputs during sampling, improving diffusion model training efficiency and generation quality without extra degradation strategies or sampling steps.

DetailsMotivation: Standard diffusion models struggle with low-probability areas due to insufficient training data coverage. Existing guidance methods like CFG cause over-simplified/distorted samples, while alternative methods require carefully designed degradation strategies, extra training, and additional sampling steps.

Method: Internal Guidance (IG) introduces auxiliary supervision on intermediate layers during training and extrapolates intermediate and deep layer outputs during sampling. This simple strategy improves training efficiency and generation quality without complex degradation strategies or additional sampling steps.
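
The sampling-time extrapolation is essentially one line; a minimal sketch assuming a CFG-style extrapolation from the auxiliary intermediate-layer prediction toward the deep one, with an illustrative guidance weight:

```python
import torch

def internal_guidance(deep_out, mid_out, w=1.5):
    """w > 1 pushes samples away from the weaker internal prediction,
    guiding the model with its own internal dynamics."""
    return mid_out + w * (deep_out - mid_out)

eps = internal_guidance(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32))
```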

Result: IG yields significant improvements on various baselines: SiT-XL/2+IG achieves FID=5.31 (80 epochs) and FID=1.75 (800 epochs) on ImageNet 256x256. LightningDiT-XL/1+IG achieves FID=1.34, and combined with CFG achieves state-of-the-art FID=1.19.

Conclusion: Internal Guidance is a simple yet effective strategy that significantly improves diffusion model performance without the limitations of existing guidance methods, achieving state-of-the-art results on ImageNet generation benchmarks.

Abstract: The diffusion model has a powerful ability to capture the entire (conditional) data distribution. However, without sufficient training and data to learn to cover low-probability areas, the model is penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can steer samples toward high-probability areas during the sampling stage. However, standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with a degraded version of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces an auxiliary supervision on an intermediate layer during training and extrapolates the intermediate and deep layers’ outputs to obtain generative results during sampling. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.

[159] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds

Pieter M. Blok, Haozhou Wang, Hyun Kwon Suh, Peicheng Wang, James Burridge, Wei Guo

Main category: cs.CV

TL;DR: PointRAFT is a high-throughput point cloud regression network that directly predicts potato tuber weight from partial RGB-D point clouds, overcoming self-occlusion issues without full 3D reconstruction.

DetailsMotivation: Potato yield estimation using RGB-D cameras on harvesters faces challenges due to incomplete point clouds from self-occlusion, leading to systematic underestimation of tuber weight. Current methods struggle with partial data.

Method: PointRAFT uses a novel object height embedding to incorporate tuber height as additional geometric cue, directly predicting continuous 3D shape properties from partial point clouds without full 3D reconstruction.
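
A toy regressor showing where the object height embedding enters, with a small stand-in point encoder instead of the paper's PointNet++-style backbone; dimensions and the fusion scheme are illustrative:

```python
import torch
import torch.nn as nn

class HeightEmbeddingRegressor(nn.Module):
    """Embed the tuber's height (a scalar geometric cue) and fuse it
    with a pooled point feature before the weight head."""
    def __init__(self, dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        self.height_embed = nn.Sequential(nn.Linear(1, dim), nn.ReLU())
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, pts, height):
        f = self.point_mlp(pts).max(dim=1).values        # global pooling
        h = self.height_embed(height.unsqueeze(-1))
        return self.head(torch.cat([f, h], dim=-1)).squeeze(-1)  # weight (g)

w = HeightEmbeddingRegressor()(torch.randn(4, 2048, 3), torch.rand(4))
```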

Result: Achieved MAE of 12.0g and RMSE of 17.2g on test set of 5,254 point clouds, outperforming linear regression and PointNet++ baselines. Processes 150 tubers/second with 6.3ms inference time per point cloud.

Conclusion: PointRAFT enables accurate, high-throughput potato weight estimation on operational harvesters and provides a versatile regression framework for 3D phenotyping and robotic perception tasks beyond agriculture.

Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.

[160] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim

Main category: cs.CV

TL;DR: CorGi is a training-free framework that accelerates DiT inference by caching and reusing low-contribution transformer blocks across denoising steps, achieving up to 2x speedup while maintaining generation quality.

DetailsMotivation: Diffusion transformers (DiT) have high inference costs due to iterative denoising and large model capacity, with substantial redundant computation across steps that needs to be reduced.

Method: CorGi selectively reuses outputs of low-contribution transformer blocks across denoising steps using interval caching. CorGi+ extends this for text-to-image tasks by using cross-attention maps to identify salient tokens and apply partial attention updates.
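
A toy rendering of interval caching: at the first step of each interval every block runs and caches its residual; within the interval, blocks flagged as low-contribution reuse the cache. The block ranking and interval length are illustrative:

```python
import torch

def cached_forward(blocks, x, step, cache, low_contrib, interval=4):
    refresh = (step % interval == 0)
    for i, block in enumerate(blocks):
        if refresh or i not in low_contrib:
            cache[i] = block(x)        # recompute this block's residual
        x = x + cache[i]               # otherwise reuse the cached output
    return x

blocks = [torch.nn.Linear(64, 64) for _ in range(6)]
low_contrib = {1, 4}                   # indices ranked as low-contribution
x, cache = torch.randn(2, 64), {}
for step in range(8):                  # denoising steps
    x = cached_forward(blocks, x, step, cache, low_contrib)
```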

Result: Evaluation on state-of-the-art DiT models shows CorGi and CorGi+ achieve up to 2.0x speedup on average while preserving high generation quality.

Conclusion: CorGi provides an effective training-free approach to reduce redundant computation in DiT models, significantly accelerating inference without compromising visual generation quality.

Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.

[161] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Sina Jahromi, Farshid Hajati, Alireza Rezaee, Javaher Nourian

Main category: cs.CV

TL;DR: A progressive GAN generates synthetic data to address class imbalance in COVID-19 chest X-ray classification, combined with real data using weighted approach and optimized via meta-heuristic algorithm, achieving high accuracy on imbalanced datasets.

DetailsMotivation: Medical image classification faces severe class imbalance issues, especially during pandemics like COVID-19 where infected cases are rare compared to normal cases. This imbalance hinders AI/ML algorithms from accurately detecting diseases, creating a need for methods to handle imbalanced medical datasets effectively.

Method: Proposes a progressive generative adversarial network (GAN) to generate synthetic data for minority classes. Uses weighted approach to combine synthetic and real data before feeding to deep classifier. Employs multi-objective meta-heuristic population-based optimization algorithm to optimize classifier hyper-parameters.
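
The weighted combination step can be sketched with PyTorch's sampler utilities; the datasets and mixing weights below are placeholders for the real chest X-ray images and the ProGAN outputs:

```python
import torch
from torch.utils.data import (TensorDataset, ConcatDataset,
                              WeightedRandomSampler, DataLoader)

real_ds = TensorDataset(torch.randn(100, 1, 64, 64))    # real X-rays
synth_ds = TensorDataset(torch.randn(40, 1, 64, 64))    # GAN-generated

real_w, synth_w = 1.0, 0.5                              # illustrative weights
weights = [real_w] * len(real_ds) + [synth_w] * len(synth_ds)
loader = DataLoader(ConcatDataset([real_ds, synth_ds]), batch_size=32,
                    sampler=WeightedRandomSampler(weights,
                                                  num_samples=len(weights)))
```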

Result: Achieves 95.5% accuracy for 4-class imbalanced classification and 98.5% accuracy for 2-class imbalanced classification on large chest X-ray COVID-19 dataset. Shows superior cross-validated metrics compared to existing methods.

Conclusion: The proposed model effectively addresses class imbalance in medical image classification during pandemics by generating synthetic data and optimizing classifier parameters, demonstrating practical value for COVID-19 detection and similar medical applications.

Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19-infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data that supplements the real data. The method uses a weighted approach to combine synthetic data with real data before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

[162] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Ziquan Liu, Zhewei Zhu, Xuyang Shi

Main category: cs.CV

TL;DR: ARM is a lightweight learnable module that refines CLIP’s internal features for open-vocabulary semantic segmentation, enabling “train once, use anywhere” capability with minimal inference overhead.

DetailsMotivation: Existing training-free OVSS methods either rely on expensive external models (SAM, DINO) or use suboptimal static heuristics on CLIP features, lacking efficient pixel-level detail refinement.

Method: ARM uses a semantically-guided cross-attention block to refine detail-rich shallow features with robust deep features, followed by self-attention. It is trained once on a general-purpose dataset (e.g., COCO-Stuff) and then used as a plug-and-play post-processor.

Result: ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient paradigm for training-free OVSS.

Conclusion: ARM effectively unlocks CLIP’s internal potential through adaptive hierarchical feature fusion, offering a universal, efficient solution for training-free open-vocabulary semantic segmentation.

Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
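
As a rough sketch of the described fusion (dimensions, normalization placement, and layer choices are illustrative, not the paper's exact design), the cross-attention step uses shallow tokens as queries against deep keys/values, followed by self-attention:

```python
import torch.nn as nn

class AttentionRefinementSketch(nn.Module):
    """Detail-rich shallow tokens query semantically robust deep tokens,
    then a self-attention block smooths the refined map."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, shallow, deep):        # both: (batch, tokens, dim)
        q = self.norm_q(shallow)
        refined, _ = self.cross(query=q, key=deep, value=deep)
        refined = shallow + refined           # residual around cross-attention
        s = self.norm_s(refined)
        out, _ = self.self_attn(s, s, s)
        return refined + out                  # residual around self-attention
```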

[163] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Shuyun Wang, Haiyang Sun, Bing Wang, Hangjun Ye, Xin Yu

Main category: cs.CV

TL;DR: Mirage is a one-step video diffusion model for photorealistic and temporally coherent asset editing in driving scenes, addressing fidelity and alignment challenges through 2D-3D feature fusion and two-stage data alignment.

DetailsMotivation: Vision-centric autonomous driving needs diverse training data, but existing video object editing methods struggle with maintaining both visual fidelity and temporal coherence for effective data augmentation.

Method: Builds on text-to-video diffusion prior for temporal consistency. Uses temporally agnostic latents from pretrained 2D encoder injected into 3D decoder to restore detail while preserving causality. Introduces two-stage data alignment combining coarse 3D alignment and fine 2D refinement to address distribution mismatch between scene objects and inserted assets.

Result: Extensive experiments show Mirage achieves high realism and temporal consistency across diverse editing scenarios. Can also generalize to other video-to-video translation tasks.

Conclusion: Mirage provides a reliable baseline for photorealistic and coherent asset editing in driving scenes, addressing key challenges in video object editing for autonomous driving data augmentation.

Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.

[164] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

Rahul Medicharla, Alper Yilmaz

Main category: cs.CV

TL;DR: MotivNet is a generalizable facial emotion recognition model that achieves competitive cross-domain performance without cross-domain training by using Meta’s Sapiens vision foundation model as its backbone.

DetailsMotivation: Current FER models have weak generalization on diverse real-world data, requiring cross-domain training which is impractical for real-world applications. The paper aims to create a model that generalizes well without needing cross-domain training.

Method: Uses Meta’s Sapiens (a human vision foundational model with masked autoencoder pretraining) as backbone, adds MotivNet as downstream task, and defines three evaluation criteria: benchmark performance, model similarity, and data similarity.

Result: MotivNet achieves competitive performance across datasets without cross-domain training, meets all three evaluation criteria, and validates it as a viable Sapiens downstream task.

Conclusion: MotivNet successfully addresses FER generalization issues, makes FER more practical for real-world applications, and demonstrates the effectiveness of using vision foundation models for emotion recognition tasks.

Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require cross-domain training to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundation model with state-of-the-art real-world generalization obtained through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task and making FER more practical for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.
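
The downstream-task pattern described here can be sketched as a frozen foundation backbone feeding a small trainable head; `backbone` stands in for a pretrained Sapiens encoder, and `feat_dim` and `num_emotions` are placeholder values:

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Frozen backbone plus a trainable emotion classifier on top."""
    def __init__(self, backbone, feat_dim=1024, num_emotions=7):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # preserve pretrained features
        self.head = nn.Linear(feat_dim, num_emotions)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)    # assumed: (batch, feat_dim)
        return self.head(feats)
```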

[165] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

Fuqiang Gu, Yuanke Li, Xianlei Long, Kangping Ji, Chao Chen, Qingyi Gu, Zhenliang Ni

Main category: cs.CV

TL;DR: MambaSeg: A dual-branch semantic segmentation framework using parallel Mamba encoders for RGB and event data fusion with spatial-temporal interaction modules, achieving SOTA performance with lower computational cost.

DetailsMotivation: RGB-based segmentation degrades under challenging conditions (fast motion, low-light, HDR) due to frame camera limitations, while event cameras lack color/texture. Existing multimodal fusion methods are computationally expensive and focus mainly on spatial fusion, neglecting temporal dynamics in event streams.

Method: Proposes MambaSeg with parallel Mamba encoders for RGB and event streams. Introduces Dual-Dimensional Interaction Module (DDIM) with Cross-Spatial Interaction Module (CSIM) and Cross-Temporal Interaction Module (CTIM) for fine-grained spatial-temporal fusion to reduce cross-modal ambiguity and improve alignment.

Result: Extensive experiments on DDD17 and DSEC datasets show state-of-the-art segmentation performance while significantly reducing computational cost compared to existing methods.

Conclusion: MambaSeg demonstrates promise for efficient, scalable, and robust multimodal perception by effectively leveraging complementary properties of RGB and event modalities through efficient spatial-temporal fusion.

Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

[166] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

Zhi Li, Yaqi Wang, Bingtao Ma, Yifan Zhang, Huiyu Zhou, Shuai Wang

Main category: cs.CV

TL;DR: PGMP framework for dental CBCT metal artifact reduction uses physics-based simulation for training data and deterministic manifold projection for fast, artifact-free reconstruction without stochastic sampling.

DetailsMotivation: Current deep learning methods for metal artifact reduction in dental CBCT have limitations: supervised methods suffer from spectral blurring due to regression-to-the-mean, unsupervised methods risk structural hallucinations, and diffusion models are too slow for clinical use due to iterative sampling.

Method: Three-component framework: 1) Anatomically-Adaptive Physics Simulation (AAPS) synthesizes high-fidelity training pairs using Monte Carlo spectral modeling and patient-specific digital twins; 2) DMP-Former adapts direct x-prediction paradigm for deterministic manifold projection in single forward pass; 3) Semantic-Structural Alignment (SSA) module uses medical foundation model priors (MedDINOv3) to ensure clinical plausibility.

Result: PGMP outperforms state-of-the-art methods on both synthetic and multi-center clinical datasets, particularly on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability.

Conclusion: The PGMP framework successfully addresses key limitations in metal artifact reduction for dental CBCT by combining physics-based simulation, deterministic manifold projection, and medical foundation model guidance, achieving both computational efficiency and clinical reliability.

Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to “regression-to-the-mean”, while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP

[167] Taming Hallucinations: Boosting MLLMs’ Video Understanding via Counterfactual Video Generation

Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang

Main category: cs.CV

TL;DR: DualityForge is a framework that uses diffusion-based video editing to create counterfactual videos and QA pairs to reduce hallucinations in multimodal LLMs, achieving 24% improvement over baseline.

DetailsMotivation: MLLMs suffer from visual ungrounded hallucinations due to over-reliance on language priors, especially with counterfactual videos. This is hard to fix because collecting counterfactual data is expensive.

Method: DualityForge uses controllable diffusion-based video editing to transform real videos into counterfactual scenarios, automatically generating QA pairs. They also propose DNA-Train, a two-stage SFT-RL training with pair-wise advantage normalization.

Result: 24.0% relative improvement over Qwen2.5-VL-7B baseline on counterfactual videos, with significant gains on both hallucination and general-purpose benchmarks, showing strong generalization.

Conclusion: The approach effectively reduces MLLM hallucinations on counterfactual videos through synthetic data generation and contrastive training, with promising generalization capabilities.

Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
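
The abstract gives only the name of the normalization; below is a hedged sketch of one plausible reading of pair-wise $\ell_1$ advantage normalization over original/edited rollout groups (the paper's exact formulation may well differ):

```python
import torch

def pairwise_l1_normalized_advantages(rewards_orig, rewards_edit):
    """Center rewards within each rollout group, then scale by the group's
    mean absolute advantage (an l1-based normalizer).
    rewards_*: (num_pairs, group_size) rewards per video variant."""
    advantages = []
    for rewards in (rewards_orig, rewards_edit):
        centered = rewards - rewards.mean(dim=1, keepdim=True)
        l1_scale = centered.abs().mean(dim=1, keepdim=True)
        advantages.append(centered / (l1_scale + 1e-8))
    return torch.cat(advantages, dim=1)
```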

[168] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Zhihua Wang, Fei Wu, Quanlin Li, Pinghong Zhou, Shuo Wang, Xian Yang

Main category: cs.CV

TL;DR: EndoRare is a one-shot generative framework that synthesizes diverse, high-fidelity rare gastrointestinal lesion images from a single reference, using language-guided concept disentanglement to separate diagnostic features from non-diagnostic attributes.

DetailsMotivation: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, limiting data for AI model development and training novice clinicians, creating a "rare-disease gap" in both computer-aided diagnostics and clinical education.

Method: One-shot, retraining-free generative framework using language-guided concept disentanglement to separate pathognomonic lesion features from non-diagnostic attributes. The diagnostic features are encoded into a learnable prototype embedding while varying non-diagnostic attributes to ensure diversity in synthetic images.

Result: Validated across four rare pathologies with synthetic images judged clinically plausible by experts. When used for data augmentation, significantly enhanced downstream AI classifiers (improved true positive rate at low false-positive rates). Blinded reader study showed novice endoscopists exposed to EndoRare-generated cases achieved 0.400 increase in recall and 0.267 increase in precision.

Conclusion: Establishes a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education through one-shot generative synthesis of rare lesion exemplars.

Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

[169] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Md. Enamul Hoq, Linda Larson-Prior, Fred Prior

Main category: cs.CV

TL;DR: Virtual-Eyes CT preprocessing improves generalist foundation models for lung cancer screening but harms specialist models adapted to raw clinical data.

DetailsMotivation: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT lung cancer screening, and its differential impact on generalist vs specialist models needs evaluation.

Method: Developed Virtual-Eyes, a 16-bit CT quality-control pipeline that enforces 512x512 resolution, rejects non-diagnostic series, extracts contiguous lung blocks using Hounsfield-unit filtering and lung-coverage scoring. Tested on 765 NLST patients using RAD-DINO, Merlin, Sybil, and ResNet-18 models with frozen encoders.

Result: Virtual-Eyes improved RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and 0.619 to 0.735 (max pooling). Sybil and ResNet-18 degraded under Virtual-Eyes, and Merlin showed limited transferability regardless of preprocessing.

Conclusion: Anatomically targeted preprocessing can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context, highlighting the need for careful preprocessing selection based on model type.

Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
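
A toy sketch of the Hounsfield-unit filtering and lung-coverage scoring step follows; the HU window, coverage threshold, and return convention are illustrative, not the pipeline's actual parameters:

```python
import numpy as np

def lung_block(volume_hu, hu_window=(-950, -400), min_frac=0.05):
    """Score each axial slice by the fraction of voxels inside a typical
    lung HU window, then return the longest contiguous run of slices whose
    coverage exceeds a threshold. volume_hu: (slices, H, W)."""
    lo, hi = hu_window
    frac = ((volume_hu >= lo) & (volume_hu <= hi)).mean(axis=(1, 2))
    keep = frac >= min_frac
    best, start, best_len = (0, 0), None, 0
    for i, k in enumerate(np.append(keep, False)):  # sentinel closes last run
        if k and start is None:
            start = i
        elif not k and start is not None:
            if i - start > best_len:
                best, best_len = (start, i), i - start
            start = None
    return best  # (first_slice, last_slice_exclusive) of the lung block
```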

[170] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang

Main category: cs.CV

TL;DR: UniAct is a two-stage framework that integrates a fine-tuned multimodal large language model with a causal streaming pipeline to enable humanoid robots to execute diverse multimodal instructions (language, music, trajectories) with sub-500ms latency, achieving 19% improvement in zero-shot motion tracking.

DetailsMotivation: There's a long-standing need for versatile humanoid robots that can follow diverse multimodal instructions with human-level flexibility. Current methods struggle to bridge high-level multimodal perception with whole-body execution and translate heterogeneous instructions into stable, real-time actions.

Method: Two-stage framework: 1) Fine-tuned multimodal large language model (MLLM) for instruction understanding, 2) Causal streaming pipeline for execution. Uses FSQ (Finite Scalar Quantization) to unify inputs through a shared discrete codebook, ensuring cross-modal alignment while constraining motions to physically grounded manifolds.

Result: Achieves sub-500ms latency for executing multimodal instructions. Shows 19% improvement in success rate for zero-shot tracking of imperfect reference motions. Validated on UniMoCap, a 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios.

Conclusion: UniAct represents a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control, bridging the gap between high-level multimodal understanding and real-time whole-body execution.

Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions – such as language, music, and trajectories – into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
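
FSQ (finite scalar quantization) is an existing technique from the compression literature; a minimal sketch of its core step, with illustrative level counts (odd counts keep the rounding grid symmetric; the full method also handles even counts with a half-step offset):

```python
import torch

def fsq_quantize(z, levels=(7, 7, 7, 5, 5)):
    """Bound each latent channel, round it onto a small fixed grid, and use
    a straight-through estimator so gradients flow through the rounding.
    z: (..., len(levels))."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2                        # e.g. 7 levels -> grid -3..3
    bounded = torch.tanh(z) * half            # squash each dim into (-half, half)
    quantized = torch.round(bounded)          # snap to the integer grid
    return bounded + (quantized - bounded).detach()  # straight-through trick
```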

[171] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang, Liang Lin

Main category: cs.CV

TL;DR: CERES is a causal framework that improves egocentric referring video object segmentation by addressing dataset biases and visual confounding factors through dual-modal causal intervention.

DetailsMotivation: Existing Ego-RVOS methods struggle with dataset biases (skewed object-action pairings) and egocentric visual challenges (rapid motion, occlusions), leading to spurious correlations and poor generalization.

Method: CERES implements dual-modal causal intervention: 1) backdoor adjustment to counteract language representation biases from dataset statistics, and 2) front-door adjustment to integrate semantic visual features with geometric depth information guided by causal principles.

Result: CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, demonstrating the effectiveness of causal reasoning for robust egocentric video understanding.

Conclusion: The framework shows the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding tasks.

Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

[172] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu

Main category: cs.CV

TL;DR: SenseNova-MARS is a multimodal agentic reasoning framework that uses RL to enable VLMs to interleave visual reasoning with dynamic tool use (search, cropping) for complex visual understanding tasks.

DetailsMotivation: Current VLMs are limited to text-oriented reasoning and isolated tool use, lacking the human-like ability to seamlessly interleave dynamic tool manipulation with continuous reasoning in knowledge-intensive, visually complex scenarios.

Method: SenseNova-MARS framework integrates image search, text search, and image crop tools with VLMs using reinforcement learning. The BN-GSPO algorithm improves training stability and tool invocation capabilities.

Result: Achieves SOTA performance on search and fine-grained image understanding benchmarks: 67.84 on MMSearch and 41.64 on HR-MMSearch (new benchmark), surpassing proprietary models like Gemini-3-Flash and GPT-5.

Conclusion: SenseNova-MARS represents significant progress toward agentic VLMs with effective tool-use capabilities. The authors will release all code, models, and datasets to facilitate further research.

Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model’s ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
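
The abstract does not spell out BN-GSPO; one hedged reading of the batch-normalization idea, standardizing group-relative rewards with batch-level rather than per-group statistics, is sketched below (the actual algorithm may differ):

```python
import torch

def batch_normalized_group_advantages(rewards):
    """Group-relative (GRPO-style) advantages, scaled by a batch-level
    standard deviation instead of per-group ones.
    rewards: (num_groups, group_size) scalar rewards per sampled response."""
    centered = rewards - rewards.mean(dim=1, keepdim=True)  # group-relative
    return centered / (centered.std() + 1e-8)               # batch-level scale
```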

[173] Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

Main category: cs.CV

TL;DR: LVLDrive enhances Vision-Language Models for autonomous driving by incorporating LiDAR point clouds to improve 3D spatial understanding, addressing limitations of 2D image-based methods.

DetailsMotivation: Current Vision-Language Models (VLMs) for autonomous driving rely on 2D image cues, which struggle with accurate metric spatial reasoning and geometric inference, creating safety and reliability bottlenecks. There's a need to bridge the gap between VLMs' language understanding and robust 3D spatial perception.

Method: Proposes LVLDrive framework that upgrades VLMs with LiDAR point cloud input. Uses Gradual Fusion Q-Former to incrementally inject LiDAR features while preserving pre-trained VLM knowledge. Develops spatial-aware question-answering (SA-QA) dataset to teach advanced 3D perception and reasoning capabilities.

Result: Extensive experiments on driving benchmarks show LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making.

Conclusion: The work demonstrates the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems, successfully enhancing VLMs with robust 3D spatial understanding through LiDAR integration.

Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

[174] The Mechanics of CNN Filtering with Rectification

Liam Frija-Altrac, Matthew Toews

Main category: cs.CV

TL;DR: The paper proposes elementary information mechanics as a new model for understanding CNN filtering mechanics, drawing analogies between kernel even/odd components and physical energy-momentum relations.

DetailsMotivation: To establish a theoretical framework connecting information processing in convolutional neural networks to fundamental physical principles, specifically the energy-momentum relation from relativistic physics.

Method: Decomposes convolutional kernels into orthogonal even and odd components, analyzes their effects on image content (even components cause isotropic diffusion, odd components cause directional displacement), and examines these properties in the spectral domain using discrete cosine transform (DCT) to identify fundamental modes of information propagation.

Result: Shows that even kernels preserve center of mass (analogous to rest/potential energy), odd kernels displace center of mass (analogous to kinetic energy), and information displacement speed is linearly related to odd vs total kernel energy ratio. Small convolutional filters are dominated by low-frequency DCT bases (DC Σ and gradient ∇ components).

Conclusion: This work establishes the first demonstrated link between information processing in generic CNNs and the energy-momentum relation from modern relativistic physics, providing a new “elementary information mechanics” framework for understanding CNN operations.

Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $\Sigma$ and gradient $\nabla$ components, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.
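
The even/odd decomposition is elementary to reproduce: any kernel splits uniquely into components that are symmetric and antisymmetric under a 180-degree flip, and the two parts are orthogonal and sum back to the original kernel, as in this short sketch:

```python
import torch

def even_odd_decompose(kernel):
    """Split a 2D kernel K into K_even + K_odd, where K_even is symmetric
    and K_odd antisymmetric under a point reflection about the center."""
    flipped = torch.flip(kernel, dims=(-2, -1))  # 180-degree flip
    return 0.5 * (kernel + flipped), 0.5 * (kernel - flipped)

# A Sobel-x kernel is purely odd: all of its energy lies in the odd part.
sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
k_even, k_odd = even_odd_decompose(sobel_x)
odd_energy_ratio = k_odd.pow(2).sum() / sobel_x.pow(2).sum()  # -> 1.0
```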

[175] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy

Main category: cs.CV

TL;DR: DermaVQA-DAS extends DermaVQA with expert-developed Dermatology Assessment Schema (DAS) for structured clinical feature annotation, supporting both closed QA and lesion segmentation tasks with multilingual benchmarks.

DetailsMotivation: Existing dermatological datasets focus on dermatoscopic images without patient queries and clinical context, limiting applicability to patient-centered care. Need for datasets that incorporate patient-authored content and systematic clinical assessment.

Method: Introduce DAS framework with 36 high-level and 27 fine-grained assessment questions (English/Chinese). Extend DermaVQA to support closed QA and segmentation tasks. Benchmark multimodal models with various prompting strategies for segmentation and evaluate multiple LLMs for QA.

Result: For segmentation: prompt design impacts performance - default prompt best under Mean-of-Max/Mean-of-Mean, augmented prompt (patient query title+content) best under majority-vote (Jaccard: 0.395, Dice: 0.566 with BiomedParse). For QA: models achieve 0.729-0.798 accuracy, o3 best (0.798), GPT-4.1 close second (0.796), Gemini-1.5-Pro competitive (0.783).

Conclusion: DermaVQA-DAS with DAS schema addresses limitations of existing datasets by providing structured clinical annotations and supporting patient-centered dermatological vision-language modeling. Public release aims to accelerate research in this domain.

Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
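
For reference, the two segmentation metrics reported above are computed on binary masks as follows (standard definitions, not code from the benchmark):

```python
import numpy as np

def jaccard_and_dice(pred, gt):
    """Jaccard index (IoU) and Dice score for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    jaccard = inter / union if union else 1.0   # empty masks count as a match
    dice = 2 * inter / total if total else 1.0
    return jaccard, dice
```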

[176] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

Main category: cs.CV

TL;DR: This paper presents a comprehensive framework and taxonomy for multi-modal pre-training to achieve Spatial Intelligence from sensor data like cameras and LiDAR, addressing integration challenges and proposing a roadmap for general-purpose foundation models.

DetailsMotivation: The rapid advancement of autonomous systems (self-driving vehicles, drones) requires true Spatial Intelligence from multi-modal sensor data. While foundation models work well for single modalities, integrating capabilities across diverse sensors like cameras and LiDAR to create unified understanding remains a major challenge.

Method: The paper presents a comprehensive framework for multi-modal pre-training, analyzing the interplay between foundational sensor characteristics and learning strategies. It formulates a unified taxonomy for pre-training paradigms ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations. The framework also investigates integration of textual inputs and occupancy representations.

Result: The paper identifies the core set of techniques driving progress toward multi-modal Spatial Intelligence, evaluates the role of platform-specific datasets, and provides a taxonomy for pre-training paradigms. It also identifies critical bottlenecks like computational efficiency and model scalability.

Conclusion: The paper proposes a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment in autonomous systems, addressing current limitations and charting future research directions.

Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[177] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics

Gur-Eyal Sela, Kumar Krishna Agrawal, Bharathan Balaji, Joseph Gonzalez, Ion Stoica

Main category: cs.CV

TL;DR: RedunCut is a dynamic model size selection system for live video analytics that reduces compute costs by 14-62% at fixed accuracy through intelligent sampling and accurate performance prediction.

DetailsMotivation: Live video analytics faces high inference costs with modern vision models across massive camera fleets. Existing dynamic model size selection (DMSS) approaches fail to generalize to diverse workloads like mobile videos and lower accuracy targets due to inefficient sampling and inaccurate accuracy prediction.

Method: RedunCut introduces two key components: 1) a measurement-driven planner that estimates the cost-benefit tradeoff of sampling to avoid inefficient sampling, and 2) a lightweight, data-driven performance model to improve per-segment accuracy prediction.

Result: Across road-vehicle, drone, and surveillance videos with multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy while remaining robust to limited historical data and drift.

Conclusion: RedunCut successfully addresses the limitations of prior DMSS systems by optimizing the sampling strategy and improving accuracy prediction, enabling significant cost reductions in live video analytics without model retraining or modification.

Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.

[178] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen, Haiyang Liu

Main category: cs.CV

TL;DR: DyStream is a real-time talking head video generation system that achieves ultra-low latency (34ms per frame) for dyadic conversations by using flow matching-based autoregressive modeling with causal encoder and lookahead module.

DetailsMotivation: Existing chunk-based methods for talking head video generation require full non-causal context windows, introducing significant delays that prevent the immediate, non-verbal feedback needed for realistic listener responses in dyadic conversations.

Method: 1) Stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling; 2) Causal encoder enhanced by a lookahead module that incorporates short future context (60ms) to improve quality while maintaining low latency.

Result: Generates video within 34ms per frame, keeping total system latency under 100ms. Achieves state-of-the-art lip-sync quality with offline LipSync Confidence score of 8.13 and online score of 7.61 on HDTF dataset.

Conclusion: DyStream demonstrates that simple-and-effective causal methods can achieve real-time dyadic talking head video generation with high quality and ultra-low latency, significantly outperforming alternative causal strategies like distillation and generative encoder approaches.

Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module that incorporates short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows that this simple yet effective method significantly surpasses alternative causal strategies, including distillation and generative encoders. Extensive experiments show that DyStream can generate video within 34 ms per frame, guaranteeing that the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.
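
The lookahead idea can be illustrated with a causal temporal convolution whose padding is shifted so each output frame sees a few future frames (a few tens of milliseconds of context) while latency stays bounded; kernel size and lookahead below are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class LookaheadCausalConv(nn.Module):
    """Causal 1D conv with a fixed lookahead: output frame t depends on
    inputs up to t + lookahead, bounding added latency to `lookahead` frames."""
    def __init__(self, dim, kernel_size=5, lookahead=2):
        super().__init__()
        assert lookahead < kernel_size
        self.pad = (kernel_size - 1 - lookahead, lookahead)  # (past, future)
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.conv(F.pad(x, self.pad))
```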

[179] AI-Driven Evaluation of Surgical Skill via Action Recognition

Yan Meng, Daniel A. Donoho, Marcelle Altshuler, Omar Arnaout

Main category: cs.CV

TL;DR: AI-driven framework for automated assessment of microanastomosis surgical skill using video transformers and motion analysis, achieving high accuracy in replicating expert evaluations.

DetailsMotivation: Conventional surgical skill assessment methods are subjective, time-consuming, and require expert supervision, limiting scalability especially in low-resource settings. Need for objective, consistent, and scalable evaluation methods.

Method: Video transformer architecture (TimeSformer) enhanced with hierarchical temporal attention and weighted spatial attention for action recognition. YOLO-based object detection and tracking for fine-grained motion feature extraction. Evaluates five aspects of microanastomosis skill.

Result: 87.7% frame-level accuracy in action segmentation (increased to 93.62% with post-processing). 76% average classification accuracy in replicating expert assessments across all skill aspects.

Conclusion: The AI system provides objective, consistent, and interpretable feedback for surgical skill assessment, enabling standardized, data-driven training and evaluation in surgical education.

Abstract: The development of effective surgical training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.

[180] Exploring Compositionality in Vision Transformers using Wavelet Representations

Akshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia, Deekshith Patel, Yashmitha Gogineni, Konda Reddy Mopuri, Sumohana S. Channappayya

Main category: cs.CV

TL;DR: ViT encoder representations show compositionality when using Discrete Wavelet Transform primitives, with composed representations approximating original image representations.

DetailsMotivation: To understand how Vision Transformers structure information by investigating whether their representations exhibit compositionality, similar to what has been studied in language models.

Method: Introduces a framework using Discrete Wavelet Transform (DWT) to obtain input-dependent primitives for vision, then tests compositionality by examining if composed representations can reproduce original image representations in the ViT encoder.

Result: Primitives from one-level DWT decomposition produce encoder representations that approximately compose in latent space, revealing compositional structure in ViT representations.

Conclusion: Vision Transformers exhibit compositional structure in their representations when analyzed through DWT primitives, providing new insights into how ViTs organize visual information.

Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
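
The DWT primitives are easy to reproduce: a one-level 2D transform yields four sub-bands, and inverting each sub-band alone gives four image-sized components that sum exactly back to the input, since the transform is linear with perfect reconstruction. A short sketch assuming PyWavelets is available:

```python
import numpy as np
import pywt

image = np.random.rand(224, 224)
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")

primitives = [
    pywt.idwt2((LL, (None, None, None)), "haar"),  # approximation
    pywt.idwt2((None, (LH, None, None)), "haar"),  # horizontal detail
    pywt.idwt2((None, (None, HL, None)), "haar"),  # vertical detail
    pywt.idwt2((None, (None, None, HH)), "haar"),  # diagonal detail
]
assert np.allclose(sum(primitives), image)  # exact additive decomposition
```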

[181] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression

Prasiddha Siwakoti, Atefeh Khoshkhahtinat, Piyush M. Mehta, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva

Main category: cs.CV

TL;DR: A learned compression framework for multispectral solar imagery using graph-based modules to model spectral relationships and spatial redundancy, achieving better spectral fidelity and reconstruction quality than baselines.

DetailsMotivation: High-fidelity compression of multispectral solar imagery is challenging for space missions due to limited bandwidth, requiring preservation of both fine spectral and spatial details while balancing compression efficiency.

Method: Two complementary modules: (1) iSWGE (Inter-Spectral Windowed Graph Embedding) models inter-band relationships by representing spectral channels as graph nodes with learned edge features; (2) WSGA-C (Windowed Spatial Graph Attention and Convolutional Block Attention) combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures.

Result: On SDOML dataset across six EUV channels: 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates.

Conclusion: The proposed learned compression framework effectively addresses the challenges of multispectral solar imagery compression by leveraging graph-based modeling of spectral relationships and spatial attention mechanisms, achieving superior spectral fidelity and reconstruction quality for space mission applications.

Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph.
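
A loose sketch of the channels-as-graph-nodes idea behind iSWGE, reduced to a learned soft adjacency that mixes per-channel features; the actual windowed embedding and edge-feature learning are richer than this, and all shapes here are illustrative:

```python
# Toy spectral-channel graph: channels are nodes, a learned soft
# adjacency mixes information across them. Not the paper's module.
import torch
import torch.nn as nn

class SpectralChannelGraph(nn.Module):
    def __init__(self, n_channels: int, dim: int):
        super().__init__()
        self.edge_logits = nn.Parameter(torch.zeros(n_channels, n_channels))
        self.mix = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, C, dim) per-channel features; mix information
        across channels with a learned soft adjacency."""
        adj = self.edge_logits.softmax(dim=-1)          # (C, C) edge weights
        mixed = torch.einsum("cd,bdk->bck", adj, feats)
        return feats + self.mix(mixed)                  # residual update

m = SpectralChannelGraph(n_channels=6, dim=32)   # e.g. six EUV channels
print(m(torch.randn(2, 6, 32)).shape)            # torch.Size([2, 6, 32])
```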

[182] Using Large Language Models To Translate Machine Results To Human Results

Trishna Niraula, Jonathan Stubblefield

Main category: cs.CV

TL;DR: AI pipeline combines YOLO object detection models with GPT-4 to generate radiology reports from chest X-ray images, achieving clinical accuracy but with stylistic differences from human radiologists.

DetailsMotivation: Current AI systems in medical imaging only provide structured predictions, requiring radiologists to manually convert these into narrative reports. There's a need to bridge this gap by automatically generating comprehensive diagnostic narratives from AI findings.

Method: Developed a pipeline integrating YOLOv5 and YOLOv8 for anomaly detection in chest X-rays, then feeding the bounding-box predictions and class labels to GPT-4 to generate natural-language radiology reports. Compared both YOLO models on detection accuracy, inference latency, and text quality.

Result: Strong semantic similarity between AI and human reports. GPT-4 excelled in clarity (4.88/5) but scored lower for natural writing flow (2.81/5). The system achieves clinical accuracy but remains stylistically distinguishable from radiologist-authored text.

Conclusion: The integration of object detection models with LLMs successfully generates clinically accurate radiology reports, but current systems still lack the natural writing style of human radiologists, indicating room for improvement in narrative quality.

Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
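
The handoff in such a pipeline amounts to serializing detector outputs into an LLM prompt. A hedged sketch, with field names and prompt wording that are illustrative rather than the authors':

```python
# Illustrative detection-to-report handoff: YOLO-style predictions are
# serialized into a prompt for an LLM. Schema and wording are guesses.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels

def build_report_prompt(detections: list) -> str:
    lines = [f"- {d.label} (confidence {d.confidence:.2f}) at box {d.box}"
             for d in detections]
    return ("You are drafting a chest X-ray radiology report.\n"
            "Structured findings from an object detector:\n"
            + "\n".join(lines)
            + "\nWrite 'Findings' and 'Impression' sections in clinical prose.")

dets = [Detection("cardiomegaly", 0.91, (120, 200, 420, 460)),
        Detection("pleural effusion", 0.74, (60, 430, 210, 560))]
prompt = build_report_prompt(dets)
# `prompt` would then be sent to GPT-4 through the provider's chat API.
print(prompt)
```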

[183] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

Manikanta Kotthapalli, Banafsheh Rekabdar

Main category: cs.CV

TL;DR: MS-VQ-VAE: A lightweight multi-scale VQ-VAE for generating compact latent representations of low-resolution video, optimized for edge deployment and bandwidth-sensitive applications.

DetailsMotivation: Traditional video codecs lack native support for machine learning latent representations, limiting integration with deep learning pipelines. There's a need for efficient video compression suitable for bandwidth-sensitive scenarios like CDNs and edge devices.

Method: Extends VQ-VAE-2 to spatiotemporal setting with two-level hierarchical latent structure using 3D residual convolutions. Incorporates perceptual loss from pre-trained VGG16. Lightweight design (18.5M parameters) optimized for 64x64 resolution video clips.

Result: Achieves 25.96 dB PSNR and 0.8375 SSIM on UCF101 test set. Improves over single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. Trained on 2-second video clips (32 frames at 16 FPS).

Conclusion: The framework is well-suited for scalable video compression in bandwidth-sensitive scenarios including real-time streaming, mobile video analytics, and CDN-level storage optimization.

Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
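
For orientation, the core vector-quantization step that any VQ-VAE latent layer performs is a nearest-codebook lookup; a generic sketch (codebook size and dimensions are illustrative, and the straight-through gradient and commitment loss are omitted):

```python
# Generic VQ step: each latent vector snaps to its nearest codebook
# entry. This illustrates the mechanism, not the MS-VQ-VAE hierarchy.
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (N, D) latents; codebook: (K, D). Returns indices + quantized z."""
    dists = torch.cdist(z, codebook)    # (N, K) pairwise distances
    idx = dists.argmin(dim=1)           # nearest code per latent
    return idx, codebook[idx]

codebook = torch.randn(512, 64)
z = torch.randn(10, 64)
idx, zq = vector_quantize(z, codebook)
print(idx.shape, zq.shape)  # torch.Size([10]) torch.Size([10, 64])
```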

[184] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou

Main category: cs.CV

TL;DR: PhyGDPO: A physics-aware video generation method using groupwise preference optimization with physics-guided rewards to improve physical consistency in text-to-video synthesis.

DetailsMotivation: Existing text-to-video methods struggle to generate videos that follow physical laws, due to limitations in physics reasoning and scarcity of training data with rich physics interactions.

Method: 1) PhyAugPipe: Uses VLM with chain-of-thought reasoning to create PhyVidGen-135K dataset. 2) PhyGDPO: Physics-aware Groupwise Direct Preference Optimization with Physics-Guided Rewarding scheme using VLM-based physics rewards. 3) LoRA-Switch Reference for efficient training.

Result: Significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2 benchmarks.

Conclusion: The proposed physics-aware framework effectively improves physical consistency in video generation through data construction, groupwise preference optimization, and physics-guided rewarding.

Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
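
For reference, the groupwise Plackett-Luce negative log-likelihood that such an objective builds on can be written in a few lines of PyTorch; the DPO-style reference-policy terms and the physics-guided reward-to-ranking mapping are omitted here:

```python
# Minimal Plackett-Luce groupwise ranking NLL, the probabilistic model
# PhyGDPO is said to build on; only the groupwise likelihood is shown.
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """scores: (G,) model scores for a group of G candidates, sorted
    from most- to least-preferred (e.g. by a physics-guided reward).
    Returns the negative log-likelihood of that full ranking."""
    # log P(ranking) = sum_k [ s_k - logsumexp(s_k .. s_{G-1}) ]
    nll = scores.new_zeros(())
    for k in range(scores.shape[0]):
        nll = nll - (scores[k] - torch.logsumexp(scores[k:], dim=0))
    return nll

scores = torch.tensor([2.0, 1.2, 0.3, -0.5], requires_grad=True)
loss = plackett_luce_nll(scores)
loss.backward()
print(loss.item(), scores.grad)
```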

[185] OCP-LS: An Efficient Algorithm for Visual Localization

Jindi Zhong, Hongxia Wang, Huanshui Zhang

Main category: cs.CV

TL;DR: Novel second-order optimization algorithm for deep learning that combines OCP method with Hessian diagonal approximation, achieving faster convergence and better robustness than conventional methods.

DetailsMotivation: Address large-scale optimization problems in deep learning by developing a more efficient second-order optimization approach that can handle the computational challenges of Hessian matrices while improving training performance.

Method: Proposes a novel second-order optimization algorithm that incorporates the OCP method and appropriately approximates the diagonal elements of the Hessian matrix to make computation feasible for large-scale problems.

Result: Extensive experiments on multiple standard visual localization benchmarks demonstrate significant superiority over conventional optimization algorithms, achieving competitive localization accuracy with faster convergence, enhanced training stability, and improved robustness to noise interference.

Conclusion: The proposed second-order optimization framework successfully addresses large-scale deep learning optimization challenges, offering a practical solution that combines computational efficiency with superior performance characteristics compared to existing methods.

Abstract: This paper proposes a novel second-order optimization algorithm. It addresses large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.
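
The OCP method itself is not spelled out in the abstract, but one generic way to approximate the Hessian diagonal that such optimizers rely on is Hutchinson's estimator with Rademacher probes; a sketch under that assumption:

```python
# Hutchinson-style Hessian-diagonal estimate: diag(H) = E[z * (H z)]
# for Rademacher z. A generic curvature signal, not the paper's OCP.
import torch

def hessian_diagonal_estimate(loss, params, n_probes: int = 4):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_probes):
        zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # +/-1
        gz = sum((g * z).sum() for g, z in zip(grads, zs))
        hvps = torch.autograd.grad(gz, params, retain_graph=True)  # H z
        for d, hv, z in zip(diag, hvps, zs):
            d += hv * z / n_probes
    return diag

w = torch.randn(5, requires_grad=True)
loss = (w ** 4).sum()                  # toy loss; exact diag(H) = 12 w^2
diag = hessian_diagonal_estimate(loss, [w])
print(diag[0], 12 * w.detach() ** 2)   # estimate vs. exact diagonal
```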

[186] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong

Main category: cs.CV

TL;DR: CPJ is a training-free few-shot framework that uses structured image captions and LLM-as-Judge to improve agricultural pest VQA performance without fine-tuning.

DetailsMotivation: Existing crop disease diagnosis methods require costly supervised fine-tuning and perform poorly under domain shifts, lacking interpretability.

Method: Caption-Prompt-Judge (CPJ) framework: generates multi-angle image captions using vision-language models, refines them via LLM-as-Judge module, then uses dual-answer VQA for both disease recognition and management responses.

Result: On CDDMBench, CPJ significantly improves performance: GPT-5-mini captions with GPT-5-Nano achieve +22.7 percentage points in disease classification and +19.5 points in QA score over no-caption baselines.

Conclusion: CPJ provides transparent, evidence-based reasoning for robust and explainable agricultural diagnosis without requiring fine-tuning, advancing interpretable crop disease diagnosis.

Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption–Prompt–Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
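
A skeleton of the Caption-Prompt-Judge loop, with stand-ins for the VLM captioner and the LLM-as-Judge; function names, the scoring rule, and the stopping threshold are all illustrative:

```python
# Illustrative CPJ skeleton: caption -> judge -> refine until the
# judge's score clears a threshold. The model calls are stubbed.
def generate_captions(image, n_angles=3):
    # Stand-in for a VLM prompted for complementary structured captions.
    return [f"caption from angle {i} of {image}" for i in range(n_angles)]

def judge_and_refine(captions):
    # Stand-in for the LLM-as-Judge: returns refined captions + a score.
    refined = [c + " (refined)" for c in captions]
    score = min(1.0, 0.4 + 0.2 * len(refined))
    return refined, score

def cpj_captions(image, max_rounds=3, threshold=0.9):
    captions = generate_captions(image)
    for _ in range(max_rounds):
        captions, score = judge_and_refine(captions)
        if score >= threshold:
            break
    return captions  # fed into the dual-answer VQA stage

print(cpj_captions("leaf_image.jpg"))
```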

[187] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Tianyi Zhao, Jiawen Xi, Linhui Xiao, Junnan Li, Xue Yang, Maoxun Yuan, Xingxing Wei

Main category: cs.CV

TL;DR: RGBT-Ground: First large-scale RGB-Thermal visual grounding benchmark for complex real-world scenarios with aligned image pairs and comprehensive annotations.

DetailsMotivation: Existing VG benchmarks lack scene diversity and fail to reflect real-world complexities like illumination changes and weather conditions, limiting evaluation of model robustness for safety-critical applications.

Method: Created RGBT-Ground benchmark with aligned RGB-Thermal image pairs, referring expressions, bounding boxes, and fine-grained annotations. Proposed RGBT-VGNet baseline for multi-modal fusion and unified framework supporting uni-modal (RGB/TIR) and multi-modal inputs.

Result: RGBT-VGNet significantly outperforms adapted existing methods, especially in challenging nighttime and long-distance scenarios.

Conclusion: RGBT-Ground enables robust visual grounding evaluation in complex real-world conditions, with RGBT-VGNet demonstrating effective multi-modal fusion. All resources will be publicly released to advance research.

Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

[188] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang, Kaiyu Li, Jianfei Yang, Quan Wang

Main category: cs.CV

TL;DR: DARFT improves CDVQA performance by addressing decision ambiguity through reinforcement fine-tuning that targets samples with small probability margins between correct answers and strong distractors.

DetailsMotivation: Current CDVQA models using supervised fine-tuning often fail due to decision ambiguity - where models assign similar confidence to correct answers and strong distractors, rather than making clearly wrong predictions.

Method: Proposes DARFT: Decision-Ambiguity-guided Reinforcement Fine-Tuning that first mines Decision-Ambiguous Samples (DAS) using an SFT-trained reference policy, then applies group-relative policy optimization on the mined subset using multi-sample decoding and intra-group relative advantages.

Result: Extensive experiments show consistent gains over SFT baselines, particularly under few-shot settings, demonstrating improved discriminability and robustness.

Conclusion: Explicitly optimizing decision-ambiguous samples is crucial for improving CDVQA model performance, and DARFT effectively addresses this challenge without requiring additional supervision.

Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
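
The DAS mining step reduces to a margin filter over the reference policy's answer probabilities; a minimal sketch (the margin threshold value is illustrative):

```python
# Decision-Ambiguous Sample (DAS) mining: keep samples whose margin
# between the ground-truth answer and the strongest distractor is small.
import torch

def mine_das(probs: torch.Tensor, gt: torch.Tensor, tau: float = 0.1):
    """probs: (N, C) answer probabilities from the SFT reference policy;
    gt: (N,) ground-truth answer indices. Returns DAS indices."""
    p_gt = probs.gather(1, gt.unsqueeze(1)).squeeze(1)
    distractors = probs.clone()
    distractors.scatter_(1, gt.unsqueeze(1), float("-inf"))
    p_top_distractor = distractors.max(dim=1).values
    margin = p_gt - p_top_distractor
    return torch.nonzero(margin.abs() < tau).squeeze(1)

probs = torch.tensor([[0.48, 0.45, 0.07],   # ambiguous: margin 0.03
                      [0.90, 0.05, 0.05]])  # confident: margin 0.85
gt = torch.tensor([0, 0])
print(mine_das(probs, gt))  # -> tensor([0])
```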

[189] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

Main category: cs.CV

TL;DR: SliceLens is a framework that uses LLMs/VLMs for fine-grained error slice discovery in instance-level vision tasks, outperforming existing methods and validated through a new benchmark FeSD.

DetailsMotivation: Current slice discovery methods are limited to image classification and struggle with multi-instance tasks like detection/segmentation. Existing benchmarks are algorithm-specific or biased toward classification with artificial ground truth that doesn't reflect real model failures.

Method: SliceLens uses LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained, interpretable error slices.

Result: SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs 0.31) on FeSD benchmark, and identifies interpretable slices that facilitate actionable model improvements validated through repair experiments.

Conclusion: SliceLens addresses limitations of existing slice discovery methods by enabling fine-grained error analysis for instance-level vision tasks through LLM/VLM-powered hypothesis generation and verification, with strong empirical validation on a new comprehensive benchmark.

Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.

[190] 3D Semantic Segmentation for Post-Disaster Assessment

Nhut Le, Maryam Rahnemoonfar

Main category: cs.CV

TL;DR: Researchers created a specialized 3D dataset from Hurricane Ian UAV footage and tested SOTA 3D segmentation models, revealing their limitations for post-disaster environments.

DetailsMotivation: Natural disasters cause severe threats to lives and economic losses, but existing 3D semantic segmentation models lack datasets specifically designed for post-disaster environments, creating a gap in effective post-disaster assessment.

Method: Constructed a specialized 3D dataset using UAV-captured aerial footage of Hurricane Ian (2022) over affected areas, employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. Evaluated SOTA 3D semantic segmentation models including Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset.

Result: The evaluation exposed significant limitations in existing 3D semantic segmentation methods for disaster-stricken regions, showing that current SOTA models perform poorly in post-disaster environments.

Conclusion: There is an urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response capabilities.

Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset using unmanned aerial vehicles (UAVs)-captured aerial footage of Hurricane Ian (2022) over affected areas, employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models, Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.

[191] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers

Zheng Liu, Jinchao Zhu, Gao Huang

Main category: cs.CV

TL;DR: CLoRA is a novel collaborative low-rank adaptation method that balances fine-tuning performance and parameter efficiency for vision transformers through base-space sharing and diversity enhancement.

DetailsMotivation: Existing LoRA methods either sacrifice performance for parameter efficiency or introduce excessive parameters, failing to balance learning performance and parameter efficiency.

Method: CLoRA uses base-space sharing where all low-rank modules share projection spaces, and sample-agnostic diversity enhancement (SADE) to regularize representation similarities and encourage diversity.

Result: CLoRA achieves better balance between learning performance and parameter efficiency, requiring fewest GFLOPs for point cloud analysis compared to state-of-the-art methods.

Conclusion: CLoRA effectively addresses the trade-off between fine-tuning performance and parameter efficiency in vision transformers through collaborative low-rank adaptation with diversity enhancement.

Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.
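
A minimal sketch of base-space sharing: every low-rank module draws its down/up projections from one shared set of bases and learns only per-module mixing weights. The combination rule and shapes are assumptions, and the SADE regularizer is omitted:

```python
# Toy base-space sharing for LoRA-style modules; the mixing rule is an
# illustrative guess at how shared bases could construct each LRM.
import torch
import torch.nn as nn

class SharedBases(nn.Module):
    def __init__(self, d: int, r: int, n_bases: int):
        super().__init__()
        self.down = nn.Parameter(torch.randn(n_bases, d, r) * 0.02)
        self.up = nn.Parameter(torch.zeros(n_bases, r, d))

class CLoRAModule(nn.Module):
    def __init__(self, bases: SharedBases):
        super().__init__()
        self.bases = bases
        n = bases.down.shape[0]
        self.coef = nn.Parameter(torch.ones(n) / n)  # per-module mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Combine the shared bases into this module's down/up matrices.
        A = torch.einsum("n,ndr->dr", self.coef, self.bases.down)
        B = torch.einsum("n,nrd->rd", self.coef, self.bases.up)
        return x @ A @ B  # low-rank update added to the frozen layer

bases = SharedBases(d=768, r=8, n_bases=4)
layers = [CLoRAModule(bases) for _ in range(12)]  # all share one base set
x = torch.randn(2, 16, 768)
print(layers[0](x).shape)  # torch.Size([2, 16, 768])
```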

[192] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

Panquan Yang, Junfei Huang, Zongzhangbao Yin, Yingsong Hu, Anni Xu, Xinyi Luo, Xueqi Sun, Hai Wu, Sheng Ao, Zhaoxing Zhu, Chenglu Wen, Cheng Wang

Main category: cs.CV

TL;DR: First 3D visual grounding dataset for outdoor monitoring scenarios (roadside infrastructure) with 136K objects and 411K language expressions, plus a new multi-modal method Moni3DVG.

DetailsMotivation: Existing 3D visual grounding focuses on indoor/ego-vehicle perspectives, but roadside infrastructure monitoring remains unexplored due to lack of paired point cloud-text data from roadside sensors.

Method: Constructed MoniRefer dataset from real-world traffic intersections, manually verified labels. Proposed Moni3DVG method that fuses appearance information from images with geometry/optical information from point clouds for multi-modal feature learning.

Result: Created large-scale dataset with 136,018 objects and 411,128 natural language expressions. Extensive experiments show superiority and effectiveness of Moni3DVG method on the new benchmark.

Conclusion: Introduces first 3D visual grounding task for outdoor monitoring scenarios, provides comprehensive dataset, and demonstrates effective multi-modal approach for infrastructure-level traffic scene understanding.

Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is very critical for roadside infrastructure system to interpret natural languages and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on the indoor and outdoor driving scenes, outdoor monitoring scenarios remain unexplored due to scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in the real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we also propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and geometry and optical information from point cloud for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

[193] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning

Shuyuan Lin, Yu Guo, Xiao Chen, Yanjie Liang, Guobao Xiao, Feiran Huang

Main category: cs.CV

TL;DR: A Layer-by-Layer Hierarchical Attention Network for robust feature point matching that handles outliers through stage fusion, hierarchical extraction, and attention mechanisms.

DetailsMotivation: Feature point matching is fundamental but suffers from outlier interference, especially with high outlier proportions. Need to extract high-quality information while reducing negative sample errors.

Method: Proposes Layer-by-Layer Hierarchical Attention Network with: 1) Layer-by-layer channel fusion module to preserve semantic information from each stage, 2) Hierarchical attention module to adaptively capture global perception and structural semantics, 3) Two architectures for feature extraction and integration.

Result: Outperforms state-of-the-art methods on YFCC100M and SUN3D datasets for both outlier removal and camera pose estimation.

Conclusion: The proposed network effectively handles outliers in feature point matching through hierarchical attention and fusion mechanisms, improving matching precision and robustness in computer vision tasks.

Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network’s representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.

[194] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes

Qingyu Xu, Runtong Zhang, Zihuan Qiu, Fanman Meng

Main category: cs.CV

TL;DR: Proposes FRS-YOLO model and FireRescue dataset for improved object detection in fire rescue scenarios, addressing urban scene complexity and limited target classes.

DetailsMotivation: Current fire detection research focuses on forest/mountain environments with limited classes (flames/smoke), lacking comprehensive urban rescue scene coverage and key command decision targets like fire trucks and firefighters.

Method: 1) Creates FireRescue dataset with 15,980 images, 32,000 bounding boxes across urban/mountain/forest/water areas covering 8 key categories. 2) Proposes FRS-YOLO with plug-and-play multidimensional collaborative enhancement attention module for inter-class discrimination and dynamic feature sampler for small target detection.

Result: Experimental results show fire rescue object detection is highly challenging, and FRS-YOLO effectively improves YOLO series models’ detection performance in this context.

Conclusion: The proposed FireRescue dataset and FRS-YOLO model address critical gaps in fire rescue object detection, providing better tools for command decision-making in complex rescue scenarios.

Abstract: Object detection in fire rescue scenarios is importance for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named “FireRescue” for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Secondly, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidi-mensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.

[195] Renormalization Group Guided Tensor Network Structure Search

Maolin Wang, Bowen Yu, Sheng Zhang, Linjie Mi, Wanyu Wang, Yiqi Wang, Pengyue Jia, Xuetao Wei, Zenglin Xu, Ruocheng Guo, Xiangyu Zhao

Main category: cs.CV

TL;DR: RGTN is a physics-inspired tensor network structure search framework that uses renormalization group flows for multi-scale optimization, achieving state-of-the-art compression with 4-600× speedup over existing methods.

DetailsMotivation: Existing tensor network structure search methods face three key limitations: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth evolution, and separated structure-parameter optimization causing inefficiency. These challenges limit computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics.

Method: RGTN uses dynamic scale-transformation with renormalization group flows for continuous structure evolution across resolutions. Key innovations include learnable edge gates for topology modification during optimization, and intelligent proposals based on physical quantities like node tension (measuring local stress) and edge information flow (quantifying connectivity importance). The method starts from low-complexity coarse scales and refines to finer ones, escaping local minima via scale-induced perturbations.

Result: Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods. The framework demonstrates superior computational efficiency and effectiveness in discovering optimal tensor network structures.

Conclusion: RGTN successfully transforms tensor network structure search through physics-inspired multi-scale renormalization group flows, overcoming key limitations of existing methods. The approach validates the effectiveness of incorporating physical principles into tensor decomposition optimization, enabling more efficient and robust structure discovery for high-dimensional data representation.

Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods, validating the effectiveness of our physics-inspired approach.
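
A toy version of the learnable edge gate, the mechanism that lets RGTN modify topology during optimization: each rank of an edge gets a sigmoid gate that can smoothly switch off. Gate granularity and the pruning rule are assumptions:

```python
# Toy learnable edge gate for a tensor-network contraction; ranks whose
# gates drive to zero effectively remove that part of the edge.
import torch
import torch.nn as nn

class GatedEdge(nn.Module):
    def __init__(self, rank: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(rank))

    def forward(self, core_a: torch.Tensor, core_b: torch.Tensor):
        """Contract two cores over a gated edge of size `rank`:
        core_a: (..., rank), core_b: (rank, ...)."""
        g = torch.sigmoid(self.gate_logits)   # per-rank soft gate
        return (core_a * g) @ core_b          # gated contraction

edge = GatedEdge(rank=4)
a, b = torch.randn(6, 4), torch.randn(4, 5)
print(edge(a, b).shape)  # torch.Size([6, 5]); gated ranks can drop out
```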

[196] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

Siyang Wang, Hanting Li, Wei Li, Jie Hu, Xinghao Chen, Feng Zhao

Main category: cs.CV

TL;DR: RadAR accelerates autoregressive visual generation using radial topology and parallel ring-wise prediction with nested attention for error correction.

DetailsMotivation: Autoregressive models in visual generation suffer from low inference efficiency due to sequential token-by-token decoding, despite their success in language modeling.

Method: Organizes generation around radial topology: starts with center token, groups others into concentric rings by spatial distance, predicts tokens ring-wise from inner to outer. Uses nested attention mechanism to refine implausible outputs during forward pass.

Result: Significantly improves generation efficiency while preserving representational capacity by exploiting visual tokens’ local dependencies and spatial correlations.

Conclusion: RadAR provides an efficient parallelizable framework for autoregressive visual generation that maintains spatial coherence and prevents error accumulation through radial prediction and dynamic correction.

Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors–a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
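
The radial grouping itself is simple to state in code: bucket the positions of an HxW token grid into concentric rings by distance from the center token, then decode ring by ring. The Chebyshev distance below is an assumption; the abstract only specifies spatial distance from the center:

```python
# Radial token ordering: group grid positions into concentric rings
# around the center; all tokens in one ring decode in parallel.
import numpy as np

def radial_rings(h: int, w: int):
    cy, cx = h // 2, w // 2
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.maximum(np.abs(ys - cy), np.abs(xs - cx))  # Chebyshev rings
    return [np.argwhere(dist == r) for r in range(dist.max() + 1)]

rings = radial_rings(8, 8)
for r, ring in enumerate(rings):
    print(f"ring {r}: {len(ring)} tokens decoded in parallel")
```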

[197] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao

Main category: cs.CV

TL;DR: EVOL-SAM3 is a zero-shot reasoning segmentation framework that uses evolutionary search at inference time instead of traditional training methods, achieving state-of-the-art performance on ReasonSeg benchmark.

DetailsMotivation: Current reasoning segmentation methods have limitations: SFT suffers from catastrophic forgetting and domain dependency, RL has training instability and rigid reward functions, while training-free methods are limited by static single-pass inference that lacks reasoning depth and self-correction capabilities.

Method: Proposes EVOL-SAM3, a zero-shot framework that reformulates reasoning segmentation as inference-time evolutionary search. It maintains a population of prompt hypotheses and refines them through a “Generate-Evaluate-Evolve” loop with Visual Arena for fitness assessment via reference-free pairwise tournaments, Semantic Mutation operator for diversity injection, and Heterogeneous Arena for final selection.

Result: Extensive experiments show EVOL-SAM3 substantially outperforms static baselines and significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting.

Conclusion: EVOL-SAM3 successfully addresses limitations of current reasoning segmentation approaches by introducing evolutionary search at inference time, enabling deeper reasoning, self-correction, and superior performance without requiring training.

Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass “generate-then-segment” chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a “Generate-Evaluate-Evolve” loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
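
A skeleton of the Generate-Evaluate-Evolve loop, with stand-ins for the Visual Arena tournament and Semantic Mutation; the toy fitness rule and population bookkeeping are illustrative, not the released implementation:

```python
# Illustrative Generate-Evaluate-Evolve skeleton over prompt hypotheses;
# `compare` stands in for the Visual Arena's pairwise tournament and
# `mutate` for the Semantic Mutation operator.
import random

def arena_tournament(population, compare):
    # compare(a, b) -> winner; reference-free pairwise selection.
    survivors = []
    random.shuffle(population)
    for a, b in zip(population[::2], population[1::2]):
        survivors.append(compare(a, b))
    return survivors

def evolve_prompts(seed_prompts, compare, mutate, generations=3):
    population = list(seed_prompts)
    for _ in range(generations):
        survivors = arena_tournament(population, compare)
        population = survivors + [mutate(p) for p in survivors]
    return arena_tournament(population, compare)[0]  # final selection

best = evolve_prompts(
    ["the red mug", "mug on the left", "ceramic cup near laptop"],
    compare=lambda a, b: max(a, b, key=len),  # toy fitness: longer wins
    mutate=lambda p: p + " (rephrased)",
)
print(best)
```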

[198] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

Main category: cs.CV

TL;DR: FlowBlending is a stage-aware multi-model sampling strategy that uses large models for capacity-sensitive early/late stages and small models for intermediate stages, achieving up to 1.65x faster inference with 57.35% fewer FLOPs while maintaining quality.

DetailsMotivation: The authors observed that model capacity impact varies across diffusion timesteps - crucial for early and late stages but negligible during intermediate stages. This insight motivates a more efficient approach to video generation by strategically allocating computational resources.

Method: FlowBlending uses a stage-aware multi-model sampling strategy: large model for capacity-sensitive early/late stages, small model for intermediate stages. Introduces criteria for stage boundaries and uses velocity-divergence analysis to identify capacity-sensitive regions. Compatible with existing sampling-acceleration techniques.

Result: Achieves up to 1.65x faster inference with 57.35% fewer FLOPs across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B) while maintaining visual fidelity, temporal coherence, and semantic alignment of large models. Compatible with existing techniques enables up to 2x additional speedup.

Conclusion: FlowBlending provides an efficient video generation framework by strategically allocating model capacity based on timestep sensitivity, achieving significant speedups without quality degradation, and demonstrating compatibility with other acceleration methods.

Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
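
The sampling rule is easy to sketch: run an ODE solver over the flow schedule and switch which network predicts the velocity depending on the stage. The 0.2/0.8 boundaries below are placeholders; the paper chooses them via its velocity-divergence analysis:

```python
# Stage-aware multi-model sampling: big model at the capacity-sensitive
# early/late stages, small model in between. Boundaries are illustrative.
import torch

def flowblending_sample(x, big_model, small_model, steps=50,
                        early=0.2, late=0.8):
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t = ts[i].item()
        model = big_model if (t < early or t >= late) else small_model
        v = model(x, t)                    # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v    # Euler step along the flow
    return x

# Toy stand-ins for the two denoisers:
big = lambda x, t: -x
small = lambda x, t: -x
out = flowblending_sample(torch.randn(1, 4, 8, 8), big, small)
print(out.shape)
```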

[199] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu

Main category: cs.CV

TL;DR: EchoFoley introduces a new video-grounded sound generation task with fine-grained control, addressing limitations in current VT2A models through symbolic event representation and a large curated dataset.

DetailsMotivation: Current video-text-to-audio (VT2A) models have three key limitations: visual-text conditioning imbalance causing visual dominance, lack of fine-grained controllability definitions, and weak instruction understanding due to brief categorical tags in existing datasets.

Method: Proposes EchoFoley task with symbolic representation for sounding events (when, what, how), constructs EchoFoley-6k dataset (6k video-instruction-annotation triplets), and develops EchoVidia framework with slow-fast thinking strategy for agentic generation.

Result: EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality, demonstrating significant improvements in fine-grained sound generation control.

Conclusion: The EchoFoley framework successfully addresses key limitations in video-grounded sound generation by enabling fine-grained control through symbolic event representation and hierarchical semantic control, with substantial improvements over existing methods.

Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advancement in video-text-to-audio (VT2A), the current formulation faces three key limitations: First, an imbalance between visual and textual conditioning that leads to visual dominance; Second, the absence of a concrete definition for fine-grained controllable generation; Third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
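
A guessed-at minimal schema for the when/what/how symbolic event representation; the field names are illustrative, not the released annotation format:

```python
# Hypothetical sounding-event schema: "when" as timing, "what" as the
# source, "how" as the manner. Event-level edits become list/field ops.
from dataclasses import dataclass

@dataclass
class SoundingEvent:
    start_s: float   # when: onset in seconds
    end_s: float     # when: offset in seconds
    source: str      # what: the sounding object or action
    manner: str      # how: timbre / intensity / style description

events = [
    SoundingEvent(1.2, 1.8, "glass shatters", "sharp, close-miked"),
    SoundingEvent(2.0, 6.5, "rain on window", "soft, looping ambience"),
]
# Editing one event's manner while keeping its timing and source:
events[1].manner = "heavy downpour, distant thunder"
print(events)
```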

[200] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

Xiang Liu, Yimin Zhou, Jinxiang Wang, Yujun Huang, Shuzhao Xie, Shiyu Qin, Mingyao Hong, Jiawei Li, Yaowei Wang, Zhi Wang, Shu-Tao Xia, Bin Chen

Main category: cs.CV

TL;DR: Splatwizard is a unified benchmark toolkit for evaluating 3D Gaussian Splatting compression models, addressing the lack of standardized evaluation tools in this rapidly growing field.

DetailsMotivation: The rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack specific metrics to holistically assess unique characteristics like rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy.

Method: Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques. It includes an integrated pipeline that automates calculation of key performance indicators including image-based quality metrics, chamfer distance of reconstructed mesh, rendering frame rates, and computational resource consumption.

Result: The authors introduce Splatwizard as a unified benchmark toolkit specifically designed for benchmarking 3DGS compression models, with code available at https://github.com/splatwizard/splatwizard.

Conclusion: Splatwizard addresses the critical gap in standardized evaluation tools for 3DGS compression models, providing comprehensive metrics and an easy-to-use framework that will facilitate fair comparison and advancement in this rapidly evolving field.

Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques proposed by previous work. Besides, an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, chamfer distance of reconstructed meshes, rendering frame rates, and computational resource consumption, is included in the framework as well. Code is available at https://github.com/splatwizard/splatwizard.
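
Of the listed metrics, chamfer distance is the geometric one; a generic two-sided definition over point sets (not necessarily Splatwizard's exact implementation, which operates on reconstructed meshes):

```python
# Generic two-sided chamfer distance between point sets sampled from a
# reconstructed and a reference surface.
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 3), b: (M, 3) point clouds."""
    d = torch.cdist(a, b)  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

a = torch.rand(1000, 3)
b = a + 0.01 * torch.randn(1000, 3)
print(chamfer_distance(a, b).item())
```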

[201] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Ankit Dhiman, Srinath R, Jaswanth Reddy, Lokesh R Boregowda, Venkatesh Babu Radhakrishnan

Main category: cs.CV

TL;DR: 3DGS-NeRF segmentation framework with unified learnable embeddings and boundary-aware hard mining for improved instance segmentation.

DetailsMotivation: Existing 3D segmentation methods suffer from inconsistent 2D instance labels across views, requiring two-stage approaches with hyperparameter-sensitive clustering or preprocessing. Need unified framework for better efficiency and performance.

Method: Propose unified framework with learnable feature embeddings in Gaussian primitives, decoded via “Embedding-to-Label” process. Address boundary artifacts with hard-mining on boundaries, stabilized by applying linear layer to rasterized embeddings before triplet loss.

Result: Outperforms baselines qualitatively and quantitatively on ScanNet, Replica3D, and Messy-Rooms datasets. Reduces training time while improving segmentation performance.

Conclusion: Unified framework with learnable embeddings and boundary-aware hard mining effectively addresses 2D label inconsistency, improves 3D segmentation performance, and reduces training complexity compared to existing two-stage approaches.

Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel “Embedding-to-Label” process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
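
The stabilization trick described in the method (a linear layer applied to the rasterized embeddings before the triplet loss) can be sketched in a few lines of PyTorch; the shapes and the boundary sampling below are simplified assumptions.

```python
# Sketch of boundary-aware hard mining with a stabilizing linear projection
# before the triplet loss. The anchor/positive/negative sampling along object
# boundaries is a placeholder; only the projection-then-triplet pattern is
# the point being illustrated.
import torch
import torch.nn as nn

embed_dim = 16
project = nn.Linear(embed_dim, embed_dim)   # applied to rasterized embeddings
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(256, embed_dim)  # boundary pixels of one instance
positive = torch.randn(256, embed_dim)  # other pixels of the same instance
negative = torch.randn(256, embed_dim)  # hard negatives from adjacent instances

loss = triplet(project(anchor), project(positive), project(negative))
loss.backward()  # in the full system, gradients flow through the projection
                 # back into the per-Gaussian embeddings rendered at these pixels
```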

[202] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

Takeru Kusakabe, Yudai Hirose, Mashiho Mukaida, Satoshi Ono

Main category: cs.CV

TL;DR: Proposes a projection-based adversarial attack method using physics-in-the-loop optimization to create adversarial examples that cause depth misestimation in monocular depth estimation models.

DetailsMotivation: DNN-based monocular depth estimation models are vulnerable to adversarial attacks, threatening their reliability in practical applications. There's a critical need to validate and understand these vulnerabilities to enhance robustness.

Method: Uses a projection-based adversarial attack that projects perturbation light onto target objects. Employs physics-in-the-loop optimization to evaluate candidate solutions in actual environments, accounting for device specifications and disturbances. Utilizes a distributed covariance matrix adaptation evolution strategy for optimization.

Result: The method successfully created adversarial examples that lead to depth misestimations, causing parts of objects to disappear from the target scene.

Conclusion: The study demonstrates the vulnerability of DNN-based MDE models to physical adversarial attacks and validates the effectiveness of the proposed projection-based attack method with physics-in-the-loop optimization.

Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization – evaluating candidate solutions in actual environments to account for device specifications and disturbances – and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.
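
Since the optimization is black-box (fitness is measured in the physical setup), an evolution-strategy loop is the natural skeleton. Below is a sketch using the open-source cma package; the projector/camera/depth-model interaction is reduced to a hypothetical stub.

```python
# Physics-in-the-loop optimization sketch with CMA-ES (pip install cma).
# The fitness stub stands in for: project the candidate pattern, capture the
# scene with a camera, run the depth model, and measure the depth error.
import numpy as np
import cma

def fitness(pattern: np.ndarray) -> float:
    # Hypothetical stand-in for the real measurement loop:
    #   image = capture(project(pattern)); depth = mde_model(image)
    #   return -depth_error(depth, true_depth)   # negated: CMA-ES minimizes
    return -float(np.random.rand())

es = cma.CMAEvolutionStrategy(np.zeros(64), 0.3, {"maxiter": 50})
while not es.stop():
    candidates = es.ask()                        # sample perturbation patterns
    es.tell(candidates, [fitness(np.asarray(c)) for c in candidates])
best_pattern = es.result.xbest                   # strongest attack found
```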

[203] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage, Rico Sennrich

Main category: cs.CV

TL;DR: A method that enables 2D-trained cross-modal systems to adapt to 3D scenes using in-scene camera control via improved mutual information estimation and regret minimization.

DetailsMotivation: Cross-modal systems trained on 2D visual inputs face a dimensional shift when processing 3D scenes, requiring a control module for in-scene cameras to bridge this gap.

Method: Improves multivariate mutual information estimates through regret minimization with derivative-free optimization, enabling off-the-shelf 2D-trained systems to adapt online to object occlusions and differentiate features.

Result: The pipeline improves performance in cross-modal tasks on multi-object 3D scenes without requiring pretraining or finetuning.

Conclusion: Pairing expressive measures with value-based optimization allows effective control of in-scene cameras to learn directly from noisy vision-language model outputs, enabling 2D-trained systems to handle 3D scenes.

Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

[204] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training

Andrew Tinits, Stephen Mann

Main category: cs.CV

TL;DR: The paper addresses Noise2Noise’s limitation with nonlinear functions on noisy targets by identifying specific nonlinearities that minimize bias, enabling effective HDR image denoising without clean training data.

DetailsMotivation: Noise2Noise training requires only noisy image pairs but fails with nonlinear preprocessing of targets, limiting practical applications like HDR image denoising where tone mapping is essential.

Method: Developed theoretical framework to analyze nonlinear function effects, identified class of nonlinear functions with minimal bias, and applied specific loss-tone mapping combinations for HDR Monte Carlo rendering denoising.

Result: Successfully denoised HDR images using only noisy training data, achieving results comparable to original implementation that required high-sample count reference images.

Conclusion: Certain nonlinear functions can be applied to noisy targets in Noise2Noise training with minimal bias, enabling practical applications like HDR image denoising without clean training data.

Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias. We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
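
As a concrete (hedged) instance of the idea, a Reinhard-style tone map t(x) = x/(1+x) can be applied to both the prediction and the noisy target before an L2 loss; which loss/tone-map pairs actually keep the bias small is precisely what the paper characterizes.

```python
# Tone-mapped Noise2Noise sketch for HDR data: the nonlinearity is applied to
# BOTH the model output and the noisy target. The Reinhard-style map here is
# an illustrative choice, not necessarily the combination the paper recommends.
import torch

def tonemap(x: torch.Tensor) -> torch.Tensor:
    return x / (1.0 + x)  # compresses HDR range, taming outliers ("fireflies")

def n2n_loss(pred_hdr: torch.Tensor, noisy_target_hdr: torch.Tensor) -> torch.Tensor:
    return torch.mean((tonemap(pred_hdr) - tonemap(noisy_target_hdr)) ** 2)

pred = (torch.rand(4, 3, 64, 64) * 10).requires_grad_()  # HDR model output
target = torch.rand(4, 3, 64, 64) * 10                   # independent noisy render
n2n_loss(pred, target).backward()
```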

[205] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

Md Ahmed Al Muzaddid, Jordan A. James, William J. Beksi

Main category: cs.CV

TL;DR: CropTrack is a novel multiple-object tracking framework for agriculture that combines appearance and motion information to handle repetitive patterns, similar appearances, and frequent occlusions, outperforming traditional motion-based methods.

DetailsMotivation: Agricultural MOT faces unique challenges: repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Current trackers rely mainly on motion but struggle with identity preservation during strong occlusions, while appearance-based association is difficult due to high object similarity.

Method: CropTrack integrates appearance and motion information with three key components: 1) reranking-enhanced appearance association, 2) one-to-many association with appearance-based conflict resolution, and 3) exponential moving average prototype feature bank to improve appearance-based association.

Result: CropTrack demonstrates consistent identity preservation on agricultural MOT datasets, outperforming traditional motion-based tracking methods. It achieves significant gains in identification F1 and association accuracy scores with fewer identity switches compared to state-of-the-art methods.

Conclusion: The proposed CropTrack framework effectively addresses agricultural MOT challenges by combining appearance and motion information, providing a robust solution for maintaining object identities despite frequent occlusions and similar appearances in agricultural environments.

Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with an appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with fewer identity switches.
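
The exponential moving average prototype bank lends itself to a compact sketch; the update and cosine-matching rules below are a simplified rendering of that component, with illustrative names.

```python
# EMA prototype feature bank sketch for appearance association: each track
# keeps one L2-normalized prototype, smoothly updated from new detections.
import numpy as np

class PrototypeBank:
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.prototypes: dict[int, np.ndarray] = {}  # track_id -> unit feature

    def update(self, track_id: int, feature: np.ndarray) -> None:
        f = feature / np.linalg.norm(feature)
        if track_id in self.prototypes:
            blended = self.momentum * self.prototypes[track_id] + (1 - self.momentum) * f
            f = blended / np.linalg.norm(blended)
        self.prototypes[track_id] = f

    def match(self, feature: np.ndarray) -> tuple[int, float]:
        """Return (best track id, cosine similarity) for a detection feature."""
        f = feature / np.linalg.norm(feature)
        scores = {tid: float(p @ f) for tid, p in self.prototypes.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]
```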

[206] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Xunyi Zhao, Gengze Zhou, Qi Wu

Main category: cs.CV

TL;DR: MLLMs show poor context awareness and 3D spatial reasoning in embodied navigation tasks despite their general vision-language capabilities, as revealed by the VLN-MME evaluation framework.

DetailsMotivation: To investigate MLLMs' potential as embodied agents for vision-language navigation tasks, which require multi-round dialogue, spatial reasoning, and sequential action prediction - capabilities that need further exploration beyond standard vision-language tasks.

Method: Introduces VLN-MME, a unified evaluation framework that bridges traditional navigation datasets into a standardized benchmark for probing MLLMs as zero-shot agents. Uses a highly modular and accessible design to enable structured comparisons and component-level ablations across MLLM architectures, agent designs, and navigation tasks.

Result: Enhancing baseline agents with Chain-of-Thought reasoning and self-reflection leads to unexpected performance decrease, revealing that MLLMs exhibit poor context awareness in embodied navigation tasks. While they can follow instructions and structure output, their 3D spatial reasoning fidelity is low.

Conclusion: VLN-MME lays groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings, revealing limitations in their sequential decision-making capabilities. These findings offer crucial guidance for MLLM post-training as embodied agents.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

[207] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou

Main category: cs.CV

TL;DR: ShowUI-π is a flow-based generative model for GUI agents that enables both discrete clicks and continuous drag actions, addressing limitations of existing GUI agents that can’t handle closed-loop trajectories.

DetailsMotivation: Existing GUI agents only support discrete click predictions (x,y coordinates), which prevents them from performing free-form, closed-loop trajectories like dragging progress bars that require continuous perception and adjustment.

Method: Three key designs: 1) Unified Discrete-Continuous Actions integrating clicks and drags in a shared model; 2) Flow-based Action Generation using a lightweight action expert to predict incremental cursor adjustments; 3) Drag Training data and Benchmark with 20K drag trajectories across five domains and ScreenDrag benchmark.

Result: ShowUI-π achieves 26.98 score on ScreenDrag benchmark, outperforming proprietary GUI agents (Operator: 13.27, Gemini-2.5-CUA: 22.18) with only 450M parameters.

Conclusion: The work demonstrates both the difficulty of drag tasks and the effectiveness of the proposed approach, advancing GUI agents toward human-like dexterous control in digital environments.

Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents’ drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
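
The flow-based action expert can be illustrated with a standard flow-matching training step: regress the constant velocity that transports noise toward the ground-truth cursor deltas. The tiny MLP and tensor shapes below are placeholders, not ShowUI-$π$'s actual architecture.

```python
# Flow-matching sketch for a continuous drag-action head: learn a velocity
# field from noise to ground-truth (dx, dy) cursor increments, conditioned on
# visual context. Architecture and shapes are illustrative only.
import torch
import torch.nn as nn

action_dim, cond_dim = 2, 128
expert = nn.Sequential(nn.Linear(action_dim + cond_dim + 1, 256),
                       nn.SiLU(), nn.Linear(256, action_dim))

def flow_matching_step(actions: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)        # interpolation time in [0, 1]
    x_t = (1 - t) * noise + t * actions        # point on the straight path
    target_velocity = actions - noise          # constant along that path
    pred = expert(torch.cat([x_t, cond, t], dim=-1))
    return torch.mean((pred - target_velocity) ** 2)

loss = flow_matching_step(torch.randn(32, action_dim), torch.randn(32, cond_dim))
loss.backward()
```

At inference, integrating the learned velocity field over a few steps yields smooth incremental cursor adjustments rather than a single click coordinate.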

[208] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation

Meng Lan, Lefei Zhang, Xiaomeng Li

Main category: cs.CV

TL;DR: OFL-SAM2 is a prompt-free adaptation of SAM2 for medical image segmentation that uses limited annotated data to train a lightweight mapping network, eliminating the need for manual prompts and achieving state-of-the-art performance with minimal training data.

DetailsMotivation: SAM2 shows promise for medical image segmentation but requires extensive annotated data and manual prompts, which are labor-intensive and require medical expertise. There's a need for label-efficient adaptation that eliminates the prompt requirement.

Method: Proposes OFL-SAM2 with two key components: (1) an online few-shot learner that trains a lightweight mapping network using limited data to generate target features, and (2) an adaptive fusion module that dynamically integrates these target features with SAM2’s memory-attention features. The mapping network supports online parameter updates during inference for better generalization.

Result: Extensive experiments on three diverse medical image segmentation datasets demonstrate state-of-the-art performance with limited training data.

Conclusion: OFL-SAM2 successfully adapts SAM2 to medical image segmentation in a prompt-free, label-efficient manner, overcoming the challenges of manual prompting and extensive annotation requirements while maintaining strong performance.

Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter updates during inference, enhancing the model’s generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
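
The two components lend themselves to a compact sketch: fit a small mapping network on the annotated support slices, then gate its output into the frozen-SAM2 feature stream. Everything below (shapes, the fusion rule, the loss) is a simplifying assumption rather than the paper's exact design.

```python
# Online few-shot learner sketch: a lightweight 1x1-conv mapper is fitted on a
# handful of annotated slices, and its target features are adaptively fused
# with frozen-SAM2 features. All shapes and losses are illustrative placeholders.
import torch
import torch.nn as nn

feat_dim = 256
mapper = nn.Sequential(nn.Conv2d(feat_dim, feat_dim, 1), nn.ReLU(),
                       nn.Conv2d(feat_dim, feat_dim, 1))
gate = torch.zeros(1, requires_grad=True)  # scalar for adaptive fusion

def few_shot_fit(support_feats, support_masks, steps=50):
    """support_feats: (N, C, H, W) frozen features; support_masks: (N, 1, H, W)."""
    opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
    for _ in range(steps):
        logits = mapper(support_feats).mean(dim=1, keepdim=True)  # coarse logits
        loss = nn.functional.binary_cross_entropy_with_logits(logits, support_masks)
        opt.zero_grad(); loss.backward(); opt.step()

def fuse(sam2_feats, frame_feats):
    """Blend frozen memory-attention features with mapped target features."""
    return sam2_feats + torch.sigmoid(gate) * mapper(frame_feats)
```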

[209] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao

Main category: cs.CV

TL;DR: FinMMDocR is a bilingual multimodal benchmark for evaluating MLLMs on real-world financial numerical reasoning, featuring scenario awareness, complex document understanding, and multi-step computation challenges.

DetailsMotivation: Existing benchmarks lack the complexity of real-world financial reasoning, which requires understanding implicit scenarios, processing lengthy financial documents, and performing multi-step computations with cross-page evidence.

Method: Created a benchmark with 1,200 expert-annotated problems incorporating 12 types of implicit financial scenarios, 837 bilingual documents averaging 50.8 pages with rich visual elements, and problems requiring average 11-step reasoning (5.3 extraction + 5.7 calculation steps).

Result: The best-performing MLLM achieves only 58.0% accuracy, and different RAG methods show significant performance variations, demonstrating the benchmark’s challenging nature.

Conclusion: FinMMDocR provides a comprehensive benchmark to drive improvements in MLLMs and reasoning-enhanced methods for complex multimodal reasoning in real-world financial scenarios.

Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

[210] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

Itallo Patrick Castro Alves Da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza, Baldoino Fonseca dos Santos Neto, Marcio de Medeiros Ribeiro

Main category: cs.CV

TL;DR: Comprehensive evaluation shows model compression techniques (quantization, pruning, clustering) can preserve or even improve robustness against natural corruption, with customized combinations offering best multi-objective results for efficient deployment.

DetailsMotivation: Compressed deep learning models are essential for resource-constrained devices, but compression may affect robustness under natural corruption. There's a need to evaluate robustness while validating computer vision systems for real-world deployment.

Method: Evaluated compression techniques (quantization, pruning, weight clustering) individually and in combination on CNNs (ResNet-50, VGG-19, MobileNetV2). Used CIFAR-10-C and CIFAR-100-C datasets for robustness testing. Applied multiobjective assessment to analyze trade-offs between robustness, accuracy, and compression ratio.

Result: Certain compression strategies not only preserve but can improve robustness, especially on networks with more complex architectures. Customized technique combinations produce beneficial multi-objective results. Best configurations identified through multiobjective assessment.

Conclusion: Study provides insights for selecting compression methods that maintain robustness while achieving efficiency, enabling robust and efficient model deployment in corrupted real-world environments.

Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques - quantization, pruning, and weight clustering - applied individually and in combination to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multiobjective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
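
For readers who want the shape of the protocol, a minimal PyTorch version of two of the techniques (L1 pruning and dynamic quantization) plus the corrupted-set accuracy loop looks like this; the loader for CIFAR-10-C and the paper's exact configurations are left as placeholders.

```python
# Sketch of the evaluation protocol: compress a CNN, then score it on natural
# corruptions. Weight clustering is omitted; the corrupted-data loader is a
# placeholder standing in for CIFAR-10-C at one corruption type and severity.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet50

model = resnet50(num_classes=10)

# L1-magnitude pruning of 50% of the weights in every conv layer.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the zeros into the weights

# Dynamic int8 quantization of the linear layers.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

@torch.no_grad()
def accuracy(net, loader):
    correct = total = 0
    for images, labels in loader:
        correct += (net(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```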

[211] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

Bartłomiej Olber, Jakub Winter, Paweł Wawrzyński, Andrii Gamalii, Daniel Górniak, Marcin Łojek, Robert Nowak, Krystian Radlak

Main category: cs.CV

TL;DR: A novel lidar domain adaptation method using neuron activation patterns achieves SOTA performance by annotating only a small, diverse subset of target domain samples, preventing weight drift with continual learning techniques.

DetailsMotivation: 3D object detectors trained on standard benchmarks struggle to generalize across different geographic domains (e.g., US-trained models performing poorly in Asia/Europe), creating a need for effective domain adaptation methods that don't require extensive re-annotation.

Method: The method uses neuron activation patterns to identify a small, representative, and diverse subset of target domain samples for annotation. It combines this selective annotation approach with post-training techniques inspired by continual learning to prevent weight drift from the original model.

Result: The proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques in empirical evaluation, achieving strong performance with very small annotation budget.

Conclusion: Effective lidar domain adaptation can be achieved by strategically annotating only a small, well-selected subset of target domain data using neuron activation patterns, combined with continual learning techniques to maintain model stability.

Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.
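
The sample-selection idea can be sketched as clustering per-sample activation signatures and annotating one medoid per cluster; how the paper actually constructs its activation patterns may differ from the pooled feature vectors assumed here.

```python
# Diversity-aware selection sketch: cluster target-domain activation
# signatures and pick each cluster's medoid for annotation. The signature
# extraction (e.g., a pooled penultimate-layer vector) is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(signatures: np.ndarray, budget: int) -> np.ndarray:
    """signatures: (num_samples, dim). Returns `budget` diverse sample indices."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(signatures)
    chosen = []
    for k in range(budget):
        members = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(signatures[members] - km.cluster_centers_[k], axis=1)
        chosen.append(members[d.argmin()])   # medoid: most central real sample
    return np.asarray(chosen)

picks = select_for_annotation(np.random.rand(5000, 128), budget=50)
```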

[212] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Xinran Gong, Gorkem Durak, Halil Ertugrul Aktas, Vedat Cicek, Jinkui Hao, Ulas Bagci, Nilay S. Shah, Bo Zhou

Main category: cs.CV

TL;DR: ProDM is a generative diffusion model that corrects motion artifacts in non-gated chest CT scans to improve coronary artery calcium scoring accuracy.

DetailsMotivation: Non-gated chest CT scans are widely available for cardiovascular risk assessment but suffer from motion artifacts that degrade CAC scoring accuracy, while ECG-gated cardiac CTs are limited in routine use due to gating requirements and insurance coverage issues.

Method: ProDM uses a three-component approach: 1) CAC motion simulation data engine to synthesize realistic non-gated CTs from gated CTs, 2) property-aware learning with calcium-specific priors and differentiable calcium consistency loss, and 3) progressive correction scheme across diffusion steps for enhanced stability.

Result: ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared to baselines, and a reader study confirms it suppresses motion artifacts and improves clinical usability.

Conclusion: The progressive, property-aware framework demonstrates potential for reliable CAC quantification from routine chest CT imaging, offering an accessible alternative to gated CTs for cardiovascular risk assessment.

Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered, as it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.
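
The calcium consistency idea admits a compact differentiable sketch: gate voxels with a soft threshold around the clinical 130 HU cutoff and match the resulting "calcium mass" between corrected and ground-truth volumes. The sharpness constant and mass definition below are illustrative assumptions, not ProDM's exact loss.

```python
# Differentiable calcium-consistency sketch: a sigmoid soft threshold around
# 130 HU selects calcified voxels, and the total gated intensity is matched.
# The temperature tau and the mass definition are illustrative choices.
import torch

def soft_calcium_mask(hu: torch.Tensor, threshold: float = 130.0, tau: float = 10.0):
    return torch.sigmoid((hu - threshold) / tau)  # differentiable HU gating

def calcium_consistency_loss(pred_hu: torch.Tensor, gt_hu: torch.Tensor):
    pred_mass = (soft_calcium_mask(pred_hu) * pred_hu).sum()
    gt_mass = (soft_calcium_mask(gt_hu) * gt_hu).sum()
    return torch.abs(pred_mass - gt_mass) / gt_mass.clamp(min=1.0)
```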

[213] VIPER: Process-aware Evaluation for Generative Video Reasoning

Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu

Main category: cs.CV

TL;DR: VIPER is a new benchmark for evaluating video generation models’ reasoning processes, introducing a process-aware evaluation paradigm with a novel metric (POC@r) that measures consistency between intermediate steps and final outcomes.

DetailsMotivation: Existing evaluation frameworks for video generation models rely on single-frame assessments, which can lead to outcome-hacking where models reach correct conclusions through erroneous processes. There's a need for process-aware evaluation to assess true reasoning capabilities.

Method: Proposed VIPER benchmark with 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning domains. Introduced Process-outcome Consistency (POC@r) metric using VLM-as-Judge with hierarchical rubric to evaluate both intermediate step validity and final results.

Result: State-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. Test-time scaling and sampling robustness analysis reveals substantial gap between current video generation and true generalized visual reasoning.

Conclusion: Current video generation models lack genuine reasoning capabilities despite correct final outcomes. The VIPER benchmark provides a comprehensive framework for process-aware evaluation, highlighting the need for improved reasoning in video generation models.

Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
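
Whatever the judge's rubric, the aggregation behind a POC@r-style score is simple to state: a sample counts only when the outcome is correct and the judged process validity reaches the threshold r. The sketch below assumes scalar process scores in [0, 1]; the hierarchical VLM rubric itself is the paper's contribution.

```python
# POC@r-style aggregation sketch: credit only samples whose outcome is correct
# AND whose judged process validity is at least r. Per-sample scores are
# assumed to come from the VLM judge.
def poc_at_r(process_scores: list[float], outcomes: list[bool], r: float) -> float:
    hits = sum(1 for p, ok in zip(process_scores, outcomes) if ok and p >= r)
    return hits / len(outcomes)

scores   = [1.0, 0.6, 1.0, 0.2]
outcomes = [True, True, False, True]
print(poc_at_r(scores, outcomes, r=1.0))  # 0.25: only fully valid processes count
```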

[214] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

Main category: cs.CV

TL;DR: DarkEQA is a benchmark for evaluating Vision Language Models’ perceptual capabilities under low-light conditions, revealing their limitations in dark environments through physics-based visual degradation simulations.

DetailsMotivation: Current VLM benchmarks focus on ideal lighting conditions, but real-world embodied agents need to operate 24/7 in various lighting conditions including low-light environments. There's a gap in evaluating VLMs' robustness to visual degradations like darkness, which is crucial for practical deployment.

Method: Created DarkEQA benchmark with physics-based visual degradations modeled in linear RAW space, simulating illumination drop and sensor noise followed by ISP-inspired rendering. Evaluates question answering from egocentric observations under controlled multi-level low-light conditions to isolate perception bottlenecks.

Result: Systematic evaluation of state-of-the-art VLMs and Low-Light Image Enhancement models reveals significant limitations in VLMs’ performance under challenging low-light conditions, demonstrating the need for improved robustness.

Conclusion: DarkEQA addresses a critical gap in VLM evaluation by providing a benchmark for low-light robustness assessment, revealing current models’ limitations and enabling attributable analysis for future improvements in embodied AI systems.

Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments – a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs’ limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
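
The degradation model is easy to picture in code: linearize, dim, add shot and read noise, re-render. The constants below are illustrative; DarkEQA's calibrated sensor parameters and ISP pipeline will differ.

```python
# Physics-inspired low-light degradation sketch: approximate inverse ISP to a
# linear space, scale illumination, add Poisson (shot) plus Gaussian (read)
# noise, then re-apply a gamma curve. All constants are illustrative.
import numpy as np

def degrade_low_light(srgb, light_scale=0.05, read_sigma=0.002,
                      full_well=1000.0, gamma=2.2):
    linear = np.clip(srgb, 0, 1) ** gamma                  # approx. linearize
    dim = linear * light_scale                             # illumination drop
    shot = np.random.poisson(dim * full_well) / full_well  # signal-dependent noise
    noisy = shot + np.random.normal(0.0, read_sigma, dim.shape)
    return np.clip(noisy, 0, 1) ** (1.0 / gamma)           # ISP-style render

dark = degrade_low_light(np.random.rand(224, 224, 3))
```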

[215] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Zhenyu Cui, Jiahuan Zhou, Yuxin Peng

Main category: cs.CV

TL;DR: Proposes RFL-ReID: lifelong person re-identification without re-indexing historical gallery images, addressing privacy and cost issues by making new and old model features compatible.

DetailsMotivation: Existing lifelong ReID methods require re-indexing historical gallery images after each model update, which raises privacy concerns and incurs high computational costs for large-scale galleries. This creates feature incompatibility between query features from updated models and gallery features from older models.

Method: Proposes Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates gallery features extracted by old models to perform efficient lifelong ReID in a compatible manner without re-indexing.

Result: The proposed Bi-C2R method achieves leading performance on both the new RFL-ReID task and traditional L-ReID task across multiple benchmarks, validated through theoretical analysis and extensive experiments.

Conclusion: RFL-ReID is a more challenging but practical task than traditional lifelong ReID, and the Bi-C2R framework effectively addresses feature compatibility issues without requiring re-indexing, making lifelong person re-identification more feasible for real-world applications with privacy and cost constraints.

Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as “re-indexing”. However, historical gallery data typically cannot be saved directly due to data privacy issues, and re-indexing large-scale gallery images incurs high costs. As a result, model updates inevitably lead to incompatible retrieval between query features extracted by the updated model and gallery features extracted by the model before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
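
The re-indexing-free mechanics can be sketched as a light transfer head that maps stored old-model gallery features into the new model's space, trained by aligning the two models' features on shared data; the architecture and loss below are simplifying assumptions, not Bi-C2R's exact formulation.

```python
# Compatible-representation sketch: align old-model features to the new
# feature space with a small transfer head, so stored gallery features can be
# mapped once instead of re-extracting every gallery image.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
transfer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(transfer.parameters(), lr=1e-3)

def align_step(old_feats: torch.Tensor, new_feats: torch.Tensor) -> float:
    """Features of the SAME images from the old and the updated model."""
    mapped = F.normalize(transfer(old_feats), dim=1)
    target = F.normalize(new_feats, dim=1)
    loss = 1 - (mapped * target).sum(dim=1).mean()   # cosine alignment
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# After alignment, push stored gallery features through `transfer` once;
# new-model query features can then be matched against them directly.
```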

[216] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai

Main category: cs.CV

TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that integrates flow estimation with geometric reasoning using foundation depth models for accurate tracking and mapping.

DetailsMotivation: Previous flow-based SLAM approaches lack geometric consistency, leading to inaccurate tracking and mapping. The authors aim to bridge flow estimation with geometric reasoning to achieve more accurate and robust performance.

Method: 1) Hybrid Flow Network producing geometry-aware correspondences for consistent depth and pose inference; 2) Bi-Consistent Bundle Adjustment Layer jointly optimizing keyframe pose and depth under multi-view constraints; 3) Reliability-Aware Refinement mechanism dynamically adapting flow updates by distinguishing reliable vs uncertain regions.

Result: FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, runs in real-time at 18 FPS, and demonstrates strong generalization to various scenarios.

Conclusion: The integration of foundation depth models with flow-based SLAM enables geometric consistency, resulting in a robust, accurate, and practical monocular dense SLAM system with strong generalization capabilities.

Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.

[217] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu

Main category: cs.CV

TL;DR: A self-bootstrapping framework transforms visual dubbing from ill-posed inpainting to well-conditioned video editing using Diffusion Transformers to generate ideal training data and train an audio-driven editor for precise lip synchronization.

DetailsMotivation: Audio-driven visual dubbing lacks ideal training data where only lip movements differ while all other visual conditions remain identical. Existing mask-based inpainting methods force models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization.

Method: Proposes a self-bootstrapping framework using Diffusion Transformers: 1) DiT as data generator synthesizes ideal training data (lip-altered companion videos for each real sample), 2) DiT-based audio-driven editor trained end-to-end on these aligned video pairs, 3) Timestep-adaptive multi-phase learning strategy to disentangle conflicting editing objectives across diffusion timesteps, 4) ContextDubBench benchmark dataset for robust evaluation.

Result: The method achieves highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios by leveraging complete frame-aligned input conditioning that provides rich visual context including identity cues, scene interactions, and continuous spatiotemporal dynamics.

Conclusion: Reframing visual dubbing from ill-posed inpainting to well-conditioned video-to-video editing with a self-bootstrapping framework enables precise audio-driven lip modifications while preserving visual fidelity and identity, overcoming fundamental limitations of existing approaches.

Abstract: Audio-driven visual dubbing aims to synchronize a video’s lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject’s lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

[218] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang

Main category: cs.CV

TL;DR: SpaceTimePilot is a video diffusion model that independently controls camera viewpoint and motion sequence for continuous space-time exploration from monocular videos.

DetailsMotivation: Current video generation models lack explicit disentanglement of spatial (camera viewpoint) and temporal (motion sequence) controls, limiting flexible scene exploration across both dimensions.

Method: Introduces animation time-embedding for motion control, temporal-warping training using multi-view datasets to simulate temporal differences, improved camera-conditioning, and CamxTime dataset for full space-time coverage.

Result: Demonstrates clear space-time disentanglement on real-world and synthetic data, achieving strong results compared to prior work with precise temporal control and robust space-time separation.

Conclusion: SpaceTimePilot successfully achieves independent control over camera viewpoint and motion sequence, enabling continuous exploration of dynamic scenes across both spatial and temporal dimensions through novel training schemes and dataset creation.

Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
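
The animation time-embedding at the heart of the disentanglement can be pictured as an ordinary sinusoidal encoding of the target motion time, injected into the conditioning stream separately from the diffusion timestep; the dimensions and injection point below are assumptions.

```python
# Sinusoidal animation-time embedding sketch: encodes the target motion time
# (distinct from the diffusion timestep) for injection into the conditioning.
import torch

def animation_time_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """t: (B,) animation times in [0, 1] -> (B, dim) sinusoidal features."""
    freqs = torch.exp(torch.linspace(0.0, 8.0, dim // 2))  # geometric spread
    angles = t[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

emb = animation_time_embedding(torch.rand(4))  # added to the conditioning stream
```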

[219] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Dian Shao, Mingfei Shi, Like Liu

Main category: cs.CV

TL;DR: FineTec: A unified framework for fine-grained action recognition from temporally corrupted skeleton sequences using context-aware completion, spatial decomposition, physics-driven acceleration estimation, and GCN-based recognition.

DetailsMotivation: Real-world skeleton sequences often suffer from substantial missing data due to online pose estimation errors, making fine-grained action recognition challenging. Existing methods struggle to recover temporal dynamics and preserve subtle spatial structures needed to distinguish similar actions.

Method: 1) Context-aware completion with diverse temporal masking to restore base skeleton sequences; 2) Spatial decomposition into five semantic regions, further divided into dynamic/static subgroups with targeted perturbation for augmentation; 3) Physics-driven estimation using Lagrangian dynamics to compute joint accelerations; 4) GCN-based recognition head processing both fused skeleton positions and accelerations.

Result: Outperforms previous methods on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks under various corruption levels. Achieves 89.1% and 78.1% top-1 accuracy on Gym99-severe and Gym288-severe settings.

Conclusion: FineTec effectively addresses fine-grained action recognition under temporal corruption by integrating context-aware completion, semantic spatial decomposition, physics-driven motion estimation, and multi-modal fusion, demonstrating robustness and generalizability across different datasets and corruption levels.

Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
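
At its simplest, the physics-driven quantity FineTec feeds to its recognition head can be approximated with finite differences over the completed joint trajectories; the paper derives accelerations through Lagrangian dynamics, so the numerical form below is only a stand-in.

```python
# Finite-difference stand-in for joint accelerations from a skeleton sequence.
# FineTec derives these via Lagrangian dynamics; this is the simplest proxy.
import numpy as np

def joint_accelerations(positions: np.ndarray, dt: float = 1 / 30) -> np.ndarray:
    """positions: (T, J, 3) joint coordinates -> (T-2, J, 3) accelerations."""
    return (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2

accel = joint_accelerations(np.random.rand(64, 25, 3))  # e.g., 25-joint NTU skeleton
```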

[220] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images, eliminating per-scene optimization and pose estimation requirements.

DetailsMotivation: Prior 3D editing methods require per-scene optimization and pose estimation, which are computationally expensive and slow. There's a need for fast, feed-forward approaches that can handle unposed, view-inconsistent edited images without optimization overhead.

Method: Edit3r uses: (1) SAM2-based recoloring strategy to generate cross-view-consistent supervision for training, (2) asymmetric input strategy pairing recolored reference views with raw auxiliary views to encourage fusion and alignment of disparate observations, and (3) a feed-forward architecture that directly predicts instruction-aligned 3D edits.

Result: Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, operates at significantly higher inference speed, and can handle images edited by 2D methods like InstructPix2Pix despite not being trained on such edits. The paper also introduces DL3DV-Edit-Bench benchmark with 20 scenes, 4 edit types, and 100 edits.

Conclusion: Edit3r enables fast, photorealistic 3D scene editing without optimization or pose estimation, making it promising for real-time 3D editing applications. The approach effectively addresses the challenge of training without multi-view consistent edited supervision.

Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

[221] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu

Main category: cs.CV

TL;DR: GaMO is a geometry-aware multi-view outpainting framework that addresses sparse-view 3D reconstruction by expanding field of view from existing camera poses instead of generating new viewpoints, achieving better geometric consistency and 25× speedup over SOTA diffusion methods.

DetailsMotivation: Current diffusion-based methods for sparse-view 3D reconstruction have three critical limitations: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines.

Method: GaMO reformulates sparse-view reconstruction through multi-view outpainting, expanding field of view from existing camera poses rather than generating new viewpoints. It employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training.

Result: Extensive experiments on Replica and ScanNet++ show state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS metrics, while achieving 25× speedup over SOTA diffusion-based methods with processing time under 10 minutes.

Conclusion: GaMO provides an effective solution to sparse-view 3D reconstruction by preserving geometric consistency through multi-view outpainting, offering superior performance and significant computational efficiency improvements over existing approaches.

Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a 25× speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/

[222] Matching Semantically Similar Non-Identical Objects

Yusuke Marumo, Kazuhiko Kawamoto, Satomi Tanaka, Shigenobu Hirano, Hiroshi Kera

Main category: cs.CV

TL;DR: A novel pixel-level matching method for non-identical but semantically similar objects using Semantic Enhancement Weighting (SEW) that incorporates object detector semantic information into sparse feature matching.

DetailsMotivation: Real-world objects are often similar but not identical (e.g., different dog breeds, car models, flower colors). Existing feature matching methods focus on identical objects from different viewpoints, but there's a need to match semantically similar objects at the pixel level.

Method: Proposes Semantic Enhancement Weighting (SEW) that integrates semantic information from object detectors into existing sparse feature matching methods. This extends matching capabilities from identical objects to semantically similar ones by weighting descriptors based on semantic relevance.
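
As a rough illustration of the idea (not the paper's exact formulation), the sketch below re-weights descriptor similarities by the agreement of detector class distributions at each keypoint; all function and variable names are hypothetical.

```python
import numpy as np

def semantic_enhancement_weights(desc_sim, cls_probs_a, cls_probs_b):
    """Hypothetical sketch: re-weight descriptor similarities by the semantic
    agreement of detector class distributions at each keypoint.

    desc_sim:    (N, M) descriptor cosine similarities between images A and B
    cls_probs_a: (N, C) detector class probabilities at keypoints in A
    cls_probs_b: (M, C) detector class probabilities at keypoints in B
    """
    # Semantic agreement = dot product of per-keypoint class distributions.
    sem_sim = cls_probs_a @ cls_probs_b.T          # (N, M)
    return desc_sim * sem_sim                      # semantically weighted scores

# Toy example: 3 keypoints in A, 2 in B, 4 object classes.
rng = np.random.default_rng(0)
desc_sim = rng.random((3, 2))
pa = rng.dirichlet(np.ones(4), size=3)
pb = rng.dirichlet(np.ones(4), size=2)
weighted = semantic_enhancement_weights(desc_sim, pa, pb)
print(weighted.shape, weighted.argmax(axis=1))     # nearest match per keypoint in A
```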

Result: Successful matching between non-identical objects across various scenarios: in-class design variations, class discrepancies, and domain shifts (photo vs. drawing, image corruptions). The method works effectively where traditional approaches fail.

Conclusion: SEW enables robust pixel-level matching of semantically similar but non-identical objects, addressing real-world scenarios where objects vary within categories or across domains. The approach extends the applicability of feature matching beyond identical object correspondence.

Abstract: Not identical but similar objects are ubiquitous in our world, ranging from four-legged animals such as dogs and cats to cars of different models and flowers of various colors. This study addresses a novel task of matching such non-identical objects at the pixel level. We propose a weighting scheme of descriptors, Semantic Enhancement Weighting (SEW), that incorporates semantic information from object detectors into existing sparse feature matching methods, extending their targets from identical objects captured from different perspectives to semantically similar objects. The experiments show successful matching between non-identical objects in various cases, including in-class design variations, class discrepancy, and domain shifts (e.g., photo vs. drawing and image corruptions). The code is available at https://github.com/Circ-Leaf/NIOM .

[223] Reconstructing Hand-Held Objects in 3D from Images and Videos

Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik

Main category: cs.CV

TL;DR: A method for reconstructing hand-held objects from monocular videos using 3D hand estimation and retrieval-augmented reconstruction with generative models.

DetailsMotivation: Hand-held objects are challenging to reconstruct from videos due to hand occlusion and small pixel visibility, but 3D hand estimation and the limited set of manipulanda provide strong anchors for reconstruction.

Method: Two-stage approach: (1) MCC-Hand-Object (MCC-HO) for single-frame hand and object reconstruction using RGB images and inferred 3D hands, (2) Retrieval-Augmented Reconstruction (RAR) using GPT-4(V) to retrieve matching 3D object models from a text-to-3D generative model, then rigid alignment across frames.

Result: Achieves state-of-the-art performance on both lab and Internet image/video datasets for hand-held object reconstruction.

Conclusion: The approach successfully leverages 3D hand estimation and generative model retrieval to overcome challenges in hand-held object reconstruction, providing temporally consistent 3D geometry from monocular videos.

Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho

[224] ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Samar Khanna, Medhanie Irgau, David B. Lobell, Stefano Ermon

Main category: cs.CV

TL;DR: ExPLoRA extends parameter-efficient fine-tuning (PEFT) to self-supervised pre-training, enabling efficient domain adaptation of vision transformers with minimal parameter updates.

DetailsMotivation: Current PEFT methods focus on supervised fine-tuning, but there's an unexplored question: can we efficiently adapt pre-trained models to new domains via self-supervised pre-training without labels?

Method: Initialize ViT with pre-trained weights, continue unsupervised pre-training on new domain by unfreezing 1-2 blocks and tuning all other layers with LoRA, then fine-tune with LoRA for supervised learning.
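
A minimal PyTorch sketch of this freezing scheme, assuming a timm-style ViT layout with an `attn.qkv` linear per block; the `LoRALinear` wrapper and the toy `Block` are illustrative stand-ins, not the authors' code.

```python
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha/r) * B(A x), with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # frozen pre-trained weight
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)               # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

def prepare_explora_style(vit_blocks: nn.ModuleList, unfrozen: int = 2):
    """Unfreeze the last `unfrozen` blocks; add LoRA to the qkv of the rest."""
    for i, blk in enumerate(vit_blocks):
        if i >= len(vit_blocks) - unfrozen:
            for p in blk.parameters():
                p.requires_grad = True              # fully trainable block
        else:
            for p in blk.parameters():
                p.requires_grad = False
            blk.attn.qkv = LoRALinear(blk.attn.qkv) # LoRA-only updates

# Toy demo on attention-like blocks (timm-style .attn.qkv layout assumed).
class Block(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.attn = nn.Module()
        self.attn.qkv = nn.Linear(d, 3 * d)
    def forward(self, x):
        return x + self.attn.qkv(x)[..., :x.shape[-1]]

blocks = nn.ModuleList([Block() for _ in range(4)])
prepare_explora_style(blocks, unfrozen=2)
trainable = sum(p.numel() for b in blocks for p in b.parameters() if p.requires_grad)
print("trainable params:", trainable)
```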

Result: Achieves state-of-the-art results on satellite imagery, outperforming full pre-training/fine-tuning. Shows up to 8% improvement in linear probing accuracy while using <10% of parameters compared to fully-tuned approaches.

Conclusion: ExPLoRA effectively bridges the gap between efficient fine-tuning and domain adaptation, demonstrating that parameter-efficient self-supervised pre-training can significantly improve transfer learning under domain shifts.

Abstract: Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines such as PEFT. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/

[225] MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Wenqian Ye, Bohan Liu, Guangtao Zheng, Di Wang, Yunsheng Ma, Xu Cao, Bolin Lai, James M. Rehg, Aidong Zhang

Main category: cs.CV

TL;DR: This paper introduces MM-SpuBench, a benchmark for evaluating spurious biases in Multimodal Large Language Models (MLLMs), revealing their persistent reliance on spurious correlations despite strong vision-language capabilities.

DetailsMotivation: Spurious biases (exploiting superficial correlations between input attributes and targets) are a known robustness issue in classical ML, but their presence and severity in Multimodal Large Language Models (MLLMs) remain poorly understood despite MLLMs' strong vision-language capabilities.

Method: The authors create MM-SpuBench, a comprehensive human-verified benchmark with image-class pairs annotated with core and spurious attributes, based on a taxonomy of nine spurious correlation types. They evaluate state-of-the-art MLLMs using both standard accuracy and a proposed Conditional Generation Likelihood Advantage (CGLA) metric.

Result: The evaluation reveals persistent reliance on spurious correlations in MLLMs, showing that mitigation is difficult even on their carefully constructed benchmark. Both open-source and proprietary models exhibit these biases.

Conclusion: Spurious biases remain a significant challenge in MLLMs, and the proposed MM-SpuBench benchmark can inspire new technical approaches to mitigate these biases. The benchmark is publicly available for further research.

Abstract: Spurious bias, a tendency to exploit spurious correlations between superficial input attributes and prediction targets, has revealed a severe robustness pitfall in classical machine learning problems. Multimodal Large Language Models (MLLMs), which leverage pretrained vision and language models, have recently demonstrated strong capability in joint vision-language understanding. However, both the presence and severity of spurious biases in MLLMs remain poorly understood. In this work, we address this gap by analyzing the spurious biases in the multimodal setting and uncovering the specific inference-time data patterns that can manifest this problem. To support this analysis, we introduce MM-SpuBench, a comprehensive, human-verified benchmark dataset consisting of image-class pairs annotated with core and spurious attributes, grounded in our taxonomy of nine distinct types of spurious correlations. The benchmark is constructed using human-interpretable attribute information to capture a wide range of spurious patterns reflective of real-world knowledge. Leveraging this benchmark, we conduct a comprehensive evaluation of the state-of-the-art open-source and proprietary MLLMs with both standard accuracy and the proposed Conditional Generation Likelihood Advantage (CGLA). Our findings highlight the persistence of reliance on spurious correlations and the difficulty of mitigation on our benchmark. We hope this work can inspire new technical strides to mitigate these biases. Our benchmark is publicly available at https://huggingface.co/datasets/mmbench/MM-SpuBench.

[226] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Chang-Han Yeh, Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Ting-Hsuan Chen, Yu-Lun Liu

Main category: cs.CV

TL;DR: DiffIR2VR-Zero enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training, solving temporal inconsistency issues through hierarchical latent warping and hybrid token merging.

DetailsMotivation: Image diffusion models have strong restoration capabilities but cause temporal inconsistencies when applied to videos, and existing video restoration methods require extensive retraining for different degradation types.

Method: Two key innovations: 1) hierarchical latent warping strategy for consistency across keyframes and local frames, 2) hybrid token merging mechanism that adaptively combines optical flow and feature matching.

Result: Achieves superior temporal consistency across diverse datasets and degradation conditions (including 8× super-resolution and severe noise) while maintaining high-quality restoration of base diffusion models.

Conclusion: Provides a versatile zero-shot framework that works with any image restoration diffusion model for video enhancement without task-specific training or modifications.

Abstract: We present DiffIR2VR-Zero, a zero-shot framework that enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training. While image diffusion models have shown remarkable restoration capabilities, their direct application to video leads to temporal inconsistencies, and existing video restoration methods require extensive retraining for different degradation types. Our approach addresses these challenges through two key innovations: a hierarchical latent warping strategy that maintains consistency across both keyframes and local frames, and a hybrid token merging mechanism that adaptively combines optical flow and feature matching. Through extensive experiments, we demonstrate that our method not only maintains the high-quality restoration of base diffusion models but also achieves superior temporal consistency across diverse datasets and degradation conditions, including challenging scenarios like 8× super-resolution and severe noise. Importantly, our framework works with any image restoration diffusion model, providing a versatile solution for video enhancement without task-specific training or modifications. Project page: https://jimmycv07.github.io/DiffIR2VR_web/

[227] Explaining Object Detectors via Collective Contribution of Pixels

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Main category: cs.CV

TL;DR: Proposes a game-theoretic method using Shapley values and interactions to explain object detectors by capturing both individual and collective pixel contributions for bounding box localization and class determination.

DetailsMotivation: Existing visual explanation methods for object detectors focus only on individual pixel contributions, neglecting collective influences of multiple pixels, which can lead to missing compositional cues or capturing spurious correlations.

Method: Game-theoretic approach based on Shapley values and interactions that explicitly captures both individual and collective pixel contributions, providing explanations for both bounding box localization and class determination.
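
For intuition, here is a brute-force exact Shapley computation over a handful of image regions with a toy value function; real detectors require sampling-based approximations, and the paper's interaction terms are not shown.

```python
import itertools, math
import numpy as np

def exact_shapley(regions, value_fn):
    """Exact Shapley values over a small set of image regions.

    value_fn(S) -> detector score (e.g., box confidence) when only the
    regions in S are visible and everything else is masked out.
    """
    n = len(regions)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                # Standard Shapley weight for a coalition of size k.
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[i] += w * (value_fn(S + (i,)) - value_fn(S))
    return phi

# Toy value function: regions 0 and 1 only matter together (an interaction).
def value_fn(S):
    S = set(S)
    return 1.0 if {0, 1} <= S else 0.2 * len(S)

print(exact_shapley(range(3), value_fn))  # regions 0 and 1 share the interaction credit
```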

Result: Extensive experiments show the method identifies important regions more accurately than state-of-the-art methods, with code publicly available.

Conclusion: The proposed method addresses limitations of existing approaches by considering collective pixel contributions, providing more accurate visual explanations for object detectors that highlight crucial regions for detection.

Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE

[228] INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Inst-IT enhances Large Multimodal Models’ instance-level understanding through explicit visual prompt instruction tuning, improving both fine-grained instance comprehension and general multimodal capabilities.

DetailsMotivation: Current LMMs excel at holistic image/video understanding but struggle with fine-grained instance-level comprehension, which is crucial for focusing on specific elements of interest. Existing models show strong instance understanding when given explicit visual cues, suggesting this approach can be leveraged.

Method: Inst-IT includes: 1) a benchmark to diagnose multimodal instance-level understanding, 2) a large-scale instruction-tuning dataset, and 3) a continuous instruction-tuning training paradigm using explicit visual prompts for instance guidance.

Result: Models enhanced by Inst-IT achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, while also showing significant improvements across various generic image and video understanding benchmarks.

Conclusion: The method not only boosts instance-level understanding but also strengthens overall generic image and video comprehension capabilities, demonstrating that explicit visual prompt instruction tuning effectively enhances both fine-grained and general multimodal understanding.

Abstract: Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding that requires a more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that the SOTA LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning for instance guidance. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, enhanced by Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our method not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.

[229] Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Bohan Li, Jiajun Deng, Yasheng Sun, Xiaofeng Wang, Xin Jin, Wenjun Zeng

Main category: cs.CV

TL;DR: Hi-SOP introduces hierarchical context alignment for 3D semantic occupancy prediction, addressing feature misalignment issues through disentangled geometric and temporal alignment branches.

DetailsMotivation: Existing SOP methods suffer from misalignment issues where corresponding features at the same position across different frames may have different semantic meanings during aggregation, leading to unreliable contextual fusion and unstable representation learning.

Method: Hi-SOP uses hierarchical context alignment with: (I) disentangled geometric and temporal separate alignment using depth confidence and camera pose as priors, and (II) global alignment and composition of transformed geometric and temporal volumes based on semantic consistency.

Result: Outperforms state-of-the-art methods for semantic scene completion on SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset.

Conclusion: The hierarchical context alignment paradigm effectively addresses feature misalignment in 3D semantic occupancy prediction, leading to more accurate and reliable scene understanding from 2D camera observations.

Abstract: Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment, whose two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled geometric and temporal separate alignment, each of which leverages depth confidence and camera pose, respectively, as priors for relevant feature matching; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.

[230] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han

Main category: cs.CV

TL;DR: OnlineVPO introduces a video-specific preference learning framework for video diffusion models using video quality assessment models as feedback and an online DPO algorithm for scalable optimization.

DetailsMotivation: Current video diffusion models suffer from degraded image quality and flickering artifacts. Existing preference learning methods adopt image-domain routines without proper investigation into video-specific optimization, leading to modality gaps and inefficient feedback.

Method: 1) Uses video quality assessment (VQA) models as human-aligned feedback sources instead of image-level reward models; 2) Introduces an online DPO algorithm with online preference generation and curriculum preference update designs for scalable optimization of higher-resolution, longer-duration videos.
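
For reference, the sketch below shows the standard DPO objective on preference pairs; the paper's online variant additionally generates preferences on-policy and schedules them with a curriculum, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (preferred, rejected) sample pairs.

    logp_*     : policy log-likelihoods of the preferred/rejected videos
    ref_logp_* : same quantities under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy example with a batch of 4 preference pairs.
lw, ll = torch.randn(4), torch.randn(4)
rw, rl = torch.randn(4), torch.randn(4)
print(dpo_loss(lw, ll, rw, rl))
```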

Result: Extensive experiments demonstrate OnlineVPO as a simple yet effective and scalable preference learning algorithm for video diffusion models, addressing the modality gap and insufficient optimization issues of existing methods.

Conclusion: OnlineVPO provides a tailored preference learning framework for video diffusion models that better aligns with human perception through VQA-based feedback and enables scalable optimization via online DPO, overcoming limitations of existing image-domain approaches.

Abstract: Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite their success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches have introduced preference learning to exploit human feedback to enhance the video generation. However, these methods primarily adopt the routine in the image domain without an in-depth investigation into video-specific preference optimization. In this paper, we reexamine the design of the video preference learning from two key aspects: feedback source and feedback tuning methodology, and present OnlineVPO, a more efficient preference learning framework tailored specifically for VDMs. On the feedback source, we found that the image-level reward model commonly used in existing methods fails to provide a human-aligned video preference signal due to the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy of humans to provide more modality-aligned feedback for VDMs. Regarding the preference tuning methodology, we introduce an online DPO algorithm tailored for VDMs. It not only enjoys the benefits of superior scalability in optimizing videos with higher resolution and longer duration compared with the existing method, but also mitigates the insufficient optimization issue caused by off-policy learning via online preference generation and curriculum preference update designs. Extensive experiments on the open-source video-diffusion model demonstrate OnlineVPO as a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.

[231] EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, Nan Cao

Main category: cs.CV

TL;DR: EmotiCrafter: A model for continuous emotional image generation using text prompts and Valence-Arousal values to capture complex emotional nuances.

DetailsMotivation: Existing emotional image generation methods rely on discrete emotion categories, which fail to capture complex and subtle emotional nuances. They also struggle to control specific image content based on text prompts.

Method: Proposes EmotiCrafter with a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, and introduces a loss function to enhance emotion expression.
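
One plausible reading of the mapping network, sketched under the assumption that Valence-Arousal values produce an additive offset on CLIP-style prompt features; the architecture shown is hypothetical, not the paper's.

```python
import torch, torch.nn as nn

class VAEmbedMapper(nn.Module):
    """Hypothetical sketch: map (valence, arousal) in [-1, 1]^2 to an additive
    offset on token-level text features, as one way to inject a continuous
    emotion signal into a text-conditioned generator."""
    def __init__(self, text_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.SiLU(),
            nn.Linear(hidden, text_dim),
        )

    def forward(self, text_feats, va):
        # text_feats: (B, T, D) prompt embeddings; va: (B, 2) valence-arousal
        offset = self.mlp(va).unsqueeze(1)          # (B, 1, D), broadcast over tokens
        return text_feats + offset

mapper = VAEmbedMapper()
feats = torch.randn(2, 77, 768)                     # CLIP-style prompt features
va = torch.tensor([[0.8, 0.3], [-0.5, 0.9]])        # e.g., happy-calm vs. tense
print(mapper(feats, va).shape)
```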

Result: The method effectively generates images representing specific emotions with desired content and outperforms existing techniques.

Conclusion: Introduces the new task of continuous emotional image content generation (C-EICG) and presents a successful approach using Valence-Arousal values for more nuanced emotional image generation.

Abstract: Recent research shows that emotions can enhance users’ cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.

[232] MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, Ivan Oseledets

Main category: cs.CV

TL;DR: MaxInfo is a training-free frame selection method for Video LLMs that uses maximum volume principle to pick the most informative frames, reducing redundancy and improving video understanding performance without additional training.

DetailsMotivation: Uniform frame sampling in Video LLMs often misses critical information due to frame redundancy and content variations, leading to suboptimal video understanding.

Method: MaxInfo selects frames by maximizing the geometric volume formed by selected embeddings, ensuring coverage of informative regions in embedding space. Comes in Fast/Slow versions and Chunk-based version for long videos.
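
The maximum-volume principle can be illustrated with a greedy selector that grows the set of frames whose embedding Gram determinant is largest; this is a simplified sketch of the idea, not the paper's exact algorithm.

```python
import numpy as np

def max_volume_frames(embeds, k):
    """Greedy maximum-volume frame selection (a sketch of the principle).

    Picks k frame embeddings whose Gram-matrix determinant (squared
    parallelepiped volume) is greedily maximized, favoring frames that are
    informative and mutually non-redundant.
    """
    n = embeds.shape[0]
    chosen = [int(np.linalg.norm(embeds, axis=1).argmax())]  # longest vector first
    while len(chosen) < k:
        best, best_vol = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            sub = embeds[chosen + [i]]
            vol = np.linalg.det(sub @ sub.T)        # squared volume of the span
            if vol > best_vol:
                best, best_vol = i, vol
        chosen.append(best)
    return sorted(chosen)

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 32))                 # 100 frame embeddings, dim 32
print(max_volume_frames(frames, k=8))
```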

Result: Achieves 3.28% improvement on LongVideoBench and 6.4% improvement on EgoSchema for LLaVA-Video-7B. Also boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5.

Conclusion: MaxInfo provides a simple, training-free, low-latency alternative to uniform sampling that enhances video comprehension across benchmarks and works with existing VLLMs.

Abstract: Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, the first training-free method based on the maximum volume principle, which is available in Fast and Slow versions and a Chunk-based version that selects and retains the most representative frames from a video. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. Moreover, MaxInfo boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5. The approach is simple to implement, works with existing VLLMs without the need for additional training, and adds very little latency, making it a practical and effective alternative to traditional uniform sampling methods. Our code is available at https://github.com/FusionBrainLab/MaxInfo.git

[233] An Empirical Study of Methods for Small Object Detection from Satellite Imagery

Xiaohui Yuan, Aniv Chakravarty, Lichuan Gu, Zhenchun Wei, Elinor Lichtenberg, Tian Chen

Main category: cs.CV

TL;DR: Review and empirical evaluation of object detection methods for small objects in remote sensing imagery, focusing on car detection in urban areas and bee box detection in agricultural lands.

DetailsMotivation: To understand the performance and technical challenges of state-of-the-art object detection methods specifically for small objects in remote sensing imagery, which is important for applications like urban monitoring and agricultural management.

Method: 1. Review existing object detection methods for small objects in remote sensing imagery. 2. Select four state-of-the-art methods based on existing surveys and literature. 3. Conduct empirical evaluation using public, high-resolution satellite image datasets. 4. Focus on two application scenarios: car detection from urban satellite images and bee box detection from agricultural satellite images.

Result: The paper provides empirical evaluation results of four state-of-the-art object detection methods on small object detection tasks in remote sensing imagery, though specific performance metrics are not detailed in the abstract.

Conclusion: The study offers insights into method performance and technical challenges for small object detection in remote sensing imagery, providing practical guidance for researchers and practitioners working on applications like urban monitoring and agricultural management using satellite imagery.

Abstract: This paper reviews object detection methods for finding small objects from remote sensing imagery and provides an empirical evaluation of four state-of-the-art methods to gain insights into method performance and technical challenges. In particular, we use car detection from urban satellite images and bee box detection from satellite images of agricultural lands as application scenarios. Drawing from the existing surveys and literature, we identify several top-performing methods for the empirical study. Public, high-resolution satellite image datasets are used in our experiments.

[234] Daily Land Surface Temperature Reconstruction in Landsat Cross-Track Areas Using Deep Ensemble Learning With Uncertainty Quantification

Shengjie Liu, Siqin Wang, Lu Zhang

Main category: cs.CV

TL;DR: DELAG is a deep ensemble learning method that reconstructs high-resolution Landsat land surface temperature (LST) in complex urban areas by integrating annual temperature cycles and Gaussian processes, achieving superior performance under both clear-sky and cloudy conditions.

DetailsMotivation: Land surface temperature data at high spatiotemporal resolution is crucial for urban applications, but Landsat's long revisit time and cloud cover limitations disrupt data collection, especially in complex urban areas where LST exhibits significant spatial variations.

Method: DELAG combines deep ensemble learning with annual temperature cycles and Gaussian processes to reconstruct Landsat LST. It leverages Landsat’s cross-track characteristics and dual-satellite operation (since 2021) to enhance data availability to 4 scenes every 16 days.
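
A single-pixel toy version of the ATC-plus-GP decomposition (a least-squares annual temperature cycle, then a scikit-learn GP on the residuals); the actual method wraps this in a deep ensemble over many pixels.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Annual temperature cycle (ATC): T(d) ~ a0 + a1*sin(2*pi*d/365) + a2*cos(2*pi*d/365),
# fit by least squares on clear-sky observations; a GP then models the residuals.
days = np.sort(np.random.default_rng(0).choice(365, size=40, replace=False))
obs = 290 + 10 * np.sin(2 * np.pi * (days - 120) / 365) \
      + np.random.default_rng(1).normal(0, 1, 40)   # synthetic LST samples (K)

X = np.column_stack([np.ones_like(days, dtype=float),
                     np.sin(2 * np.pi * days / 365),
                     np.cos(2 * np.pi * days / 365)])
coef, *_ = np.linalg.lstsq(X, obs, rcond=None)
residuals = obs - X @ coef

gp = GaussianProcessRegressor(kernel=RBF(length_scale=30.0), alpha=1.0)
gp.fit(days[:, None].astype(float), residuals)

# Reconstruct daily LST (with uncertainty) for the whole year.
all_days = np.arange(365)
Xa = np.column_stack([np.ones(365), np.sin(2 * np.pi * all_days / 365),
                      np.cos(2 * np.pi * all_days / 365)])
res_mean, res_std = gp.predict(all_days[:, None].astype(float), return_std=True)
lst = Xa @ coef + res_mean                          # reconstructed daily LST
print(lst.shape, res_std.mean())                    # uncertainty accompanies the estimate
```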

Result: DELAG successfully reconstructed LST in New York City, London, and Hong Kong with RMSE of 0.73-0.96 K under clear-sky and 0.84-1.62 K under heavily-cloudy conditions, outperforming existing methods. It also provides uncertainty quantification and enables accurate near-surface air temperature estimation (RMSE = 1.48-2.11 K).

Conclusion: DELAG provides a novel and practical method for Landsat LST reconstruction in complex urban areas, particularly within Landsat cross-track areas, advancing high spatiotemporal resolution climate monitoring and enabling broader applications like air temperature estimation.

Abstract: Many real-world applications rely on land surface temperature (LST) data at high spatiotemporal resolution. In complex urban areas, LST exhibits significant variations, fluctuating dramatically within and across city blocks. Landsat provides high spatial resolution data at 100 meters but is limited by long revisit time, with cloud cover further disrupting data collection. Here, we propose DELAG, a deep ensemble learning method that integrates annual temperature cycles and Gaussian processes, to reconstruct Landsat LST in complex urban areas. Leveraging the cross-track characteristics and dual-satellite operation of Landsat since 2021, we further enhance data availability to 4 scenes every 16 days. We select New York City, London and Hong Kong from three different continents as study areas. Experiments show that DELAG successfully reconstructed LST in the three cities under clear-sky (RMSE = 0.73-0.96 K) and heavily-cloudy (RMSE = 0.84-1.62 K) situations, superior to existing methods. Additionally, DELAG can quantify uncertainty that enhances LST reconstruction reliability. We further tested the reconstructed LST to estimate near-surface air temperature, achieving results (RMSE = 1.48-2.11 K) comparable to those derived from clear-sky LST (RMSE = 1.63-2.02 K). The results demonstrate the successful reconstruction through DELAG and highlight the broader applications of LST reconstruction for estimating accurate air temperature. Our study thus provides a novel and practical method for Landsat LST reconstruction, particularly suited for complex urban areas within Landsat cross-track areas, taking one step toward addressing complex climate events at high spatiotemporal resolution. Code and data are available at https://skrisliu.com/delag

[235] SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition

Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang

Main category: cs.CV

TL;DR: SciceVPR improves Visual Place Recognition by better utilizing DINOv2 features through multi-layer fusion and stable cross-image correlation distillation, achieving state-of-the-art performance with single-stage efficiency.

DetailsMotivation: Current VPR methods using DINOv2 only utilize its final output and suffer from unstable cross-image correlation, leading to inconsistent retrieval results. There's a need for more discriminative and stable global descriptors that work well across domain shifts.

Method: SciceVPR uses: 1) Multi-layer feature fusion to capture detailed channel and spatial information from DINOv2’s multiple layers, and 2) Self-enhanced encoder that distills invariant cross-image correlations within batches to learn robust features resilient to domain shifts.

Result: SciceVPR-B outperforms SOTA one-stage methods on multiple datasets. SciceVPR-L matches SOTA two-stage models, achieving over 3% higher Recall@1 on the challenging Tokyo24/7 dataset.

Conclusion: By fully exploiting DINOv2’s multi-layer features and learning stable cross-image correlations, SciceVPR produces discriminative and consistent global descriptors that handle domain shifts effectively while maintaining computational efficiency.

Abstract: Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce both discriminative and constant global descriptors, this paper proposes stable cross-image correlation enhanced model for VPR called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. Specifically, SciceVPR first uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Secondly, SciceVPR considers the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 compared to existing models on the challenging Tokyo24/7 dataset. Our code will be released at https://github.com/shuimushan/SciceVPR.

[236] Simple Self Organizing Map with Visual Transformer

Alan Luo, Kaiwen Yuan

Main category: cs.CV

TL;DR: ViTs underperform on small datasets due to lack of inductive biases. This paper explores synergies between Vision Transformers and Self-Organizing Maps to address this limitation, showing they can enhance each other’s performance in both unsupervised and supervised tasks.

DetailsMotivation: Vision Transformers lack inductive biases that cause them to underperform on smaller datasets. Current solutions use indirect methods like pretext tasks or CNN knowledge distillation. Self-Organizing Maps inherently preserve topology and spatial organization, making them promising for directly addressing ViT limitations, but their integration with modern deep learning architectures remains unexplored.

Method: The paper conducts a novel exploration of how Vision Transformers and Self-Organizing Maps can empower each other. It bridges the research gap by investigating synergistic integration between these architectures, though specific technical details of the integration approach are not provided in the abstract.
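
Since the abstract omits integration details, the sketch below only shows the classic SOM update that provides the topology-preserving prior, here run on random stand-ins for ViT token embeddings.

```python
import numpy as np

def train_som(data, grid=(8, 8), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Classic SOM training loop: pull the best-matching unit and its grid
    neighborhood toward each sample, with decaying rate and radius."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h, w, data.shape[1]))
    ys, xs = np.mgrid[0:h, 0:w]
    for t in range(iters):
        x = data[rng.integers(len(data))]
        d = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(d.argmin(), d.shape)   # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 1e-3
        nb = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
        weights += lr * nb[..., None] * (x - weights)    # neighborhood update
    return weights

# Toy run on random "patch embeddings" (stand-ins for ViT token features).
emb = np.random.default_rng(1).normal(size=(500, 16))
som = train_som(emb)
print(som.shape)                                         # (8, 8, 16) topology-preserving map
```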

Result: The findings demonstrate that Vision Transformers and Self-Organizing Maps can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code is publicly available on GitHub.

Conclusion: The study successfully bridges the research gap by showing that Vision Transformers and Self-Organizing Maps can mutually enhance each other, offering a promising direction for improving ViT performance on smaller datasets through direct integration with topology-preserving SOM architectures.

Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs on limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration on how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code is publicly available on GitHub.

[237] Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild

S M A Sharif, Abdur Rehman, Zain Ul Abidin, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi

Main category: cs.CV

TL;DR: The paper introduces LSD, a large-scale 4K+ low-light dataset with 6,425 aligned low/normal-light pairs, and TFFormer, a hybrid model with separate luminance-chrominance encoding that achieves SOTA performance on low-light enhancement and downstream tasks.

DetailsMotivation: Single-shot low-light image enhancement (SLLIE) faces challenges due to limited availability of diverse, real-world paired datasets. Existing datasets lack scale, resolution, and real-world diversity needed for robust model training and evaluation.

Method: 1) Created LSD dataset: 6,425 aligned low/normal-light pairs (4K+) from 8,000+ dynamic scenes across 0.1-200 lux range, plus 2,117 unpaired images for generalization testing. 2) Proposed TFFormer: hybrid model with separate luminance-chrominance encoding to reduce color-structure entanglement, cross-attention-driven joint decoder for context-aware fusion, LC refinement, and LC-guided supervision.
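
The luminance-chrominance separation can be pictured as a BT.601-style YCbCr split feeding two encoder branches; this is a minimal sketch of the decomposition only, with the network itself omitted.

```python
import numpy as np

def split_luminance_chrominance(rgb):
    """BT.601 RGB -> (Y, CbCr) split: the kind of decomposition that lets
    luminance and chrominance be encoded by separate branches."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b                  # luminance
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5       # blue-difference chroma
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 0.5        # red-difference chroma
    return y[..., None], np.stack([cb, cr], axis=-1)

img = np.random.default_rng(0).random((256, 256, 3))        # toy low-light image in [0, 1]
luma, chroma = split_luminance_chrominance(img)
print(luma.shape, chroma.shape)                             # (256,256,1) (256,256,2)
```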

Result: TFFormer achieves state-of-the-art results on LSD (+2.45 dB PSNR improvement) and substantially improves downstream vision tasks, such as low-light object detection (+6.80 mAP on ExDark dataset).

Conclusion: The LSD dataset addresses the data scarcity problem in low-light enhancement, and TFFormer’s separate luminance-chrominance encoding with cross-attention fusion effectively enhances perceptual fidelity and structural consistency, demonstrating strong performance on both enhancement quality and downstream applications.

Abstract: Single-shot low-light image enhancement (SLLIE) remains challenging due to the limited availability of diverse, real-world paired datasets. To bridge this gap, we introduce the Low-Light Smartphone Dataset (LSD), a large-scale, high-resolution (4K+) dataset collected in the wild across a wide range of challenging lighting conditions (0.1 to 200 lux). LSD contains 6,425 precisely aligned low and normal-light image pairs, selected from over 8,000 dynamic indoor and outdoor scenes through multi-frame acquisition and expert evaluation. To evaluate generalization and aesthetic quality, we collect 2,117 unpaired low-light images from previously unseen devices. To fully exploit LSD, we propose TFFormer, a hybrid model that encodes luminance and chrominance (LC) separately to reduce color-structure entanglement. We further propose a cross-attention-driven joint decoder for context-aware fusion of LC representations, along with LC refinement and LC-guided supervision to significantly enhance perceptual fidelity and structural consistency. TFFormer achieves state-of-the-art results on LSD (+2.45 dB PSNR) and substantially improves downstream vision tasks, such as low-light object detection (+6.80 mAP on ExDark).

[238] Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour

Main category: cs.CV

TL;DR: Paper introduces embedding-based data heterogeneity for FL beyond label skew, showing it better captures performance degradation in vision tasks.

DetailsMotivation: Existing FL research uses label distribution skew to model data heterogeneity, but this fails to fully capture heterogeneity in computer vision tasks beyond classification.

Method: Use pre-trained deep neural networks to extract task-specific embeddings, define embedding-based heterogeneity, cluster data based on embeddings, and distribute using Dirichlet distribution.
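
A compact sketch of the partitioning recipe as described: cluster pre-extracted embeddings, then apportion each cluster across clients with a Dirichlet draw (the cluster count and alpha below are illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

def dirichlet_partition(embeddings, n_clients=10, n_clusters=5, alpha=0.5, seed=0):
    """Embedding-based non-IID split: cluster task-specific embeddings, then
    give each client a Dirichlet-skewed share of every cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))   # skew across clients
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

emb = np.random.default_rng(1).normal(size=(1000, 64))      # pre-trained embeddings
parts = dirichlet_partition(emb)
print([len(p) for p in parts])                              # uneven per-client shares
```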

Result: Embedding-based heterogeneity leads to up to ~60% increase in observed loss under FedAvg across seven vision tasks, better exposing performance degradation from data heterogeneity.

Conclusion: Proposed embedding-based heterogeneity more accurately captures data heterogeneity in vision tasks, reveals new benchmark measures, and opens research directions for FL.

Abstract: Federated Learning (FL) has emerged as one of the prominent paradigms for distributed machine learning (ML). However, it is well-established that its performance can degrade significantly under non-IID (non-independent and identically distributed) data distributions across clients. To study this effect, the existing works predominantly emulate data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the data heterogeneity in computer vision tasks beyond classification, exposing an overlooked gap in the literature. Motivated by this, by utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. For instance, across seven representative computer vision tasks, our embedding-based heterogeneity formulation leads to up to around 60% increase in the observed loss under FedAvg, indicating that it more accurately exposes the performance degradation caused by data heterogeneity. We further unveil a series of open research directions that can be pursued.

[239] Text-to-Image Models and Their Representation of People from Different Nationalities Engaging in Activities

Abdulkareem Alsudais

Main category: cs.CV

TL;DR: T2I models DALL-E 3 and Gemini 3 Pro Preview show systematic biases in depicting people from different nationalities, disproportionately showing traditional attire for certain regions and income groups, with pipeline components (generators, evaluators, prompt revisions) all contributing to these representational patterns.

DetailsMotivation: To investigate how popular text-to-image models represent people from different nationalities in everyday activities, and to understand potential biases in their depictions, particularly regarding traditional attire and regional/income-based patterns.

Method: Generated 2,060 images using DALL-E 3 and Gemini 3 Pro Preview with prompts specifying 206 nationalities across five everyday activities. Analyzed traditional attire representation, impractical attire in athletics, and used CLIP, ALIGN, and GPT-4.1 mini to score 9,270 image-prompt pairs for alignment. Conducted statistical analysis of regional and income group patterns.
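
For the alignment-scoring step, here is a minimal example with Hugging Face CLIP comparing a country-specific and a country-free prompt against one generated image; the file path and prompts are placeholders, and ALIGN/GPT-4.1 mini scoring would follow the same pattern.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Score image-prompt alignment with CLIP (one of the three scorers used).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")                 # placeholder path to a generated image
prompts = ["a person from Japan cooking dinner",
           "a person cooking dinner"]               # with vs. without the country name

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image      # scaled cosine similarities, shape (1, 2)
print(scores)                                       # compare alignment across the two prompts
```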

Result: 28.4% of images depicted traditional attire, often impractical for activities. Middle East & North Africa and Sub-Saharan Africa disproportionately affected, with similar patterns for World Bank income groups. Traditional attire images received higher alignment scores when country names were included. One model frequently inserted “traditional” in prompts (50.3% for traditional-labeled images vs. 16.6% otherwise).

Conclusion: Text-to-image models exhibit systematic representational biases shaped by multiple pipeline components (image generators, evaluation models, prompt revisions), disproportionately depicting certain regions and income groups in traditional attire, which can reinforce stereotypes and affect image-text alignment.

Abstract: This paper investigates how popular text-to-image (T2I) models, DALL-E 3 and Gemini 3 Pro Preview, depict people from 206 nationalities when prompted to generate images of individuals engaging in common everyday activities. Five scenarios were developed, and 2,060 images were generated using input prompts that specified nationalities across five activities. When aggregating across activities and models, results showed that 28.4% of the images depicted individuals wearing traditional attire, including attire that is impractical for the specified activities in several cases. This pattern was statistically significantly associated with regions, with the Middle East & North Africa and Sub-Saharan Africa disproportionately affected, and was also associated with World Bank income groups. Similar region- and income-linked patterns were observed for images labeled as depicting impractical attire in two athletics-related activities. To assess image-text alignment, CLIP, ALIGN, and GPT-4.1 mini were used to score 9,270 image-prompt pairs. Images labeled as featuring traditional attire received statistically significantly higher alignment scores when prompts included country names, and this pattern weakened or reversed when country names were removed. Revised prompt analysis showed that one model frequently inserted the word “traditional” (50.3% for traditional-labeled images vs. 16.6% otherwise). These results indicate that these representational patterns can be shaped by several components of the pipeline, including image generator, evaluation models, and prompt revision.

[240] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

Main category: cs.CV

TL;DR: CPL improves AiOIR by learning discriminative prompts through sparse representation and contrastive regularization, achieving SOTA performance across multiple benchmarks.

DetailsMotivation: Current AiOIR approaches struggle with task-aware prompts: adaptive prompts lead to overlapping representations, while classifier-based prompts lose visual reconstruction information needed for restoration.

Method: Contrastive Prompt Learning (CPL) with two components: Sparse Prompt Module (SPM) for efficient degradation-aware representation with reduced redundancy, and Contrastive Prompt Regularization (CPR) that strengthens task boundaries using negative prompts across degradation types.
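
The CPR component resembles an InfoNCE objective with cross-degradation negatives; the sketch below shows that generic form under assumed prompt shapes, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_prompt_loss(prompt, pos_prompt, neg_prompts, tau=0.07):
    """InfoNCE-style regularizer: pull a degradation's prompt toward its own
    task representation and away from prompts of other degradation types."""
    q = F.normalize(prompt, dim=-1)                                      # (D,)
    keys = F.normalize(torch.stack([pos_prompt] + neg_prompts), dim=-1)  # (1+K, D)
    logits = keys @ q / tau                                              # positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy prompts for rain / noise / haze degradation types.
d = 128
rain, rain_pos = torch.randn(d), torch.randn(d)
negs = [torch.randn(d) for _ in range(2)]           # prompts of the other degradations
print(contrastive_prompt_loss(rain, rain_pos, negs))
```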

Result: Extensive experiments across five benchmarks show CPL consistently boosts strong AiOIR baselines, achieving state-of-the-art average performance across diverse scenarios.

Conclusion: CPL provides a general and robust solution for AiOIR by directly optimizing prompt-restoration model interaction rather than focusing solely on degradation classification.

Abstract: All-in-One Image Restoration (AiOIR), which addresses diverse degradation types with a unified model, presents significant challenges in designing task-aware prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but discard critical visual information needed for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a framework that aims to improve prompt-task alignment through two complementary components: a Sparse Prompt Module (SPM) that efficiently captures degradation-aware representations while reducing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL directly optimizes the interaction between prompts and the restoration model. Extensive experiments across five benchmarks show that CPL consistently boosts the performance of strong AiOIR baselines across diverse scenarios. Our approach achieves state-of-the-art average performance on these benchmarks, providing a general and robust solution for AiOIR. The code is available at https://github.com/Aitical/CPLIR

[241] Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining

Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, Yang Gao

Main category: cs.CV

TL;DR: ISA adapts pre-trained FSS models to new domains without retraining by identifying domain-specific model structures via Fisher scores and progressively training them with few-shot support samples.

DetailsMotivation: Cross-domain few-shot segmentation faces challenges due to domain shifts and limited support data. Existing methods require costly redesign and retraining of models using abundant source domain data, which is inefficient.

Method: 1) Adaptively identify domain-specific model structures using a novel structure Fisher score to measure parameter importance. 2) Progressively train selected informative structures with hierarchically constructed training samples from fewer to more support shots.
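
A sketch of the importance-scoring step, assuming the common squared-gradient approximation of Fisher information on the few-shot support set; the paper's exact structure Fisher score may differ:

```python
import torch

def structure_fisher_scores(model, support_loader, loss_fn):
    """Approximate per-parameter Fisher information as the expected squared
    gradient of the loss on the support set, then reduce to one importance
    score per named structure (here, per parameter tensor)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for images, masks in support_loader:
        model.zero_grad()
        loss = loss_fn(model(images), masks)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    # mean Fisher over each structure's parameters
    return {n: f.mean().item() for n, f in fisher.items()}

# The highest-scoring structures would then be unfrozen and trained
# progressively on 1-shot, then k-shot, support samples.
```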

Result: Extensive experiments show superior performance across multiple CD-FSS benchmarks, demonstrating effective domain adaptation without source domain retraining.

Conclusion: ISA provides flexible adaptation capabilities for existing FSS models to handle new domains, eliminating the need for costly redesign or retraining while effectively addressing domain shifts.

Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using abundant base data from the source domain, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for source domain retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks. Codes are at https://github.com/fanq15/ISA.

[242] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang

Main category: cs.CV

TL;DR: MCITlib is a comprehensive library for Multimodal Continual Instruction Tuning that implements 8 algorithms, evaluates on 3 benchmarks with 2 backbone models to address challenges in Multimodal Continual Learning.

DetailsMotivation: The rise of Multimodal Large Language Models (MLLMs) introduces new challenges in Multimodal Continual Learning (MCL), where models must handle both catastrophic forgetting and cross-modal coordination, requiring specialized tools for research advancement.

Method: Developed MCITlib, a comprehensive library that implements 8 representative algorithms for Multimodal Continual Instruction Tuning, with evaluations conducted on 3 benchmarks using 2 backbone models.

Result: Created an open-source library (MCITlib) that provides tools for MCL research, currently supporting multiple algorithms and benchmarks, with plans for continuous updates to support future developments in the field.

Conclusion: MCITlib addresses the need for specialized tools in Multimodal Continual Learning research and will serve as a foundation for advancing the field through ongoing updates and community contributions.

Abstract: Continual learning enables AI systems to acquire new knowledge while retaining previously learned information. While traditional unimodal methods have made progress, the rise of Multimodal Large Language Models (MLLMs) brings new challenges in Multimodal Continual Learning (MCL), where models are expected to address both catastrophic forgetting and cross-modal coordination. To advance research in this area, we present MCITlib, a comprehensive library for Multimodal Continual Instruction Tuning. MCITlib currently implements 8 representative algorithms and conducts evaluations on 3 benchmarks under 2 backbone models. The library will be continuously updated to support future developments in MCL. The codebase is released at https://github.com/Ghy0501/MCITlib.

[243] GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao

Main category: cs.CV

TL;DR: GoMatching++ transforms image text spotters into video specialists by freezing the image model and adding a lightweight tracker, achieving state-of-the-art performance on video text spotting benchmarks with minimal training data.

DetailsMotivation: Current video text spotting methods underperform compared to image text spotting due to limited recognition capability, even after extensive end-to-end training. There's a need for parameter- and data-efficient approaches that can leverage existing image text spotters for video tasks.

Method: Freezes an off-the-shelf image text spotter and introduces a lightweight, trainable tracker with two key components: (1) a rescoring mechanism to bridge image-video domain gap, and (2) LST-Matcher to enhance the frozen spotter’s video text handling capability. Explores various LST-Matcher architectures for efficiency.
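
A hedged sketch of the rescoring idea: a small trainable head recalibrates the frozen image spotter's confidences for the video domain. All names and shapes are illustrative, not the paper's actual module:

```python
import torch
import torch.nn as nn

class Rescorer(nn.Module):
    """Lightweight trainable head that fuses each detection's feature with
    the frozen spotter's score and emits a video-domain score."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, instance_feats, image_scores):
        x = torch.cat([instance_feats, image_scores.unsqueeze(-1)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)

# The image text spotter stays frozen; only this head and the LST-Matcher
# tracker would receive gradients, keeping training cheap.
```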

Result: Sets new performance records on ICDAR15-video, DSText, and BOVText benchmarks while significantly reducing training costs. Introduces ArTVideo benchmark with over 30% curved text annotations.

Conclusion: GoMatching++ provides an efficient way to transform image text spotters into video specialists, and the ArTVideo benchmark addresses the lack of curved text datasets in VTS, both contributing to future advancements in video text spotting.

Abstract: Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter’s ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.

[244] TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

Xinqi Xiong, Prakrut Patel, Qingyuan Fan, Amisha Wadhwa, Sarathy Selvam, Xiao Guo, Luchao Qi, Xiaoming Liu, Roni Sengupta

Main category: cs.CV

TL;DR: TalkingHeadBench is a new benchmark and dataset for evaluating deepfake talking-head detection methods against the most advanced generative models, addressing limitations of current outdated benchmarks.

DetailsMotivation: Current deepfake talking-head detection benchmarks are outdated, using old generators and providing limited insight into model robustness and generalization, while real-world deepfake technology has advanced significantly posing substantial risks.

Method: Created a comprehensive multi-model multi-generator benchmark with deepfakes from leading academic and commercial models, featuring protocols to assess generalization under distribution shifts in identity and generator characteristics.

Result: Benchmarked diverse detection methods (CNNs, vision transformers, temporal models) and analyzed their robustness and generalization capabilities, providing error analysis with Grad-CAM visualizations to expose failure modes and detector biases.

Conclusion: TalkingHeadBench provides an open-access benchmark to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.

Abstract: The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on https://huggingface.co/datasets/luchaoqi/TalkingHeadBench with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.

[245] Controllable Human-centric Keyframe Interpolation with Generative Prior

Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy

Main category: cs.CV

TL;DR: PoseFuse3D-KI integrates 3D human guidance into video diffusion for controllable human-centric keyframe interpolation, outperforming existing methods with a 9% PSNR improvement and a 38% LPIPS reduction.


DetailsMotivation: Existing video interpolation methods struggle with complex human motions due to lack of 3D geometric guidance and offer limited control over synthesized dynamics.

Method: PoseFuse3D-KI framework integrates 3D human guidance via a novel SMPL-X encoder that transforms 3D geometry/shape into 2D latent conditioning, combined with fusion network integrating 3D cues with 2D pose embeddings.

Result: Outperforms state-of-the-art baselines on CHKI-Video dataset with 9% PSNR improvement and 38% LPIPS reduction. Comprehensive ablations show improved interpolation fidelity.

Conclusion: Integrating 3D human guidance into diffusion process enables more plausible and controllable human-centric keyframe interpolation for complex articulated motions.

Abstract: Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

[246] Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He

Main category: cs.CV

TL;DR: MAPLE uses MLLMs’ inherent alignment capabilities to guide cross-modal representation learning via preference optimization, achieving better fine-grained retrieval than previous methods.

DetailsMotivation: Despite CLIP's cross-modal retrieval capabilities, there's still a substantial modality gap. While MLLMs show inherent alignment properties, existing MLLM-based retrievers use coarse alignment mechanisms that limit their potential.

Method: MAPLE leverages MLLMs’ fine-grained alignment priors through reinforcement learning: (1) automatic preference data construction using off-the-shelf MLLMs, and (2) Relative Preference Alignment (RPA) loss that adapts Direct Preference Optimization to embedding learning.
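
A minimal sketch of a DPO-style preference loss adapted to embedding similarities, which is the spirit of RPA; the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def relative_preference_alignment_loss(q, pos, neg, beta=1.0):
    """For each query, the MLLM-preferred candidate should score higher
    than the dispreferred one, via a DPO-style logistic objective.

    q, pos, neg: (B, D) query / preferred / dispreferred embeddings.
    """
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    s_pos = (q * pos).sum(-1)      # similarity to preferred candidate
    s_neg = (q * neg).sum(-1)      # similarity to dispreferred candidate
    return -F.logsigmoid(beta * (s_pos - s_neg)).mean()
```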

Result: Experimental results show substantial gains in fine-grained cross-modal retrieval, demonstrating effectiveness in handling nuanced semantic distinctions.

Conclusion: MAPLE successfully bridges the modality gap by using MLLMs’ alignment capabilities to guide cross-modal representation learning through preference optimization, outperforming previous methods in fine-grained retrieval tasks.

Abstract: Despite Contrastive Language-Image Pretraining (CLIP)’s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using off-the-shelf MLLMs, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.

[247] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

Main category: cs.CV

TL;DR: OmniVCus: A diffusion Transformer framework for multi-subject video customization with control signals like depth and mask, using novel data construction and embedding mechanisms.

DetailsMotivation: Existing methods focus on single-subject video customization due to lack of multi-subject training data, and there's limited exploration of using control signals (depth, mask, camera, text) to edit subjects in customized videos.

Method: 1) VideoCus-Factory pipeline to create multi-subject training data from raw videos without labels; 2) Image-Video Transfer Mixed training with image editing data; 3) OmniVCus diffusion Transformer with Lottery Embedding (enables more subjects) and Temporally Aligned Embedding (extracts guidance from control signals).

Result: Significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations, enabling multi-subject video customization with control signal guidance.

Conclusion: The proposed framework successfully addresses multi-subject video customization challenges and enables instructive editing using various control signals, with released code, models, and data for community use.

Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging and less explored problem is how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code, models, data are released at https://github.com/caiyuanhao1998/Open-OmniVCus

[248] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu

Main category: cs.CV

TL;DR: GTTA is a general test-time augmentation method that works across vision and non-vision tasks by perturbing PCA subspace projections and using self-supervised learning to reduce computational cost.

DetailsMotivation: Existing test-time augmentation methods are often task-specific and not generalizable across different domains. There's a need for an off-the-shelf TTA approach that works for various tasks like classification, regression, segmentation, and detection.

Method: GTTA uses random perturbations of PCA subspace projections to create diverse augmented samples from the data distribution. It includes a self-supervised learning stage where ensemble outputs act as an unsupervised teacher to train the initial single student model, reducing test-time computation.
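
A minimal sketch of PCA-subspace perturbation at test time, assuming scikit-learn and a generic `model.predict` interface; the noise scaling by per-component variance is an assumption, not the paper's exact recipe:

```python
import numpy as np
from sklearn.decomposition import PCA

def gtta_predict(model, x_test, x_fit, n_aug=16, sigma=0.1, n_components=32):
    """Test-time augmentation by randomly perturbing the PCA projection of
    a test input, then ensembling predictions over the augmented copies.

    model:  any predictor exposing .predict(X)
    x_test: (D,) flattened test sample
    x_fit:  (N, D) data used to fit the PCA subspace
    """
    pca = PCA(n_components=n_components).fit(x_fit)
    z = pca.transform(x_test[None])                    # (1, K) subspace projection
    noise = sigma * np.random.randn(n_aug, n_components) \
            * np.sqrt(pca.explained_variance_)         # variance-scaled perturbation
    x_aug = pca.inverse_transform(z + noise)           # (n_aug, D) valid augmentations
    preds = np.asarray(model.predict(x_aug))
    return preds.mean(axis=0)                          # ensemble average
```

The self-supervised distillation stage would then train the single student on these ensemble outputs, so the ensemble cost is paid only once.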

Result: GTTA outperforms strong TTA approaches and state-of-the-art models on various vision and non-vision tasks including image classification, segmentation, pneumonia detection, speech recognition, and house price prediction. It also proves effective on salmon segmentation/detection in low-visibility underwater videos using the new DeepSalmon dataset.

Conclusion: GTTA is a highly effective and general test-time augmentation method that works across diverse tasks, validated on multiple datasets and real-world applications including underwater video analysis.

Abstract: We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which unlike other existing Test-Time Augmentation approaches from the literature is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times, GTTA creates valid augmented samples from the data distribution with high diversity, properties we theoretically show are essential for a Test-Time Augmentation method to be effective. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus significantly reducing the test-time computational cost. Our comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, pneumonia detection, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.

[249] One Graph to Track Them All: Dynamic GNNs for Single- and Multi-View Tracking

Martin Engilberge, Ivan Vrkic, Friedrich Wilke Grosche, Julien Pilet, Engin Turetken, Pascal Fua

Main category: cs.CV

TL;DR: A unified differentiable model for multi-people tracking that learns to associate detections into trajectories using a dynamic spatiotemporal graph, with a new large-scale multi-view dataset.

DetailsMotivation: Current multi-people tracking approaches often rely on pre-computed tracklets and struggle with occlusions. There's a need for a unified, differentiable model that can handle diverse conditions and improve occlusion handling.

Method: Builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across sequences. The graph can encode scene-specific information to handle occlusions better.

Result: Achieves state-of-the-art performance on public benchmarks and the new dataset, demonstrating flexibility across diverse conditions.

Conclusion: The proposed unified differentiable model with dynamic spatiotemporal graph representation effectively handles multi-people tracking, especially with occlusions. Both the model and new large-scale dataset will be publicly released to advance research.

Abstract: This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.

[250] SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion

Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie

Main category: cs.CV

TL;DR: SplatSSC: A novel monocular 3D semantic scene completion framework using depth-guided Gaussian initialization and decoupled aggregation, achieving SOTA performance with improved efficiency.

DetailsMotivation: Existing object-centric SSC methods using 3D Gaussian primitives suffer from inefficient random initialization and outlier primitives that cause erroneous artifacts, limiting performance and efficiency.

Method: 1) Depth-guided initialization with Group-wise Multi-scale Fusion (GMF) module for sparse representative Gaussian primitives; 2) Decoupled Gaussian Aggregator (DGA) separating geometric and semantic predictions; 3) Probability Scale Loss for optimization.
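
A simplified sketch of depth-guided placement of Gaussian centers by back-projecting sampled pixels through the predicted depth; this compresses the GMF branch into a few lines, and all names and the pinhole intrinsics format are illustrative:

```python
import torch

def depth_guided_gaussian_init(depth, intrinsics, num_primitives=2048):
    """Back-project a sparse set of pixels through predicted depth so that
    initial Gaussian centers land on visible surfaces instead of at random.

    depth:      (H, W) predicted depth map
    intrinsics: (fx, fy, cx, cy) pinhole parameters
    """
    H, W = depth.shape
    ys = torch.randint(0, H, (num_primitives,))
    xs = torch.randint(0, W, (num_primitives,))
    z = depth[ys, xs]
    fx, fy, cx, cy = intrinsics
    X = (xs.float() - cx) * z / fx
    Y = (ys.float() - cy) * z / fy
    return torch.stack([X, Y, z], dim=-1)   # (N, 3) initial Gaussian means
```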

Result: Achieves state-of-the-art on Occ-ScanNet: +6.3% IoU and +4.1% mIoU improvements, while reducing latency and memory cost by over 9.3% compared to prior approaches.

Conclusion: SplatSSC effectively addresses initialization and outlier issues in Gaussian-based SSC through depth guidance and decoupled aggregation, delivering superior performance with enhanced efficiency.

Abstract: Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory cost by more than 9.3%.

[251] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition

Rodrigo Verschae, Ignacio Bugueno-Cordova

Main category: cs.CV

TL;DR: evTransFER: A transfer learning framework for facial expression recognition using event-based cameras, achieving state-of-the-art results on both synthetic and real neuromorphic datasets.

DetailsMotivation: Event-based cameras offer unique advantages for capturing facial dynamics (microsecond latency, high temporal resolution, high dynamic range), but existing methods for facial expression recognition with these sensors need improvement. The authors aim to leverage transfer learning to better encode facial spatiotemporal dynamics.

Method: Proposes evTransFER framework with: 1) Feature extractor trained via adversarial generative method on facial reconstruction, then transferred to FER system; 2) Architecture incorporating LSTM to capture longer-term facial expression dynamics; 3) New event-based representation called TIE (Time-Integrated Events).
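
One plausible reading of a time-integrated event frame is sketched below; the paper does not spell out the TIE definition here, so the timestamp-weighted accumulation is an assumption:

```python
import numpy as np

def time_integrated_events(xs, ys, ts, ps, height, width):
    """Accumulate event polarities weighted by normalized timestamps, so
    recent events dominate the frame (one possible TIE-style encoding).

    xs, ys: per-event pixel coordinates; ts: timestamps; ps: polarities in {-1, +1}.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    t_norm = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)
    np.add.at(frame, (ys, xs), ps * t_norm)   # scatter-add one value per event
    return frame
```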

Result: Achieved 93.6% recognition rate on synthetic e-CK+ dataset (surpassing state-of-the-art) and 76.7% accuracy on real neuromorphic NEFER dataset with sensor noise and sparse activity. Outperformed both current methodologies and models trained from scratch in both datasets.

Conclusion: The transfer learning strategy effectively encodes facial spatiotemporal dynamics from event-based cameras, improving facial expression recognition performance compared to training from scratch, and works well on both synthetic and real-world noisy neuromorphic data.

Abstract: Event-based cameras are bio-inspired sensors that asynchronously capture pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing information on the spatiotemporal dynamics of a scene. We propose evTransFER, a transfer learning-based framework for facial expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode facial spatiotemporal dynamics, built by training an adversarial generative method on facial reconstruction and transferring the encoder weights to the facial expression recognition system. We demonstrate that the proposed transfer learning method improves facial expression recognition compared to training a network from scratch. We propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics and introduces a new event-based representation called TIE. We evaluated the framework using both the synthetic event-based facial expression database e-CK+ and the real neuromorphic dataset NEFER. On e-CK+, evTransFER achieved a recognition rate of 93.6%, surpassing state-of-the-art methods. For NEFER, which comprises event sequences with real sensor noise and sparse activity, the proposed transfer learning strategy achieved an accuracy of up to 76.7%. On both datasets, the results surpassed current methods and exceeded those of models trained from scratch.

[252] CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning

Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao

Main category: cs.CV

TL;DR: CVBench is a diagnostic benchmark for evaluating cross-video relational reasoning in multimodal LLMs, revealing significant performance gaps compared to humans.

DetailsMotivation: Current MLLMs excel at single-video tasks but lack capability for spatiotemporal pattern reasoning across multiple videos, which is essential for real-world applications like multi-camera surveillance and cross-video procedural learning.

Method: Created CVBench with 1,000 QA pairs across three hierarchical tiers: cross-video object association, cross-video event association, and cross-video complex reasoning. Built from five domain-diverse video clusters and evaluated 10+ leading MLLMs under zero-shot or chain-of-thought prompting.

Result: Significant performance gaps: top models like GPT-4o achieve only 63.5% accuracy on causal reasoning vs. 91.3% human accuracy. Analysis reveals fundamental bottlenecks in current MLLM architectures, including deficient inter-video context retention and poor disambiguation of overlapping entities.

Conclusion: CVBench establishes a rigorous framework for advancing pattern recognition in multi-video scenarios and provides architectural insights for next-generation models. The benchmark reveals critical limitations in current MLLMs for cross-video relational reasoning.

Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to analyze and integrate spatiotemporal patterns from dynamic visual streams. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 63.5% accuracy on causal reasoning tasks, compared to the 91.3% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for advancing pattern recognition methodologies in multi-video scenarios, providing architectural insights for next-generation models. The data and evaluation code are available at: https://github.com/Hokhim2/CVBench.

[253] Aligned Anchor Groups Guided Line Segment Detector

Zeyu Li, Annan Shu

Main category: cs.CV

TL;DR: AAGLSD is a novel line segment detector using aligned anchor groups and hierarchical pixel extraction to achieve high precision and completeness without complex refinement.

DetailsMotivation: To develop a line segment detector that can extract complete line segments from images with high precision, addressing limitations of existing methods in terms of completeness and avoiding complex refinement strategies.

Method: Uses hierarchical approach to extract candidate pixels with different saliency levels (regular anchors and aligned anchor groups). Starts from aligned anchor groups, sequentially links anchors while updating predicted line segments simultaneously. Final predictions come from simple validation and merging of adjacent segments.
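
A rough sketch of sequential anchor linking with an on-the-fly line refit (total least squares via SVD); the greedy acceptance rule and tolerance are illustrative simplifications of the paper's linking procedure:

```python
import numpy as np

def link_anchors(seed_group, candidates, dist_tol=2.0):
    """Starting from an aligned anchor group, greedily absorb candidate
    anchors that lie close to the current line fit, refitting after each
    accepted anchor.

    seed_group, candidates: iterables of (x, y) anchor coordinates.
    """
    pts = list(seed_group)
    for p in candidates:
        pts_arr = np.asarray(pts, dtype=float)
        centered = pts_arr - pts_arr.mean(0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[1]                     # second principal axis = line normal
        residual = (np.asarray(p, dtype=float) - pts_arr.mean(0)) @ normal
        if abs(residual) <= dist_tol:
            pts.append(p)                  # accept anchor; line refits next pass
    return np.asarray(pts)
```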

Result: Quantitative experiments on various datasets show AAGLSD effectively extracts complete line segments compared to other advanced detectors. Implementation is publicly available.

Conclusion: AAGLSD successfully detects line segments with high precision and completeness using aligned anchor groups and simple validation/merging, outperforming existing methods without complex refinement.

Abstract: This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at https://github.com/zyl0609/AAGLSD.

[254] Bidirectional Sparse Attention for Faster Video Diffusion Training

Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

Main category: cs.CV

TL;DR: BSA is a bidirectional sparse attention framework that dynamically sparsifies both queries and key-value pairs in video diffusion transformers, achieving up to 20x FLOPs reduction and 17.79x faster attention training while maintaining generative quality.

DetailsMotivation: Video diffusion transformers suffer from quadratic complexity of full attention, leading to prohibitively high computational costs for high-resolution, long-duration videos. Full attention is inefficient due to excessive computation from sparse Q/KV pairs and redundant computation from fixed sparse patterns that don't leverage DiT's dynamic attention.

Method: Bidirectional Sparse Attention (BSA) framework with two key components: 1) Query sparsity optimization via semantic similarity selection of most informative query tokens with dynamic spatial-time training strategy, and 2) KV sparsity achieved by computing statistical dynamic thresholds to retain only the most salient KV blocks.
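
A sketch of the KV-side sparsity step, keeping only blocks whose saliency clears a mean-plus-std threshold; the choice of mean key norm as the saliency statistic is an assumption:

```python
import torch

def select_salient_kv_blocks(k, block_size=64, alpha=1.0):
    """Return indices of KV blocks whose saliency exceeds a statistical
    dynamic threshold (mean + alpha * std), discarding the rest.

    k: (L, D) key tensor for one attention head.
    """
    L, D = k.shape
    n_blocks = L // block_size
    blocks = k[: n_blocks * block_size].view(n_blocks, block_size, D)
    saliency = blocks.norm(dim=-1).mean(dim=-1)          # (n_blocks,) mean key norm
    threshold = saliency.mean() + alpha * saliency.std()
    return torch.nonzero(saliency >= threshold).squeeze(-1)
```

Query sparsity would follow the same pattern on the query side, with the retained blocks feeding a block-sparse attention kernel.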

Result: BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.

Conclusion: BSA provides an efficient solution to the computational bottlenecks in video diffusion transformers by dynamically sparsifying both queries and key-value pairs, enabling faster training and inference for high-resolution, long-duration video generation without sacrificing quality.

Abstract: Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT’s dynamic attention. To overcome this limitation, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity and with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.

[255] A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation

Melika Sabaghian, Mohammad Ali Keyvanrad, Seyyedeh Mahila Moghadami

Main category: cs.CV

TL;DR: A three-stage compression pipeline for YOLOv8 achieves 73.5% parameter reduction with minimal accuracy loss, enabling real-time aerial object detection on edge devices.

DetailsMotivation: Deploying deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance, as current models are too large and computationally expensive for edge deployment.

Method: Three-stage compression pipeline: 1) Sparsity-aware training with dynamic sparsity during optimization, 2) Structured channel pruning using batch normalization scaling factors to eliminate redundant channels, 3) Channel-Wise Knowledge Distillation (CWD) with adjustable temperature and loss weighting to recover accuracy after pruning.
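
The channel-selection step follows the classic network-slimming recipe; below is a minimal sketch of ranking channels by the absolute batch-norm scale (the actual surgery on the convolution weights is omitted):

```python
import torch
import torch.nn as nn

def bn_channels_to_prune(model, prune_ratio=0.5):
    """Rank channels by |gamma| across all BatchNorm2d layers and mark the
    smallest fraction for removal (network-slimming style selection)."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            plan[name] = (m.weight.detach().abs() < threshold).nonzero().flatten()
    return plan   # module name -> channel indices slated for pruning
```

Sparsity-aware training beforehand drives many gammas toward zero, so the global quantile threshold removes channels that contribute little to the output.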

Result: For YOLOv8m: Parameters reduced from 25.85M to 6.85M (73.51% reduction), FLOPs from 49.6G to 13.3G, MACs from 101G to 34.5G, with only 2.7% AP50 drop. Inference speed improved from 26 FPS to 45 FPS, and with TensorRT optimization to 68 FPS (AP50 47.6).

Conclusion: The proposed compression pipeline effectively balances model size reduction and detection accuracy, enabling real-time aerial object detection on resource-constrained edge devices with practical deployment capabilities.

Abstract: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.

[256] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin, Fei Yu, Wei Zhou, Yanfei Zhong, Yang Liu, Dingkang Yang

Main category: cs.CV

TL;DR: The paper introduces ChangeIMTI, a large-scale interactive multi-task instruction dataset for remote sensing change understanding, and proposes ChangeVG, a vision-guided vision-language model with dual-granularity awareness for bi-temporal remote sensing images.

DetailsMotivation: Existing remote sensing change understanding datasets lack deep understanding and interactions across diverse tasks like change captioning, counting, and localization. There's a need for comprehensive datasets and models that can handle multiple complementary tasks in remote sensing change analysis.

Method: 1) Construct ChangeIMTI dataset covering four tasks: change captioning, binary change classification, change counting, and change localization. 2) Propose ChangeVG model with vision-guided module using dual-branch architecture combining fine-grained spatial features with high-level semantic summarization. 3) Use enriched representations as auxiliary prompts to guide large vision-language models during instruction tuning for hierarchical cross-modal learning.

Result: The method outperforms the strongest baseline Semantic-CC by 1.39 points on the comprehensive S*m metric for change captioning task. Extensive experiments across four tasks demonstrate superiority, with ablation studies validating critical components.

Conclusion: The proposed ChangeIMTI dataset and ChangeVG model effectively address limitations in remote sensing change understanding by providing comprehensive multi-task capabilities and achieving state-of-the-art performance through vision-guided hierarchical learning.

Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks including change captioning, binary change classification, change counting, and change localization. Building upon this new dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations further serve as the auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating hierarchical cross-modal learning. We conduct extensive experiments across four tasks to demonstrate the superiority of our approach. Remarkably, on the change captioning task, our method outperforms the strongest method Semantic-CC by 1.39 points on the comprehensive S*m metric, which integrates semantic similarity and descriptive accuracy to provide an overall evaluation of change captions. Moreover, we also perform a series of ablation studies to examine the critical components of our method. The source code and associated data for this work are publicly available at Github.

[257] Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss

Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou

Main category: cs.CV

TL;DR: Proposes SFT3D: an unsupervised online 3D instance segmentation framework that improves upon UNIT by using synthetic point cloud sequences for training diversity, flexible temporal sampling, and dynamic-weighting loss.

DetailsMotivation: Existing unsupervised 3D instance segmentation methods like UNIT have limitations: limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. Need better methods for consistent object tracking across LiDAR scans without annotated data.

Method: 1) Synthetic point cloud sequence generation for diverse training without manual labels or simulation engines; 2) Flexible temporal sampling using both adjacent and non-adjacent frames to capture long-range dependencies; 3) Dynamic-weighting loss that emphasizes confident and informative samples.
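
One plausible form of a dynamic-weighting loss that down-weights low-confidence pseudo-labels is sketched below; the confidence exponent and weighting scheme are assumptions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dynamic_weighted_loss(logits, pseudo_labels, gamma=2.0):
    """Per-sample cross-entropy reweighted by the model's own confidence in
    the pseudo-label, so noisy labels contribute less to the gradient.

    logits: (N, C) per-point class logits; pseudo_labels: (N,) noisy labels.
    """
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # (N,)
    with torch.no_grad():
        conf = logits.softmax(-1).gather(1, pseudo_labels[:, None]).squeeze(1)
        weights = conf ** gamma          # confident samples weigh more
    return (weights * ce).sum() / weights.sum().clamp(min=1e-8)
```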

Result: Outperforms UNIT and other unsupervised baselines on SemanticKITTI, nuScenes, and PandaSet datasets. Achieves higher segmentation accuracy and more robust temporal associations.

Conclusion: The proposed SFT3D framework effectively addresses limitations of existing unsupervised 3D instance segmentation methods through synthetic data generation, flexible temporal modeling, and adaptive loss weighting, demonstrating superior performance across multiple benchmarks.

Abstract: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.

[258] Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

Alexander Goslin

Main category: cs.CV

TL;DR: Terrain Diffusion: A generative framework using diffusion models for infinite, realistic terrain generation with procedural noise properties (seamless infinite extent, seed-consistency, constant-time access).

DetailsMotivation: Procedural noise functions like Perlin noise have been limited in realism and large-scale coherence despite being fast and infinite. The paper aims to bridge diffusion model fidelity with procedural noise's essential properties for next-generation virtual worlds.

Method: Introduces InfiniteDiffusion algorithm for unbounded domain generation, hierarchical diffusion stack for planetary context with local detail, compact Laplacian encoding for Earth-scale dynamic range stability, and open-source infinite-tensor framework for constant-memory manipulation.

Result: The framework generates terrain nine times faster than orbital velocity on a consumer GPU, enabling realistic terrain generation at interactive rates while maintaining procedural noise properties like infinite extent and seed-consistency.

Conclusion: Terrain Diffusion positions diffusion models as a practical, scalable foundation for infinite virtual worlds, combining diffusion model fidelity with procedural noise’s essential properties.

Abstract: For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, a generative framework that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation that reformulates standard diffusion sampling for unbounded domains. While noise functions remain near-instant, our framework outpaces orbital velocity by 9 times on a consumer GPU, enabling realistic terrain generation at interactive rates. We integrate a hierarchical stack of diffusion models to couple planetary context with local detail, a compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and an open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors. Together, these components position diffusion models as a practical, scalable foundation for the next generation of infinite virtual worlds.

[259] Bridging the Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation

Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

Main category: cs.CV

TL;DR: IUT-Plug introduces a neuro-symbolic structured state tracking mechanism using Image Understanding Trees to prevent multimodal context drift in Vision Language Models during extended image-text interactions.

DetailsMotivation: Existing VLMs struggle to preserve logic, entity identity, and artistic style during extended interleaved image-text interactions due to "Multimodal Context Drift" - the decay or entanglement of implicit neural representations over long sequences.

Method: IUT-Plug uses Image Understanding Trees as explicit, persistent memory modules. It parses visual scenes into hierarchical symbolic structures (entities, attributes, relationships), performs incremental state updates to lock invariant properties while modifying changing elements, and guides generation through topological constraints.
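
A toy sketch of the incremental state update on an Image Understanding Tree node, with invariant attributes locked across turns; the structure and field names are illustrative, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class IUTNode:
    """Minimal IUT node: an entity with locked (invariant) and mutable
    attributes, plus children for hierarchical scene structure."""
    entity: str
    attributes: dict = field(default_factory=dict)
    locked: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def update(self, changes: dict):
        # incremental state update: locked properties stay fixed,
        # only unlocked attributes may change between turns
        for key, value in changes.items():
            if key not in self.locked:
                self.attributes[key] = value

cat = IUTNode("cat", {"color": "black", "pose": "sitting"}, locked={"color"})
cat.update({"color": "white", "pose": "jumping"})
assert cat.attributes == {"color": "black", "pose": "jumping"}
```

The serialized tree then conditions the next generation step, which is what enforces the topological constraints mentioned above.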

Result: Evaluation on a novel benchmark of 3,000 human-annotated samples shows IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines.

Conclusion: Explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation, addressing limitations of purely neural approaches that rely on transient attention maps.

Abstract: Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as “Multimodal Context Drift”, which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.

[260] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

Main category: cs.CV

TL;DR: CoT-PL introduces visual chain-of-thought reasoning for open-vocabulary object detection, decomposing object understanding into three interpretable steps to improve pseudo-labeling quality in crowded/occluded scenes.

DetailsMotivation: Current OVD methods rely on direct image-text matching, which neglects intermediate reasoning steps needed for complex scenes, resulting in limited robustness in crowded or occluded contexts.

Method: CoT-PL uses structured visual chain-of-thought reasoning with three steps: region perception for unseen objects, category recognition via zero-shot reasoning, and background grounding. This motivates contrastive background learning (CBL) that uses background cues as negatives for feature disentanglement.

Result: Achieves 103.4% and 168.4% relative improvements in novel-class pseudo-label quality in crowded/occluded scenes, +7.7 AP50 on open-vocabulary COCO, and +2.9 mask AP on LVIS for novel classes, setting new state-of-the-art.

Conclusion: CoT-PL demonstrates that integrating structured reasoning into pseudo-labeling significantly improves open-vocabulary detection robustness, especially in challenging crowded and occluded scenarios.

Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available at https://github.com/hchoi256/cotpl.

[261] Space Object Detection using Multi-frame Temporal Trajectory Completion Method

Xiaoqing Lan, Biqiao Xin, Bingshu Wang, Han Zhang, Rui Zhu, Laixian Zhang

Main category: cs.CV

TL;DR: Proposes wavelet transform for single-frame GEO target enhancement and Hungarian algorithm-based multi-frame trajectory completion with post-processing for noise filtering and refinement.

DetailsMotivation: GEO space objects are hard to detect in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference, requiring improved detection methods.

Method: 1) Wavelet transform for single-frame high-frequency feature enhancement and background noise suppression. 2) Multi-frame temporal trajectory completion using Hungarian algorithm for globally optimal cross-frame matching. 3) Post-processing pipeline with temporal matching/interpolation, temporal-consistency noise filtering, and progressive trajectory refinement.
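
The cross-frame association step can be illustrated with SciPy's Hungarian solver; the Euclidean cost matrix and the max_dist gating threshold below are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_pts, curr_pts, max_dist=5.0):
    """Globally optimal one-to-one association of detections across frames.

    prev_pts: (N, 2) detection coordinates in the previous frame
    curr_pts: (M, 2) detection coordinates in the current frame
    Returns (i, j) index pairs whose matched distance is below max_dist.
    """
    cost = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
```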

Result: Achieved 90.14% F1 score on public SpotGEO dataset, demonstrating effectiveness of the proposed method.

Conclusion: The proposed approach effectively addresses GEO target detection challenges through wavelet-based feature enhancement and robust multi-frame trajectory completion with comprehensive post-processing.

Abstract: Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F1 score of 90.14%.

[262] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree

Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng

Main category: cs.CV

TL;DR: VADTree: A training-free video anomaly detection method using hierarchical granularity-aware tree structure with adaptive sampling and LLM reasoning.

DetailsMotivation: Supervised VAD methods require large labeled datasets and lack explainability, while existing training-free methods with fixed temporal windows struggle to capture anomalies of varying durations.

Method: Uses GEBD model to detect event boundaries, constructs HGTree with adaptive coarse-fine hierarchical structuring, injects multi-dimensional priors into VLMs for node anomaly perception, employs LLMs for anomaly reasoning, and integrates scores via inter-cluster correlation.

Result: Achieves SOTA performance in training-free settings on three challenging datasets while drastically reducing sampled video segments.

Conclusion: VADTree provides an effective training-free VAD solution with flexible temporal sampling, explainable anomaly detection, and computational efficiency.

Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.

[263] Towards Generalisable Foundation Models for Brain MRI

Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander

Main category: cs.CV

TL;DR: BrainFound is a self-supervised foundation model for 3D brain MRI that extends DINO-v2 to handle volumetric data, supports multimodal inputs, and outperforms existing methods in label-scarce settings.

DetailsMotivation: Foundation models are transforming medical imaging, but existing approaches often treat MRI as 2D slices rather than full 3D volumes. There's a need for models that can handle 3D brain anatomy, work with multimodal inputs, and perform well in label-scarce clinical settings.

Method: Extends DINO-v2 (vision transformer) to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices. Supports both single- and multimodal inputs (T1, T2, FLAIR) and enables various downstream tasks like disease detection and segmentation.

Result: Consistently outperforms existing self-supervised pretraining strategies and supervised baselines, especially in label-scarce and multi-contrast settings. Enhances diagnostic accuracy and reduces dependency on extensive expert annotations.

Conclusion: BrainFound provides a scalable, practical solution for 3D neuroimaging pipelines with significant potential for clinical deployment and research innovation, offering flexibility across varied imaging protocols and clinical scenarios.

Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.

[264] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu

Main category: cs.CV

TL;DR: EgoMAN: A reasoning-to-motion framework for 3D hand trajectory prediction using a new large-scale egocentric dataset with semantic QA pairs.

DetailsMotivation: Existing 3D hand trajectory prediction methods are limited by datasets that separate motion from semantic supervision and models that weakly connect reasoning with action, lacking proper integration of semantic understanding with motion generation.

Method: 1) Create EgoMAN dataset with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. 2) Develop EgoMAN model as a reasoning-to-motion framework that links vision-language reasoning and motion generation through a trajectory-token interface, trained progressively to align reasoning with motion dynamics.

Result: The approach produces accurate and stage-aware trajectories with generalization capabilities across real-world scenes, overcoming previous limitations of decoupled motion and semantic supervision.

Conclusion: The EgoMAN framework successfully integrates semantic reasoning with motion generation for 3D hand trajectory prediction, enabling stage-aware trajectory prediction that generalizes to real-world scenarios through proper alignment of reasoning and motion dynamics.

Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

[265] Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification

Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann

Main category: cs.CV

TL;DR: Proposes a NAS search space for hybrid CNN-ViT architectures optimized for tinyML deployment, achieving better accuracy and speed than ResNet models under size constraints.

DetailsMotivation: Hybrid CNN-ViT architectures outperform pure CNN or ViT but are too large/computationally expensive for tinyML deployment. Need efficient hybrid architectures that can run on resource-constrained devices.

Method: Introduces a new Neural Architecture Search (NAS) search space for hybrid CNN-ViT architectures. The search space includes: 1) Hybrid CNN and ViT blocks to learn both local and global information, 2) Novel Pooling block with searchable pooling layers for efficient feature map reduction. Applied to CIFAR10 dataset.

Result: The proposed search space produces hybrid CNN-ViT architectures that achieve superior accuracy and inference speed compared to ResNet-based tinyML models under tight model size constraints.

Conclusion: The novel NAS search space enables discovery of efficient hybrid CNN-ViT architectures suitable for tinyML deployment, balancing accuracy and computational efficiency for resource-constrained devices.

Abstract: Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT) have outperformed pure CNN or ViT architecture. However, since these architectures require large parameters and incur large computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as the novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.

[266] FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI

Hao Li, Zhenfeng Zhuang, Jingyu Lin, Yu Liu, Yifei Chen, Qiong Peng, Lequan Yu, Liansheng Wang

Main category: cs.CV

TL;DR: FDP is a frequency-decomposition preprocessing framework that enhances unsupervised anomaly detection in brain MRI by leveraging frequency-domain analysis to suppress pathology while preserving normal anatomy.

DetailsMotivation: Supervised anomaly detection for brain MRI is challenging due to anatomical diversity and scarce annotated data. Current unsupervised methods use artificial noise perturbations that lack biophysical fidelity and morphological complexity of real clinical lesions.

Method: Frequency-Decomposition Preprocessing (FDP) framework based on systematic frequency-domain analysis of pathological signatures. It leverages frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation, and can integrate with existing anomaly simulation techniques.
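
A minimal NumPy sketch of the kind of frequency decomposition this builds on, splitting a 2D slice into low- and high-frequency components with a radial Fourier mask (the cutoff value is an assumed hyperparameter, not the paper's setting):

```python
import numpy as np

def frequency_decompose(img, cutoff=0.1):
    """Split a 2D slice into low-/high-frequency parts with a radial mask.

    img: (H, W) float array; cutoff: normalized radius of the low-pass band.
    """
    spec = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_mask = radius <= cutoff

    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * ~low_mask)).real
    return low, high  # low: consistent anatomy-level signal; high: detail band
```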

Result: FDP consistently improves anomaly detection performance across diverse architectures, achieving 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines.

Conclusion: FDP advances unsupervised anomaly detection in brain MRI by addressing limitations of simulated anomalies through frequency-domain analysis, offering a flexible preprocessing framework that enhances existing methods while maintaining diagnostic fidelity.

Abstract: Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize artificially generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual maps. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework, the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines. The code is available at https://github.com/ls1rius/MRI_FDP.

[267] Inference-based GAN Video Generation

Jingbo Yang, Adrian G. Bors

Main category: cs.CV

TL;DR: Proposes a VAE-GAN hybrid model with Markov chain recall mechanism for generating long videos (hundreds/thousands of frames) while maintaining temporal continuity and quality.

DetailsMotivation: Existing video generation models (GANs, VAEs, Diffusion Networks) struggle with long sequences beyond 16 frames, suffering from degraded quality and lack of meaningful scene successions when scaling temporally.

Method: 1) Develops VAE-GAN hybrid with content/movement branches; 2) Extends with Markov chain framework where each state is a short VAE-GAN generator; 3) Uses recall mechanism to sequentially connect generated sub-sequences while maintaining temporal dependencies.

Result: Enables generation of long videos composed of hundreds or thousands of frames with temporal continuity, consistency, and dynamics, overcoming limitations of classical approaches.

Conclusion: The proposed memory-efficient approach successfully generates meaningful long video sequences by combining VAE-GAN architecture with Markov chain recall mechanism, addressing temporal scaling challenges in video generation.

Abstract: Video generation has seen remarkable progress thanks to advancements in generative deep learning. However, generating long sequences remains a significant challenge. Generated videos should not only display coherent and continuous movement but also meaningful movement in successions of scenes. Models such as GANs, VAEs, and Diffusion Networks have been used for generating short video sequences, typically up to 16 frames. In this paper, we first propose a new type of video generator by enabling adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure. The proposed model, as in other video deep learning-based processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. Classical approaches often result in degraded video quality when attempting to increase the generated video length, especially for significantly long sequences. To overcome this limitation, our research study extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames ensuring their temporal continuity, consistency and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, where each state represents a short-length VAE-GAN video generator. This setup enables the sequential connection of generated video sub-sequences, maintaining temporal dependencies and resulting in meaningful long video sequences.

[268] ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection

Mohammad Romani

Main category: cs.CV

TL;DR: ForensicFlow uses multi-domain fusion across visual, texture, and spectral dimensions with attention-based pooling to detect sophisticated deepfakes, achieving state-of-the-art performance on CelebDF(v2).

DetailsMotivation: Modern deepfakes create subtle, domain-specific artifacts that single-branch detection networks miss, requiring a more comprehensive approach that examines multiple forensic dimensions simultaneously.

Method: Three-branch architecture: ConvNeXt-tiny for global visual inconsistencies, Swin Transformer-tiny for fine-grained texture anomalies, and CNN with channel attention for spectral noise patterns. Uses attention-based temporal pooling to prioritize high-evidence frames and adaptive fusion to weight branches according to forgery type. Trained with Focal Loss on CelebDF(v2).
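
The attention-based temporal pooling can be sketched as a small learned scoring head over per-frame features; the two-layer scorer below is an assumed design, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionTemporalPooling(nn.Module):
    """Learned frame weighting: high-evidence frames dominate the clip vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1))

    def forward(self, frame_feats):            # frame_feats: (B, T, D)
        w = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1) weights
        return (w * frame_feats).sum(dim=1)    # (B, D) clip embedding
```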

Result: Achieves AUC 0.9752, F1 0.9408, and accuracy 0.9208 on CelebDF(v2), outperforming single-stream detectors. Ablation studies confirm branch synergy, and Grad-CAM visualizations show focus on genuine manipulation regions like facial boundaries.

Conclusion: Multi-domain fusion strategy provides robustness against increasingly sophisticated forgeries by comprehensively examining visual, texture, and spectral artifacts, establishing a more effective approach to deepfake detection.

Abstract: Modern deepfakes evade detection by leaving subtle, domain-specific artifacts that single-branch networks miss. ForensicFlow addresses this by fusing evidence across three forensic dimensions: global visual inconsistencies (via ConvNeXt-tiny), fine-grained texture anomalies (via Swin Transformer-tiny), and spectral noise patterns (via CNN with channel attention). Our attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive fusion weights each branch according to forgery type. Trained on CelebDF(v2) with Focal Loss, the model achieves AUC 0.9752, F1 0.9408, and accuracy 0.9208, outperforming single-stream detectors. Ablation studies confirm branch synergy, and Grad-CAM visualizations validate focus on genuine manipulation regions (e.g., facial boundaries). This multi-domain fusion strategy establishes robustness against increasingly sophisticated forgeries.

[269] BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction

Zhengsen Xu, Sibo Cheng, Lanying Wang, Hongjie He, Wentao Sun, Jonathan Li, Lincoln Linlin Xu

Main category: cs.CV

TL;DR: A 25-year daily wildfire dataset covering 240M hectares with 38 covariates is introduced to benchmark time-series forecasting models for wildfire risk prediction.

DetailsMotivation: Wildfire risk prediction is challenging due to complex interactions among multiple factors, and there's a scarcity of publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers.

Method: Created a comprehensive 25-year, daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions with 38 covariates including active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Used this benchmark to evaluate diverse time-series forecasting models including CNN-based, linear-based, Transformer-based, and Mamba-based architectures.

Result: The paper presents a publicly available benchmark dataset and evaluates various forecasting models, investigating the effectiveness of position embedding and the relative importance of different fire-driving factors.

Conclusion: The introduced dataset addresses the scarcity of comprehensive wildfire benchmarks and enables evaluation of diverse time-series forecasting models for wildfire risk prediction, with code and data publicly available.

Abstract: Wildfire risk prediction remains a critical yet challenging task due to the complex interactions among fuel conditions, meteorology, topography, and human activity. Despite growing interest in data-driven approaches, publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers remain scarce. To address this gap, we present a 25-year, daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions. The dataset includes 38 covariates, encompassing active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Using this benchmark, we evaluate a diverse set of time-series forecasting models, including CNN-based, linear-based, Transformer-based, and Mamba-based architectures. We also investigate the effectiveness of position embedding and the relative importance of different fire-driving factors. The dataset and the corresponding code can be found at https://github.com/SynUW/mmFire

[270] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems

Khalfalla Awedat, Mohamed Abidalrekab, Gurcan Comert, Mustafa Ayad

Main category: cs.CV

TL;DR: SuperiorGAT is a graph attention framework that reconstructs missing elevation data in sparse LiDAR point clouds using beam-aware graphs and gated residual fusion, achieving better reconstruction than PointNet and deeper GAT models without increasing network depth.

DetailsMotivation: LiDAR perception in autonomous systems faces limitations due to fixed vertical beam resolution and beam dropout from environmental occlusions, which degrade point cloud quality and affect perception accuracy.

Method: Models LiDAR scans as beam-aware graphs, uses graph attention-based framework with gated residual fusion and feed-forward refinement to reconstruct missing elevation information without increasing network depth.
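
A rough PyTorch Geometric sketch of a gated residual fusion block around graph attention, the kind of refinement described above; layer sizes, the gating form, and the FFN are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # requires torch_geometric

class GatedResidualGATLayer(nn.Module):
    """Gated residual fusion around graph attention, with FFN refinement."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.gat = GATConv(dim, dim // heads, heads=heads)  # output dim = dim
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.ffn = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x, edge_index):   # x: (num_points, dim)
        h = self.gat(x, edge_index)
        g = self.gate(torch.cat([x, h], dim=-1))  # learned mixing coefficient
        x = g * h + (1 - g) * x                   # gated residual fusion
        return x + self.ffn(x)                    # feed-forward refinement
```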

Result: SuperiorGAT achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines across diverse KITTI environments (Person, Road, Campus, City sequences).

Conclusion: Architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware, preserving structural integrity with minimal vertical distortion.

Abstract: LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model’s ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.

[271] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

Main category: cs.CV

TL;DR: ∞-RoPE is a training-free framework that enables infinite-horizon video generation with fine-grained action control and cinematic scene transitions by addressing three core bottlenecks in autoregressive video diffusion models.

DetailsMotivation: Current autoregressive video diffusion models have three key limitations: (1) finite temporal horizon due to 3D-RoPE constraints, (2) slow prompt responsiveness for maintaining action control during long rollouts, and (3) inability to create discontinuous cinematic transitions within a single generation.

Method: ∞-RoPE introduces three interconnected components: Block-Relativistic RoPE (reformulates temporal encoding as moving local reference frames), KV Flush (renews KV cache by keeping only global sink and last frame), and RoPE Cut (introduces controlled discontinuities for scene transitions).
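
The Block-Relativistic RoPE idea can be sketched as a remapping of temporal position indices that pins the newest block to the top of the trained horizon while earlier blocks slide backward; this is a schematic reading of the mechanism, not the authors' implementation.

```python
import torch

def block_relativistic_positions(block_lens, max_frames):
    """Temporal RoPE indices with the newest block pinned to the horizon top.

    block_lens: lengths of the retained latent blocks, oldest first.
    Assumes blocks beyond the horizon have already been evicted (cf. KV
    Flush), so the returned indices stay within [0, max_frames).
    """
    total = sum(block_lens)
    offsets = torch.arange(total) - (total - 1)  # ..., -2, -1, 0 (newest = 0)
    return (max_frames - 1) + offsets            # newest frame -> max index
```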

Result: The framework enables continuous video generation beyond positional limits, immediate prompt responsiveness, and multi-cut scene transitions within single rollouts, consistently surpassing previous autoregressive models in VBench scores.

Conclusion: ∞-RoPE establishes a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion, addressing all three core bottlenecks of current autoregressive video generation models.

Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model’s 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model’s maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.

[272] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

Wenyuan Huang, Zhao Wang, Zhou Wei, Ting Huang, Fang Zhao, Jian Yang, Zhenyu Zhang

Main category: cs.CV

TL;DR: OpenGround: A zero-shot framework for open-world 3D visual grounding that overcomes limitations of pre-defined object lookup tables through active cognition-based reasoning.

DetailsMotivation: Existing 3D visual grounding methods rely on pre-defined Object Lookup Tables (OLTs) to query VLMs, which limits applications in scenarios with undefined or unforeseen targets. This restricts the ability to handle open-world scenarios.

Method: Proposes OpenGround with Active Cognition-based Reasoning (ACR) module that progressively augments VLM cognitive scope. ACR performs human-like perception via cognitive task chain and actively reasons about contextually relevant objects through dynamically updated OLT, enabling both pre-defined and open-world category handling.

Result: Achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers 17.6% improvement on their new OpenTarget dataset containing 7000+ object-description pairs for open-world evaluation.

Conclusion: OpenGround enables zero-shot open-world 3D visual grounding by overcoming fundamental limitations of pre-defined OLTs through active cognition-based reasoning, demonstrating strong performance across standard and open-world benchmarks.

Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at https://why-102.github.io/openground.io/.

[273] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo

Main category: cs.CV

TL;DR: Proposes TALO, a plug-and-play framework for temporal alignment of 3D vision foundation model predictions using Thin Plate Spline with globally propagated control points and point-agnostic submap registration.

DetailsMotivation: Existing 3D vision foundation models lack temporal consistency when deployed in online settings like driving scenarios, and current alignment strategies have limitations in assumption validity, local alignment scope, and robustness to noisy geometry.

Method: Uses Thin Plate Spline with globally propagated control points for higher-DOF, long-term alignment; employs point-agnostic submap registration design for robustness to noisy geometry; fully plug-and-play framework compatible with diverse 3D foundation models and camera configurations.
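
SciPy's RBF interpolator with a thin-plate-spline kernel gives a compact way to illustrate the higher-DOF alignment: fitted on control-point correspondences, it applies a spatially varying correction rather than one global rigid transform. The function below is a sketch under those assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_align(src_ctrl, dst_ctrl, points, smoothing=0.0):
    """Thin-plate-spline warp fitted on control-point correspondences.

    src_ctrl, dst_ctrl: (K, 3) corresponding control points in two submaps.
    points: (N, 3) source-submap points to align.
    """
    tps = RBFInterpolator(src_ctrl, dst_ctrl,
                          kernel='thin_plate_spline', smoothing=smoothing)
    return tps(points)  # (N, 3) spatially varying, higher-DOF correction
```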

Result: Demonstrates more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups (monocular/surround-view), highlighting robustness and generality.

Conclusion: Proposed TALO framework effectively addresses temporal inconsistency in 3D vision foundation models, offering a robust, generalizable solution for online deployment scenarios.

Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at https://github.com/Xian-Bei/TALO.

[274] SoulX-LiveTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Ming Tao, Shunshun Yin, Siyuan Liu

Main category: cs.CV

TL;DR: 14B-parameter framework for real-time audio-driven avatar generation using bidirectional attention distillation and self-correction mechanisms to achieve sub-second latency and 32 FPS throughput.

DetailsMotivation: Existing approaches for real-time audio-driven avatar generation compromise visual fidelity by using strictly unidirectional attention or reducing model capacity due to computational load vs. latency constraints.

Method: Introduces SoulX-LiveTalk with Self-correcting Bidirectional Distillation (retains bidirectional attention within video chunks) and Multi-step Retrospective Self-Correction Mechanism for error recovery. Includes full-stack inference acceleration with hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations.
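
The chunk-wise attention pattern behind the bidirectional distillation can be illustrated with a simple mask that is bidirectional inside each chunk and causal across chunks; a minimal sketch, assuming frame-level granularity:

```python
import torch

def chunk_bidirectional_mask(num_frames, chunk_size, device=None):
    """Attention mask: bidirectional within a chunk, causal across chunks.

    Returns a (T, T) boolean matrix, True where attention is allowed:
    query i may attend to key j iff j's chunk is not later than i's chunk.
    """
    chunk_id = torch.arange(num_frames, device=device) // chunk_size
    return chunk_id[None, :] <= chunk_id[:, None]
```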

Result: Achieves sub-second start-up latency (0.87s) and real-time throughput of 32 FPS, setting new standard for high-fidelity interactive digital human synthesis at 14B scale.

Conclusion: SoulX-LiveTalk successfully addresses the engineering challenge of real-time, infinite-duration audio-driven avatar generation by balancing computational efficiency with visual fidelity through innovative bidirectional attention preservation and self-correction mechanisms.

Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.

[275] DiRe: Diversity-promoting Regularization for Dataset Condensation

Saumyaranjan Mohanty, Aravind Reddy, Konda Reddy Mopuri

Main category: cs.CV

TL;DR: Proposes Diversity Regularizer (DiRe) to reduce redundancy and improve diversity in dataset condensation by combining cosine similarity and Euclidean distance metrics.

DetailsMotivation: Existing dataset condensation methods produce synthesized datasets with significant redundancy, creating a need to reduce redundancy and improve diversity in the condensed datasets.

Method: Introduces a Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance metrics that can be applied off-the-shelf to various state-of-the-art condensation methods.
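
Since the regularizer is described as an off-the-shelf combination of cosine similarity and Euclidean distance, a minimal PyTorch sketch is straightforward; the weighting hyperparameters and per-class batching are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def diversity_regularizer(synth, alpha=1.0, beta=1.0):
    """Penalize pairwise cosine similarity, reward pairwise Euclidean spread.

    synth: (N, C, H, W) synthetic images of one class.
    """
    flat = synth.flatten(1)                          # (N, D)
    z = F.normalize(flat, dim=1)
    n = len(flat)
    off = ~torch.eye(n, dtype=torch.bool, device=flat.device)

    cos_term = (z @ z.t())[off].mean()               # redundancy to shrink
    dist_term = torch.cdist(flat, flat)[off].mean()  # spread to grow
    return alpha * cos_term - beta * dist_term
```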

Result: The regularizer improves state-of-the-art condensation methods on benchmark datasets from CIFAR-10 to ImageNet-1K, enhancing both generalization and diversity metrics.

Conclusion: DiRe is an effective, intuitive regularizer that can be easily integrated into existing condensation methods to reduce redundancy and improve dataset diversity.

Abstract: In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.

[276] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li, Youngjung Uh

Main category: cs.CV

TL;DR: Training-free approach improves subject consistency in text-to-image diffusion models by refining text embeddings to suppress semantic entanglement.

DetailsMotivation: Text-to-image diffusion models struggle with preserving subject consistency across multiple outputs for visual storytelling, and existing approaches require computationally expensive fine-tuning or per-subject optimization.

Method: Proposes a simple training-free approach that refines text embeddings from a geometric perspective to suppress unwanted semantics and address semantic entanglement between frames.
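
One simple geometric reading of "suppressing unwanted semantics" is to project out of each frame's prompt embedding the directions of the other frames' embeddings; the sketch below illustrates that intuition and is not claimed to be the paper's exact operation.

```python
import torch
import torch.nn.functional as F

def suppress_semantics(frame_emb, other_embs, strength=1.0):
    """Project out of one frame's embedding the other frames' directions.

    frame_emb: (D,) embedding of the current scene description.
    other_embs: (K, D) embeddings of the remaining scene descriptions.
    strength: assumed knob controlling how much is removed.
    """
    refined = frame_emb.clone()
    for u in F.normalize(other_embs, dim=-1):     # unit direction per frame
        refined = refined - strength * (refined @ u) * u
    return refined
```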

Result: Extensive experiments show the approach significantly improves both subject consistency and text alignment over existing baselines like 1Prompt1Story.

Conclusion: The training-free geometric refinement of text embeddings effectively addresses semantic leakage and improves subject consistency in visual storytelling applications.

Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.

[277] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: RxnBench is a new benchmark for evaluating Multimodal Large Language Models on chemical reaction understanding from scientific PDFs, revealing significant gaps in chemical logic and structural recognition despite some success with inference-time reasoning.

DetailsMotivation: While MLLMs show promise for revolutionizing chemistry, their ability to understand the dense graphical language of chemical reactions in real scientific literature remains underexplored and needs rigorous evaluation.

Method: Created RxnBench with two tasks: Single-Figure QA (1,525 questions from 305 reaction schemes) testing visual perception and mechanistic reasoning, and Full-Document QA (108 articles) requiring cross-modal integration of text, schemes, and tables.

Result: MLLMs show critical capability gaps - they excel at text extraction but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning outperform standard architectures, but none achieve 50% accuracy on FD-QA.

Conclusion: There’s an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists, as current MLLMs lack the necessary chemical understanding despite their multimodal capabilities.

Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[278] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

Main category: cs.CV

TL;DR: DAVE is a specialized vision encoder for VLMs that addresses the lack of structural/spatial information in existing encoders for document understanding and web agent tasks, using self-supervised pretraining on unlabeled data followed by supervised training with novel model-merging and ensemble techniques.

DetailsMotivation: Current vision-language models have a fundamental weakness: their vision encoders lack robust structural and spatial information essential for document understanding and web agent tasks. This gap limits their effectiveness in these important application domains.

Method: Two-stage training: 1) Self-supervised pretraining on unlabeled images to avoid costly annotations, 2) Supervised autoregressive pretraining on limited high-quality data for parsing and localization. Uses novel model-merging scheme to combine encoders trained with different text decoders, and ensemble training to fuse features from generalist encoders (SigLIP2) with document/web-specific representations.
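
The model-merging scheme can be illustrated in its simplest form as parameter averaging across same-architecture encoders trained with different text decoders; the paper's actual scheme may be more elaborate, so treat this as a baseline sketch.

```python
import torch

def merge_encoders(state_dicts, weights=None):
    """Weighted parameter averaging over same-architecture encoders."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        # .float() keeps the sketch simple; integer buffers would need care.
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged
```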

Result: Extensive experiments on document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of DAVE, establishing it as a strong vision encoder for document and web applications.

Conclusion: DAVE successfully bridges the gap in vision encoders for VLMs, providing robust structural and spatial information essential for document understanding and web agent tasks through innovative training strategies that leverage both unlabeled data and limited high-quality annotations.

Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

[279] ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen

Main category: cs.CV

TL;DR: ProCache is a training-free dynamic feature caching framework that accelerates Diffusion Transformers by using non-uniform caching intervals and selective computation to reduce error accumulation.

DetailsMotivation: Diffusion Transformers (DiTs) have high computational costs that hinder real-time deployment. Existing feature caching methods use uniform intervals that don't align with DiT's non-uniform temporal dynamics, and naive feature reuse with large intervals causes severe error accumulation.

Method: ProCache has two core components: (1) constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to model’s temporal characteristics; (2) selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead.
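
The control flow of a non-uniform caching schedule with selective recomputation can be sketched as follows; the model interface (return_features, cached_forward) is a placeholder for illustration, not a real API.

```python
def denoise_with_cache(model, x, timesteps, compute_steps):
    """Run the full network only at scheduled steps; reuse features elsewhere.

    `model(x, t, return_features=True)` and `model.cached_forward(x, t, cache)`
    are placeholder interfaces, not a real library API.
    """
    cache = None
    for t in timesteps:                      # timesteps: iterable of ints
        if cache is None or t in compute_steps:
            x, cache = model(x, t, return_features=True)   # full pass
        else:
            # Reuse cached features; a real system would also recompute deep
            # blocks and high-importance tokens here (selective computation).
            x = model.cached_forward(x, t, cache)
    return x
```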

Result: ProCache achieves up to 1.96x and 2.90x acceleration on PixArt-alpha and DiT with negligible quality degradation, significantly outperforming prior caching-based methods.

Conclusion: ProCache effectively addresses limitations of existing feature caching methods for DiTs by aligning caching patterns with temporal dynamics and mitigating error accumulation, enabling efficient real-time deployment of diffusion transformers.

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.

[280] GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting

Tiantian Li, Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Jun Zhang, Yan Wang

Main category: cs.CV

TL;DR: GaussianImage++ improves image representation and compression using limited Gaussian primitives with distortion-driven densification, context-aware filters, and efficient quantization, outperforming previous GS and INR methods.

DetailsMotivation: Implicit neural representations (INRs) require substantial training time and memory, while existing 2D Gaussian Splatting methods need excessive primitives for high visual fidelity. There's a need for efficient GS-based approaches that use limited primitives while maintaining performance.

Method: 1) Distortion-driven densification mechanism that progressively allocates Gaussian primitives according to signal intensity. 2) Context-aware Gaussian filters for each primitive to optimize based on varying image content. 3) Attribute-separated learnable scalar quantizers with quantization-aware training for efficient compression of primitive attributes.
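
Distortion-driven densification can be illustrated by sampling positions for new primitives in proportion to the local reconstruction error; a minimal sketch, with all shapes and the sampling rule assumed:

```python
import torch

def densify_positions(render, target, num_new):
    """Sample new primitive locations proportionally to reconstruction error.

    render, target: (C, H, W) rendered and ground-truth images.
    Returns (num_new, 2) xy pixel coordinates for new Gaussians.
    """
    err = (render - target).abs().mean(0)          # (H, W) error map
    prob = err.flatten()
    prob = prob / prob.sum()
    idx = torch.multinomial(prob, num_new, replacement=True)
    h, w = err.shape
    return torch.stack([idx % w, idx // w], dim=1).float()  # (x, y) pairs
```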

Result: Outperforms GaussianImage and INRs-based COIN in both representation and compression performance while maintaining real-time decoding and low memory usage.

Conclusion: GaussianImage++ successfully exploits the potential of GS-based approaches by using limited Gaussian primitives to achieve impressive representation and compression performance with practical advantages in decoding speed and memory efficiency.

Abstract: Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (e.g., GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.

[281] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le, Ambrose Ling, Zhuoran Wang, Md Amirul Islam, Zhixiang Chi, Yuanhao Yu

Main category: cs.CV

TL;DR: MIVA is a lightweight modular adapter system for diffusion models that enables precise motion control for image animation using minimal training data, overcoming memorization and generalization limitations of current approaches.

DetailsMotivation: Diffusion models struggle with image animation due to video's high dimensionality causing data scarcity, leading to memorization over prompt compliance, and poor generalization to novel motion patterns not in training data.

Method: Proposes Modular Image-to-Video Adapter (MIVA) - lightweight sub-networks attachable to pre-trained DMs, each capturing a single motion pattern, scalable via parallelization, trainable on ~10 samples with consumer-grade GPU.

Result: MIVA enables more precise motion control while maintaining or surpassing generation quality of models trained on much larger datasets, eliminating need for prompt engineering at inference.

Conclusion: MIVA addresses key limitations of diffusion models for image animation through modular, data-efficient adapters that provide precise motion control without extensive training data or prompt engineering.

Abstract: Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.

[282] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration

Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang

Main category: cs.CV

TL;DR: SlideChain is a blockchain-based framework that provides verifiable integrity for AI-generated educational content by anchoring semantic extractions from vision-language models on a tamper-evident blockchain.

DetailsMotivation: Current vision-language models produce semantic outputs that are difficult to verify, reproduce, and audit over time, especially in high-stakes STEM education where inconsistencies across models and environments undermine reliability.

Method: Developed SlideChain framework that extracts concepts and relational triples from medical imaging lecture slides using four state-of-the-art VLMs, creates structured provenance records, and anchors cryptographic hashes on a local EVM-compatible blockchain.
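
The anchoring step reduces to hashing a canonicalized provenance record and storing the digest on-chain; a minimal sketch of the hashing side, where the record fields are illustrative rather than the paper's schema:

```python
import hashlib
import json

def provenance_hash(record):
    """SHA-256 digest of a canonicalized (sorted-key, compact) JSON record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

record = {  # field names are illustrative, not the paper's schema
    "slide_id": "lecture03_slide_017",
    "model": "example-vlm-v1",
    "concepts": ["CT attenuation", "Hounsfield units"],
    "triples": [["CT attenuation", "measured_in", "Hounsfield units"]],
}
print(provenance_hash(record))  # digest that would be anchored on-chain
```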

Result: Analysis revealed significant cross-model discrepancies (low concept overlap, near-zero agreement in relational triples), while SlideChain demonstrated perfect tamper detection, deterministic reproducibility, and practical scalability with evaluated gas usage and throughput.

Conclusion: SlideChain provides a practical, scalable solution for trustworthy multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity in AI-assisted instructional systems.

Abstract: Modern vision–language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.

[283] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: SyncAnyone: A two-stage diffusion framework for high-quality AI video dubbing with accurate lip sync and visual fidelity.

DetailsMotivation: Existing mask-based training methods for video dubbing disrupt spatiotemporal context, causing instability in facial structure and background consistency while focusing on lip-sync accuracy.

Method: Two-stage learning framework: Stage 1 trains diffusion-based video transformer for masked mouth inpainting; Stage 2 uses mask-free tuning with synthetic data generation to address mask-induced artifacts.
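
A rough sketch of how such a pseudo-pairing loop could look (our reading of the pipeline; the pairing direction, function names, and data types are illustrative stand-ins, not the authors' code):

```python
import random

def build_pseudo_pairs(stage1_model, videos, audio_bank, rng):
    """Toy version of the Stage-2 data pipeline: pair each source clip with a
    randomly sampled audio, let the Stage-1 (mask-based) model synthesize a
    lip-synced clip, and use (synthetic clip, original clip) as a mask-free
    training pair so Stage 2 learns to remove mask-induced artifacts."""
    pairs = []
    for video in videos:
        audio = audio_bank[rng.randrange(len(audio_bank))]
        synthetic = stage1_model(video, audio)  # lip-synced, may carry artifacts
        pairs.append((synthetic, video))        # target = clean source clip
    return pairs

# Toy stand-ins so the sketch runs end-to-end.
toy_model = lambda v, a: f"{v}+lips({a})"
print(build_pseudo_pairs(toy_model, ["clip1", "clip2"], ["audio_a", "audio_b"], random.Random(0)))
```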

Result: Achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in wild lip-syncing scenarios.

Conclusion: SyncAnyone overcomes limitations of mask-based methods by combining accurate motion modeling with high visual fidelity through a novel two-stage approach.

Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.

[284] Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik

Main category: cs.CV

TL;DR: Video-GMAE is a self-supervised video representation learning method that encodes video frames as moving Gaussian splats, enabling emergent tracking capabilities and strong performance on video tasks.

DetailsMotivation: The paper aims to develop a self-supervised approach for video representation learning that leverages the natural 3D structure of scenes. The key insight is that 2D videos often represent consistent projections of dynamic 3D scenes, and representing videos as sets of moving Gaussians provides a reasonable inductive bias that can lead to emergent tracking capabilities.

Method: Video-GMAE encodes a sequence of images into a set of Gaussian splats that move over time. The architecture uses a masked autoencoder framework where the model learns to reconstruct masked portions of video frames by representing them as dynamic Gaussian distributions. This representation enforces the inductive bias that videos are projections of 3D scenes, and the Gaussian trajectories naturally capture object motion.
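
The zero-shot tracking readout can be pictured as follows: attach a query point to the nearest Gaussian in the first frame and follow that Gaussian's projected center through time. A toy sketch (our illustration; the real model carries full splat parameters, not just 2-D means):

```python
import numpy as np

def track_query(means_2d: np.ndarray, query_xy: np.ndarray) -> np.ndarray:
    """means_2d: (T, N, 2) image-plane centers of the N Gaussians per frame.
    Returns the (T, 2) trajectory of the Gaussian nearest the query in frame 0."""
    d = np.linalg.norm(means_2d[0] - query_xy, axis=-1)  # distance to each Gaussian
    idx = int(np.argmin(d))                              # attach query to nearest one
    return means_2d[:, idx]                              # its center traces the track

# Toy example: 3 Gaussians drifting right over 5 frames.
T = 5
base = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 20.0]])
means = base[None] + np.arange(T)[:, None, None] * np.array([2.0, 0.0])
print(track_query(means, np.array([48.0, 52.0])))  # follows the middle Gaussian
```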

Result: The method demonstrates that tracking emerges naturally during pretraining. Mapping Gaussian trajectories onto the image plane achieves zero-shot tracking performance comparable to state-of-the-art methods. With finetuning, the models achieve significant improvements: 34.6% on Kinetics and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.

Conclusion: Video-GMAE successfully demonstrates that representing videos as moving Gaussian splats provides an effective self-supervised learning framework that naturally leads to emergent tracking capabilities and achieves strong performance on video understanding tasks, outperforming existing self-supervised methods.

Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.

[285] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

ZhenQi Chen, TsaiChing Ni, YuanFu Yang

Main category: cs.CV

TL;DR: CritiFusion improves text-to-image diffusion models by adding semantic critique and frequency-domain refinement at inference time without retraining.

DetailsMotivation: Current text-to-image diffusion models achieve good visual quality but often fail to semantically align with complex prompts, needing better prompt fidelity.

Method: CritiFusion uses CritiCore (VLM+LLMs) for semantic feedback and SpecFusion for spectral domain merging of intermediate states to refine structure while preserving details.
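
The spectral merge can be illustrated with a plain FFT low-pass/high-pass split: take low frequencies (coarse structure) from one intermediate state and high frequencies (fine detail) from another. A minimal sketch, assuming a hard binary frequency mask (the paper's actual mask and schedule are not specified in this summary):

```python
import numpy as np

def spec_fuse(coarse: np.ndarray, detailed: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Merge two intermediate images/latents in the spectral domain: low
    frequencies come from `coarse`, high frequencies from `detailed`."""
    Fc = np.fft.fftshift(np.fft.fft2(coarse))
    Fd = np.fft.fftshift(np.fft.fft2(detailed))
    h, w = coarse.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low = radius <= cutoff                      # binary low-pass mask
    fused = np.where(low, Fc, Fd)
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

a = np.random.rand(64, 64)   # stands in for one intermediate generation state
b = np.random.rand(64, 64)   # stands in for another
print(spec_fuse(a, b).shape)  # (64, 64)
```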

Result: Improves text-image correspondence metrics, human preference scores, and aesthetic evaluations, matching state-of-the-art reward optimization methods.

Conclusion: The semantic critique and spectral alignment strategy effectively enhances detail, realism, and prompt fidelity as a plug-in refinement for existing diffusion models.

Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt’s intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.

[286] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin

Main category: cs.CV

TL;DR: This paper introduces an adaptive visual token pruning method for Large Multimodal Models that reduces inference costs by up to 80% while maintaining performance in long context, multi-image scenarios.

DetailsMotivation: The growing number of visual tokens in LMMs greatly increases inference costs, especially in scenarios with long context inputs containing multiple images. Existing pruning methods often overlook these multi-image settings, creating a need for specialized approaches.

Method: A two-stage adaptive pruning method: 1) Intra-image stage allocates content-aware token budgets per image and greedily selects most representative tokens. 2) Inter-image stage performs global diversity filtering to form a candidate pool, then applies Pareto selection balancing diversity with text alignment.
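
The inter-image stage's Pareto step can be sketched directly: keep every candidate token that is not dominated on both criteria. A minimal illustration, assuming precomputed per-token diversity and text-alignment scores (the scoring functions themselves are stand-ins):

```python
import numpy as np

def pareto_select(diversity: np.ndarray, alignment: np.ndarray) -> np.ndarray:
    """Return indices of tokens on the Pareto front of (diversity, alignment):
    a token is dropped only if some other token is >= on both scores and
    strictly > on at least one."""
    scores = np.stack([diversity, alignment], axis=1)
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        dominated = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.flatnonzero(keep)

div = np.array([0.9, 0.2, 0.6, 0.8])
ali = np.array([0.1, 0.9, 0.7, 0.05])
print(pareto_select(div, ali))  # [0 1 2]: token 3 is dominated by token 0
```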

Result: Extensive experiments show the approach can reduce up to 80% of visual tokens while maintaining performance in long context settings with multiple images.

Conclusion: The proposed adaptive pruning method effectively addresses the challenges of visual token pruning in long context, multi-image scenarios by decomposing redundancy into intra-image and inter-image components and dynamically allocating token budgets.

Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.

[287] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li

Main category: cs.CV

TL;DR: ColaVLA is a vision-language-action framework for autonomous driving that transfers VLM reasoning to a latent space and uses hierarchical parallel planning to generate trajectories efficiently with only two VLM forward passes.

DetailsMotivation: Current VLM-based planners face challenges: mismatch between discrete text reasoning and continuous control, high latency from autoregressive chain-of-thought decoding, and inefficient/non-causal planners limiting real-time deployment.

Method: Two main components: 1) Cognitive Latent Reasoner compresses scene understanding into decision-oriented meta-action embeddings via ego-adaptive selection with only two VLM forward passes. 2) Hierarchical Parallel Planner generates multi-scale, causality-consistent trajectories in a single forward pass.

Result: Achieves state-of-the-art performance on nuScenes benchmark in both open-loop and closed-loop settings with favorable efficiency and robustness.

Conclusion: ColaVLA preserves VLM generalization and interpretability while enabling efficient, accurate, and safe trajectory generation for autonomous driving.

Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

[288] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects

Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng

Main category: cs.CV

TL;DR: PoseStreamer: A robust multi-modal 6DoF pose estimation framework using event cameras for high-speed moving objects, featuring temporal consistency, 2D tracking priors, and geometric refinement.

DetailsMotivation: Standard RGB cameras struggle with motion blur in high-speed and low-light scenarios for 6DoF pose estimation. Event cameras offer high temporal resolution but current methods have suboptimal performance for fast-moving objects.

Method: Three core components: 1) Adaptive Pose Memory Queue for temporal consistency using historical orientation cues, 2) Object-centric 2D Tracker providing strong 2D priors to boost 3D center recall, 3) Ray Pose Filter for geometric refinement along camera rays. Also introduces MoCapCube6D dataset for benchmarking.
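
The memory queue's role is to turn recent orientation estimates into a temporal prior for the next frame. A toy sketch, assuming unit quaternions and a naive chordal mean (the paper's adaptive weighting is not specified in this summary):

```python
from collections import deque

import numpy as np

class PoseMemoryQueue:
    """Keeps the last k orientation estimates and returns a smoothed prior for
    the next frame (a simple stand-in for the paper's adaptive queue)."""

    def __init__(self, k: int = 5):
        self.buf = deque(maxlen=k)

    def push(self, quat: np.ndarray) -> None:
        self.buf.append(quat / np.linalg.norm(quat))

    def prior(self) -> np.ndarray:
        # Naive chordal mean of recent unit quaternions; assumes they all
        # lie in the same hemisphere (no sign-flip handling).
        q = np.mean(np.stack(list(self.buf)), axis=0)
        return q / np.linalg.norm(q)

queue = PoseMemoryQueue(k=3)
for q in ([1, 0, 0, 0.05], [1, 0, 0, 0.0], [1, 0, 0, -0.05]):
    queue.push(np.array(q, dtype=float))
print(queue.prior())  # approximately the identity rotation as the next-frame prior
```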

Result: PoseStreamer achieves superior accuracy in high-speed moving scenarios and exhibits strong generalizability as a template-free framework for unseen moving objects.

Conclusion: The proposed multi-modal framework effectively addresses 6DoF pose estimation challenges in high-speed scenarios using event cameras, with demonstrated robustness and generalizability.

Abstract: Six-degree-of-freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

[289] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu

Main category: cs.CV

TL;DR: YOLO-Master introduces instance-conditional adaptive computation for real-time object detection using Efficient Sparse Mixture-of-Experts to dynamically allocate resources based on scene complexity, achieving better accuracy and speed than YOLOv13-N.

DetailsMotivation: Existing YOLO-like real-time object detection models use static dense computation that applies uniform processing to all inputs, causing computational redundancy on simple scenes and insufficient processing for complex scenes, leading to both inefficiency and suboptimal performance.

Method: Proposes YOLO-Master framework with Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources per input based on scene complexity. Uses lightweight dynamic routing network with diversity enhancing objective to encourage complementary expertise among experts, activating only relevant experts during inference.
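
The core routing idea is standard sparse-MoE gating: score all experts, evaluate only the top-k, and mix their outputs. A minimal numpy sketch (the diversity-enhancing objective and instance-conditional budgets are omitted; shapes and names are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_moe(x, gate_w, experts, top_k=2):
    """x: (d,) feature; gate_w: (d, E) router weights; experts: E callables.
    Only the top-k experts run, which is what keeps cost low on simple scenes."""
    probs = softmax(x @ gate_w)
    top = np.argsort(probs)[-top_k:]           # activate the most relevant experts
    w = probs[top] / probs[top].sum()          # renormalize over the active set
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, E = 16, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(E)]
y = sparse_moe(rng.normal(size=d), rng.normal(size=(d, E)), experts)
print(y.shape)  # (16,)
```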

Result: Achieves 42.4% AP with 1.62ms latency on MS COCO, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Gains are most pronounced on challenging dense scenes while maintaining efficiency on typical inputs and real-time inference speed.

Conclusion: YOLO-Master successfully addresses the static computation limitation of existing RTOD methods through instance-conditional adaptive computation, demonstrating superior performance-efficiency trade-off, particularly for complex scenes, while maintaining real-time capabilities.

Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources, over-allocating to trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.

[290] Visual Language Hypothesis

Xiu Li

Main category: cs.CV

TL;DR: Visual understanding requires semantic language; visual space has fiber bundle structure where nuisance variation is in fibers and semantics in quotient base; semantic invariance needs discriminative targets, not just smooth deformation; architecture must support topology change via “expand and snap” process.

DetailsMotivation: To understand visual representation learning from structural/topological perspective, starting from hypothesis that visual understanding requires semantic language where many perceptual observations map to few discrete semantic states.

Method: Theoretical framework based on fiber bundle structure: visual observation space organized as fiber bundle (nuisance variation in fibers, semantics in quotient base space). Derives consequences: semantic quotient not submanifold, requires discriminative targets; architecture must support topology change via “expand and snap” process.
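
A standard toy example makes the non-submanifold claim concrete (our illustration, not drawn from the paper): let the rotation group act on the plane,

```latex
X = \mathbb{R}^2,\qquad G = SO(2)\ \text{(rotations)},\qquad
X/G \,\cong\, [0,\infty),\qquad [(x_1, x_2)] \mapsto \sqrt{x_1^2 + x_2^2}.
```

Every orbit (a circle of radius r) collapses to the single point r, so the quotient is a half-line with a boundary point at r = 0: it is not a manifold without boundary, and it cannot be produced by any homeomorphism of X, since the collapsing map is many-to-one. This mirrors the claim that semantic invariance needs a discriminative, non-homeomorphic target.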

Result: Two theoretical consequences: 1) Semantic invariance requires non-homeomorphic, discriminative targets (labels, cross-instance identification, multimodal alignment) not just smooth deformation. 2) Architecture must support topology change - “expand and snap” process where manifold is expanded to separate structure then collapsed to form discrete semantic regions.

Conclusion: Framework provides topological lens aligning with empirical regularities in large-scale discriminative/multimodal models and classical statistical learning principles. Results are interpretive rather than prescriptive, offering structural understanding of visual representation learning.

Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example supervision via labels, cross-instance identification, or multimodal alignment, that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.

[291] DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: DriveLaW is a unified paradigm that integrates video generation (world modeling) and motion planning for autonomous driving, achieving state-of-the-art results in both tasks through latent representation sharing and progressive training.

DetailsMotivation: Current autonomous driving world models operate in decoupled architectures where world prediction and motion planning remain separate processes, limiting their effectiveness in addressing real-world long-tail challenges. The authors aim to bridge this gap by creating a truly unified approach.

Method: DriveLaW consists of two core components: DriveLaW-Video (a powerful world model for high-fidelity future forecasting with expressive latent representations) and DriveLaW-Act (a diffusion planner that generates consistent trajectories from DriveLaW-Video’s latent representations). Both components are optimized using a three-stage progressive training strategy.

Result: DriveLaW achieves new state-of-the-art results: 1) Video prediction surpasses the best-performing work by 33.3% in FID and 1.8% in FVD; 2) It sets a new record on the NAVSIM planning benchmark for autonomous driving motion planning.

Conclusion: The unified paradigm of DriveLaW demonstrates that directly injecting video generator latent representations into the planner ensures inherent consistency between high-fidelity future generation and reliable trajectory planning, advancing both world modeling and motion planning capabilities for autonomous driving.

Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

[292] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Kang Du, Yirui Guan, Zeyu Wang

Main category: cs.CV

TL;DR: IDT is a transformer-based feed-forward framework for multi-view intrinsic image decomposition that produces view-consistent diffuse reflectance, diffuse shading, and specular shading in a single forward pass.

DetailsMotivation: RGB images entangle material properties, illumination, and view-dependent effects, making intrinsic decomposition fundamental for visual understanding. While recent diffusion-based methods work for single-view, extending to multi-view settings leads to severe view inconsistency problems.

Method: IDT uses transformer-based attention to jointly reason over multiple input images, adopting a physically grounded image formation model that explicitly decomposes images into three components: diffuse reflectance, diffuse shading, and specular shading. This separates Lambertian and non-Lambertian light transport without iterative generative sampling.

Result: Experiments on synthetic and real-world datasets show IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, better-isolated specular components, and substantially improved multi-view consistency compared to prior methods.

Conclusion: IDT provides an effective transformer-based solution for multi-view intrinsic decomposition that produces interpretable, controllable, and view-consistent decompositions of material and illumination effects across multiple views.

Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose the Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.

cs.AI

[293] The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models

Rahul Baxi

Main category: cs.AI

TL;DR: The paper introduces DDFT, a new evaluation protocol that measures epistemic robustness - how well language models maintain factual accuracy under semantic compression and adversarial fabrication, revealing that robustness is orthogonal to conventional metrics like model size.

DetailsMotivation: Current language model evaluations (like MMLU, TruthfulQA) measure knowledge under ideal conditions but fail to assess robustness under realistic stress - they can't distinguish models that lack knowledge from those whose verification mechanisms collapse when information degrades or under adversarial probing.

Method: Introduces the Drill-Down and Fabricate Test (DDFT) protocol that measures epistemic robustness through progressive semantic compression and adversarial fabrication. Proposes a two-system cognitive model: Semantic System (generates fluent text) and Epistemic Verifier (validates factual accuracy). Evaluated 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations).
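
The protocol's turn structure can be summarized as a loop over compression levels with a fabrication probe at each step. A skeleton with toy stubs so it runs end-to-end (prompts, scoring, and the compression/fabrication operators are hypothetical stand-ins, not the published protocol):

```python
def ddft_probe(model, fact, compress, fabricate, levels=5):
    """Drill down: degrade the fact level by level; at each level, also probe
    with a fabricated variant and check whether the model rejects it."""
    results = []
    for level in range(levels):
        degraded = compress(fact, level)                    # drill-down step
        answer = model(f"Is this accurate? {degraded}")
        flagged = model(f"Is this accurate? {fabricate(degraded)}")  # adversarial probe
        results.append({"level": level,
                        "kept_fact": "yes" in answer.lower(),
                        "caught_fake": "no" in flagged.lower()})
    return results

# Toy stubs standing in for an LLM and for the two operators.
toy_model = lambda p: "no" if "fabricated" in p else "yes"
toy_compress = lambda f, lvl: f[: max(10, len(f) - 8 * lvl)]
toy_fabricate = lambda f: f + " (fabricated)"
print(ddft_probe(toy_model, "Water boils at 100 C at sea level.", toy_compress, toy_fabricate))
```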

Result: Epistemic robustness is orthogonal to conventional design paradigms: neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) significantly predicts robustness. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007). Flagship models exhibit brittleness despite scale, while smaller models can achieve robust performance, challenging assumptions about model size and reliability.

Conclusion: The DDFT framework provides both theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications. Robustness emerges from training methodology and verification mechanisms distinct from current approaches, with error detection capability being the critical bottleneck.

Abstract: Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. Static benchmarks like MMLU and TruthfulQA cannot distinguish a model that lacks knowledge from one whose verification mechanisms collapse when information degrades or adversaries probe for weaknesses. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures epistemic robustness: a model’s ability to maintain factual accuracy under progressive semantic compression and adversarial fabrication. We propose a two-system cognitive model comprising a Semantic System that generates fluent text and an Epistemic Verifier that validates factual accuracy. Our findings, based on evaluating 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations), reveal that epistemic robustness is orthogonal to conventional design paradigms. Neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) significantly predicts robustness, suggesting it emerges from training methodology and verification mechanisms distinct from current approaches. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007), indicating this is the critical bottleneck. We find that flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance, challenging assumptions about the relationship between model size and reliability. The DDFT framework provides both theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications.

[294] CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, Gerbrand Ceder

Main category: cs.AI

TL;DR: CASCADE is a self-evolving LLM agent framework that transitions from tool use to skill acquisition, enabling agents to master complex scientific tools and accumulate executable skills through continuous learning and self-reflection.

DetailsMotivation: Current LLM agents rely on predefined tools or brittle tool generation, limiting their capability and adaptability to complex scientific tasks. There's a need for agents that can autonomously learn and master external tools rather than just using them.

Method: CASCADE enables agents to develop two meta-skills: 1) continuous learning via web search and code extraction, and 2) self-reflection via introspection and knowledge graph exploration. The framework allows agents to accumulate executable skills that can be shared across agents and scientists.

Result: On SciSkillBench (116 materials science and chemistry tasks), CASCADE achieves 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. The framework demonstrates real-world applications in computational analysis, autonomous lab experiments, and selective paper reproduction.

Conclusion: CASCADE represents a transition from “LLM + tool use” to “LLM + skill acquisition,” enabling scalable AI-assisted scientific research through human-agent collaboration, memory consolidation, and shareable executable skills.

Abstract: Large language model (LLM) agents currently depend on predefined tools or brittle tool generation, constraining their capability and adaptability to complex scientific tasks. We introduce CASCADE, a self-evolving agentic framework representing an early instantiation of the transition from “LLM + tool use” to “LLM + skill acquisition”. CASCADE enables agents to master complex external tools and codify knowledge through two meta-skills: continuous learning via web search and code extraction, and self-reflection via introspection and knowledge graph exploration, among others. We evaluate CASCADE on SciSkillBench, a benchmark of 116 materials science and chemistry research tasks. CASCADE achieves a 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. We further demonstrate real-world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers. Along with human-agent collaboration and memory consolidation, CASCADE accumulates executable skills that can be shared across agents and scientists, moving toward scalable AI-assisted scientific research.

[295] SCP: Accelerating Discovery with a Global Web of Autonomous Scientific Agents

Yankai Jiang, Wenjie Lou, Lilong Wang, Zhenyu Tang, Shiyang Feng, Jiaxuan Lu, Haoran Sun, Yaning Pan, Shuang Gu, Haoyang Su, Feng Liu, Wangxu Wei, Pan Tan, Dongzhan Zhou, Fenghua Ling, Cheng Tan, Bo Zhang, Xiaosong Wang, Lei Bai, Bowen Zhou

Main category: cs.AI

TL;DR: SCP (Science Context Protocol) is an open-source standard for autonomous scientific agents that provides unified resource integration and orchestrated experiment lifecycle management to accelerate discovery.

DetailsMotivation: To accelerate scientific discovery by enabling a global network of autonomous scientific agents that can seamlessly collaborate across disparate platforms and institutional boundaries, while reducing integration overhead and enhancing reproducibility.

Method: SCP is built on two pillars: (1) Unified Resource Integration - a universal specification for describing and invoking scientific resources (tools, models, datasets, instruments), and (2) Orchestrated Experiment Lifecycle Management - a secure service architecture with centralized SCP Hub and federated SCP Servers that manages complete experiment lifecycle (registration, planning, execution, monitoring, archival) with fine-grained authentication and workflow orchestration.
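
What a unified resource descriptor might look like in code (hypothetical field names; the protocol's actual specification is not reproduced in this summary):

```python
from dataclasses import dataclass, field

@dataclass
class SCPResource:
    """Illustrative shape of a 'describe and invoke' descriptor: enough
    metadata for an agent to discover, call, and compose a resource."""
    resource_id: str
    kind: str                     # "tool" | "model" | "dataset" | "instrument"
    endpoint: str                 # where an agent can invoke it
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    auth_scope: str = "read"      # fine-grained authorization hint

xrd = SCPResource(
    resource_id="lab42/xrd-analyzer",            # placeholder identifier
    kind="tool",
    endpoint="https://hub.example/scp/invoke",   # placeholder URL
    inputs={"pattern": "csv"},
    outputs={"phases": "json"},
)
print(xrd.kind, xrd.endpoint)
```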

Result: The authors have constructed a scientific discovery platform based on SCP offering over 1,600 tool resources. SCP facilitates secure, large-scale collaboration between heterogeneous AI systems and human researchers across diverse use cases while significantly reducing integration overhead and enhancing reproducibility.

Conclusion: SCP establishes essential infrastructure for scalable, multi-institution, agent-driven science by standardizing scientific context and tool orchestration at the protocol level, enabling autonomous scientific agents to accelerate discovery through global collaboration.

Abstract: We introduce SCP: the Science Context Protocol, an open-source standard designed to accelerate discovery by enabling a global network of autonomous scientific agents. SCP is built on two foundational pillars: (1) Unified Resource Integration: At its core, SCP provides a universal specification for describing and invoking scientific resources, spanning software tools, models, datasets, and physical instruments. This protocol-level standardization enables AI agents and applications to discover, call, and compose capabilities seamlessly across disparate platforms and institutional boundaries. (2) Orchestrated Experiment Lifecycle Management: SCP complements the protocol with a secure service architecture, which comprises a centralized SCP Hub and federated SCP Servers. This architecture manages the complete experiment lifecycle (registration, planning, execution, monitoring, and archival), enforces fine-grained authentication and authorization, and orchestrates traceable, end-to-end workflows that bridge computational and physical laboratories. Based on SCP, we have constructed a scientific discovery platform that offers researchers and agents a large-scale ecosystem of more than 1,600 tool resources. Across diverse use cases, SCP facilitates secure, large-scale collaboration between heterogeneous AI systems and human researchers while significantly reducing integration overhead and enhancing reproducibility. By standardizing scientific context and tool orchestration at the protocol level, SCP establishes essential infrastructure for scalable, multi-institution, agent-driven science.

[296] A Proof-of-Concept for Explainable Disease Diagnosis Using Large Language Models and Answer Set Programming

Ioanna Gemou, Evangelos Lamprou

Main category: cs.AI

TL;DR: McCoy framework combines LLMs with Answer Set Programming for automated disease diagnosis, translating medical literature to ASP code for interpretable predictions.

DetailsMotivation: Symbolic AI in healthcare requires significant effort to build knowledge bases, limiting adoption. There's a need for accurate disease prediction for timely intervention and effective treatment.

Method: McCoy uses LLMs to translate medical literature into ASP (Answer Set Programming) code, combines this with patient data, and processes it using an ASP solver to generate final diagnoses.
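
The ASP side of the pipeline is easy to demonstrate with the clingo Python API: the program below stands in for the kind of rules an LLM might emit from medical literature (rules and symptom names are illustrative, not from the paper):

```python
import clingo  # pip install clingo

# Toy diagnosis program: diseases fire when their symptom sets are present.
program = """
disease(flu)    :- symptom(fever), symptom(cough).
disease(angina) :- symptom(chest_pain), symptom(fatigue).
symptom(fever). symptom(cough).
#show disease/1.
"""

ctl = clingo.Control()
ctl.add("base", [], program)          # load the LLM-generated rules + patient facts
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Diagnosis:", m))  # -> Diagnosis: disease(flu)
```

Because the answer set is derived by the solver from explicit rules, the firing rule itself serves as the interpretable rationale for the prediction.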

Result: Preliminary results show strong performance on small-scale disease diagnosis tasks, creating a robust and interpretable prediction framework.

Conclusion: The integration of LLMs with ASP overcomes knowledge base construction barriers, yielding an interpretable diagnostic framework that leverages strengths of both symbolic and neural approaches.

Abstract: Accurate disease prediction is vital for timely intervention, effective treatment, and reducing medical complications. While symbolic AI has been applied in healthcare, its adoption remains limited due to the effort required for constructing high-quality knowledge bases. This work introduces McCoy, a framework that combines Large Language Models (LLMs) with Answer Set Programming (ASP) to overcome this barrier. McCoy orchestrates an LLM to translate medical literature into ASP code, combines it with patient data, and processes it using an ASP solver to arrive at the final diagnosis. This integration yields a robust, interpretable prediction framework that leverages the strengths of both paradigms. Preliminary results show McCoy has strong performance on small-scale disease diagnosis tasks.

[297] SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury

Main category: cs.AI

TL;DR: SPARK is a multi-agent LLM framework for personalized search that uses specialized persona-based agents with coordinated retrieval and knowledge-sharing to model users’ evolving, multi-dimensional information needs.

DetailsMotivation: Traditional search systems struggle with static user profiles and monolithic retrieval pipelines that cannot capture users' evolving, multi-dimensional information needs. There's a need for systems that can model the complexity, fluidity, and context sensitivity of human information-seeking behavior.

Method: SPARK uses coordinated persona-based LLM agents with formalized persona spaces (role, expertise, task context, domain). A Persona Coordinator dynamically interprets queries to activate relevant specialized agents. Each agent has independent retrieval-augmented generation with dedicated memory stores and context-aware reasoning. Inter-agent collaboration uses structured communication protocols including shared memory repositories, iterative debate, and relay-style knowledge transfer.
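
The Persona Coordinator's job reduces to scoring personas against an incoming query and activating the best matches. A toy sketch using keyword overlap (a real system would presumably use embeddings or an LLM; persona definitions are illustrative):

```python
def coordinate(query: str, personas: dict, top_k: int = 2):
    """Score each persona by keyword overlap with the query and
    return the names of the top-k personas to activate."""
    q = set(query.lower().split())
    scored = sorted(personas.items(), key=lambda kv: len(q & kv[1]), reverse=True)
    return [name for name, _ in scored[:top_k]]

personas = {
    "clinician":   {"diagnosis", "symptoms", "treatment", "drug"},
    "lawyer":      {"contract", "liability", "clause", "statute"},
    "ml_engineer": {"model", "training", "dataset", "inference"},
}
print(coordinate("which drug interactions matter for this diagnosis", personas))
# -> ['clinician', ...]: the activated agents then run their own RAG processes
```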

Result: The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution. It incorporates adaptive learning mechanisms for continuous persona refinement and provides insights for next-generation search systems.

Conclusion: SPARK demonstrates how emergent personalization properties can arise from distributed agent behaviors governed by minimal coordination rules, offering a promising approach for capturing the complexity of human information-seeking behavior through fine-grained agent specialization and cooperative retrieval.

Abstract: Personalized search demands the ability to model users’ evolving, multi-dimensional information needs; a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.

[298] ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment

Natchaya Temyingyong, Daman Jain, Neeraj Kumarsahu, Prabhat Kumar, Rachata Phondi, Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn

Main category: cs.AI

TL;DR: ROAD is a novel framework for automatic prompt optimization that uses multi-agent debugging instead of labeled datasets, achieving significant performance improvements with minimal iterations.

DetailsMotivation: Current APO methods require large labeled datasets which are unavailable during cold start development, while engineers only have messy production logs and evolving failure modes.

Method: ROAD treats optimization as dynamic debugging with a multi-agent architecture: Analyzer for root-cause analysis, Optimizer for pattern aggregation, and Coach for strategy integration, converting failure logs into Decision Tree Protocols.

Result: ROAD achieved 5.6% increase in success rate (73.6% to 79.2%) and 3.8% increase in search accuracy within three iterations, plus 19% improvement on complex retail reasoning tasks.

Conclusion: Mimicking human engineering loops of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.

Abstract: Automatic Prompt Optimization (APO) has emerged as a critical technique for enhancing Large Language Model (LLM) performance, yet current state-of-the-art methods typically rely on large, labeled gold-standard development sets to compute fitness scores for evolutionary or Reinforcement Learning (RL) approaches. In real-world software engineering, however, such curated datasets are rarely available during the initial cold start of agent development, where engineers instead face messy production logs and evolving failure modes. We present ROAD (Reflective Optimization via Automated Debugging), a novel framework that bypasses the need for refined datasets by treating optimization as a dynamic debugging investigation rather than a stochastic search. Unlike traditional mutation strategies, ROAD utilizes a specialized multi-agent architecture, comprising an Analyzer for root-cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, to convert unstructured failure logs into robust, structured Decision Tree Protocols. We evaluated ROAD across both a standardized academic benchmark and a live production Knowledge Management engine. Experimental results demonstrate that ROAD is highly sample-efficient, achieving a 5.6 percent increase in success rate (73.6 percent to 79.2 percent) and a 3.8 percent increase in search accuracy within just three automated iterations. Furthermore, on complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19 percent relative to the baseline. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.

[299] LoongFlow: Directed Evolutionary Search via a Cognitive Plan-Execute-Summarize Paradigm

Chunhui Wan, Xunan Dai, Zhuo Wang, Minglei Li, Yanpeng Wang, Yinan Mao, Yu Lan, Zhiwen Xiao

Main category: cs.AI

TL;DR: LoongFlow is a self-evolving agent framework that integrates LLMs into evolutionary search via a cognitive “Plan-Execute-Summarize” paradigm, achieving state-of-the-art solution quality with 60% better efficiency than existing methods.

DetailsMotivation: Traditional evolutionary approaches for LLMs lack structured reasoning, leading to premature convergence and inefficient exploration in high-dimensional code spaces. There's a need for more intelligent, reasoning-heavy evolutionary methods.

Method: LoongFlow integrates LLMs into a cognitive “Plan-Execute-Summarize” (PES) paradigm, replacing blind mutation with reasoning-heavy processes. It uses a hybrid evolutionary memory system combining Multi-Island models with MAP-Elites and adaptive Boltzmann selection to balance exploration-exploitation trade-offs and maintain behavioral diversity.
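
Boltzmann selection itself is compact: sample parents with probability proportional to exp(fitness / T), so high temperature explores broadly and low temperature exploits the current best. A minimal sketch (the framework's temperature-adaptation rule is not specified in this summary):

```python
import numpy as np

def boltzmann_select(fitness: np.ndarray, temperature: float, rng) -> int:
    """Softmax (Boltzmann) parent selection over a population's fitness."""
    z = fitness / max(temperature, 1e-8)
    p = np.exp(z - z.max())        # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(fitness), p=p))

rng = np.random.default_rng(0)
fitness = np.array([0.2, 0.5, 0.9])
hot = [boltzmann_select(fitness, 5.0, rng) for _ in range(1000)]    # near-uniform
cold = [boltzmann_select(fitness, 0.05, rng) for _ in range(1000)]  # mostly index 2
print(np.bincount(hot, minlength=3), np.bincount(cold, minlength=3))
```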

Result: LoongFlow outperforms leading baselines (OpenEvolve, ShinkaEvolve) by up to 60% in evolutionary efficiency on AlphaEvolve benchmark and Kaggle competitions, discovering superior solutions with significantly reduced computational costs.

Conclusion: LoongFlow represents a substantial advancement in autonomous scientific discovery, enabling generation of expert-level solutions with reduced computational overhead by synergizing LLM reasoning with evolutionary optimization.

Abstract: The transition from static Large Language Models (LLMs) to self-improving agents is hindered by the lack of structured reasoning in traditional evolutionary approaches. Existing methods often struggle with premature convergence and inefficient exploration in high-dimensional code spaces. To address these challenges, we introduce LoongFlow, a self-evolving agent framework that achieves state-of-the-art solution quality with significantly reduced computational costs. Unlike “blind” mutation operators, LoongFlow integrates LLMs into a cognitive “Plan-Execute-Summarize” (PES) paradigm, effectively mapping the evolutionary search to a reasoning-heavy process. To sustain long-term architectural coherence, we incorporate a hybrid evolutionary memory system. By synergizing Multi-Island models with MAP-Elites and adaptive Boltzmann selection, this system theoretically balances the exploration-exploitation trade-off, maintaining diverse behavioral niches to prevent optimization stagnation. We instantiate LoongFlow with a General Agent for algorithmic discovery and an ML Agent for pipeline optimization. Extensive evaluations on the AlphaEvolve benchmark and Kaggle competitions demonstrate that LoongFlow outperforms leading baselines (e.g., OpenEvolve, ShinkaEvolve) by up to 60% in evolutionary efficiency while discovering superior solutions. LoongFlow marks a substantial step forward in autonomous scientific discovery, enabling the generation of expert-level solutions with reduced computational overhead.

[300] CogRec: A Cognitive Recommender Agent Fusing Large Language Models and Soar for Explainable Recommendation

Jiaxin Hu, Tao Wang, Bingsan Yang, Hongrun Wang

Main category: cs.AI

TL;DR: CogRec is a cognitive recommender agent that combines LLMs with the Soar cognitive architecture to address LLMs’ black-box nature, hallucination issues, and limited online learning, while leveraging Soar’s structured reasoning and using LLMs for knowledge initialization.

DetailsMotivation: LLMs have shown promise for recommendation systems but suffer from black-box characteristics, knowledge hallucination, and limited online learning capacity, compromising trustworthiness and adaptability. Meanwhile, cognitive architectures like Soar offer structured, interpretable reasoning but require laborious knowledge acquisition.

Method: CogRec synergizes LLMs with Soar cognitive architecture: uses Soar as symbolic reasoning engine, LLM for knowledge initialization to populate working memory with production rules. Operates on Perception-Cognition-Action cycle, dynamically queries LLM when encountering impasses, transforms LLM solutions into symbolic production rules via Soar’s chunking mechanism for online learning.
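
The impasse-then-chunk loop can be sketched in a few lines: try symbolic rules first, fall back to the LLM on a miss, and cache the LLM's resolution as a new rule. A toy illustration (rule format and LLM call are stand-ins, not Soar's actual production syntax):

```python
def recommend(user_state, rules, llm):
    """One PCA-cycle step: fire a matching rule if any; otherwise resolve
    the impasse via the LLM and 'chunk' the answer into a new rule."""
    for condition, action in rules:
        if condition(user_state):
            return action, rules                  # rule fired, no impasse
    action = llm(user_state)                      # impasse: query the LLM once
    # Chunking: cache the resolution as a new production rule for next time.
    learned = (lambda s, seen=dict(user_state): s == seen, action)
    return action, rules + [learned]

rules = [(lambda s: s.get("likes") == "sci-fi", "recommend: Dune")]
toy_llm = lambda s: f"recommend: something for {s['likes']}"
action, rules = recommend({"likes": "jazz"}, rules, toy_llm)
print(action, len(rules))  # falls back to the LLM, then the rule base grows to 2
```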

Result: Extensive evaluations on three public datasets show CogRec demonstrates significant advantages in recommendation accuracy, explainability, and efficacy in addressing the long-tail problem.

Conclusion: The proposed CogRec agent successfully combines LLMs’ knowledge capabilities with Soar’s structured reasoning to create a trustworthy, adaptable recommender system with interpretable rationales and continuous learning capacity.

Abstract: Large Language Models (LLMs) have demonstrated a remarkable capacity in understanding user preferences for recommendation systems. However, they are constrained by several critical challenges, including their inherent “Black-Box” characteristics, susceptibility to knowledge hallucination, and limited online learning capacity. These factors compromise their trustworthiness and adaptability. Conversely, cognitive architectures such as Soar offer structured and interpretable reasoning processes, yet their knowledge acquisition is notoriously laborious. To address these complementary challenges, we propose a novel cognitive recommender agent called CogRec which synergizes the strengths of LLMs with the Soar cognitive architecture. CogRec leverages Soar as its core symbolic reasoning engine and uses an LLM for knowledge initialization to populate its working memory with production rules. The agent operates on a Perception-Cognition-Action (PCA) cycle. Upon encountering an impasse, it dynamically queries the LLM to obtain a reasoned solution. This solution is subsequently transformed into a new symbolic production rule via Soar’s chunking mechanism, thereby enabling robust online learning. This learning paradigm allows the agent to continuously evolve its knowledge base and furnish highly interpretable rationales for its recommendations. Extensive evaluations conducted on three public datasets show that CogRec demonstrates significant advantages in recommendation accuracy, explainability, and its efficacy in addressing the long-tail problem.

[301] One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

Zijian Zhao, Sen Li

Main category: cs.AI

TL;DR: Two novel MARL methods (GRPO and OSPO) for AV order dispatch that bypass value function estimation by leveraging fleet homogeneity, achieving better performance than conventional approaches.

DetailsMotivation: Conventional MARL-based order dispatch approaches for AV fleets heavily rely on accurate value function estimation, which becomes problematic in large-scale, highly uncertain ride-sharing environments with many vehicles, passengers, and orders.

Method: 1) GRPO: Adapts Group Relative Policy Optimization to order dispatch by replacing PPO baseline with group average reward-to-go, eliminating critic estimation errors. 2) OSPO: One-Step Policy Optimization that trains optimal policy using only one-step group rewards under homogeneous fleet assumption.
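
The baseline replacement at the heart of both methods is essentially one line: each agent's advantage is its reward relative to the group, so no critic network is trained. A minimal sketch for a homogeneous fleet (normalizing by the group standard deviation is a common GRPO convention, assumed here):

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray) -> np.ndarray:
    """Critic-free advantages: reward minus the group mean, optionally scaled
    by the group std. `rewards` stands for per-agent (one-step or reward-to-go)
    returns across the homogeneous fleet."""
    adv = rewards - rewards.mean()
    std = rewards.std()
    return adv / std if std > 0 else adv

# Rewards earned by 5 identical AVs for their dispatch decisions this step.
rewards = np.array([3.0, 1.0, 4.0, 1.5, 3.5])
print(group_relative_advantage(rewards))  # positive = better than fleet average
```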

Result: Both GRPO and OSPO achieve promising performance across all scenarios on real-world ride-hailing data, efficiently optimizing pickup times and number of served orders using simple MLP networks. OSPO outperforms GRPO in all scenarios by eliminating bias from GRPO’s bounded time horizon.

Conclusion: The proposed methods successfully address the limitations of conventional MARL approaches by bypassing value function estimation through fleet homogeneity, with OSPO showing superior performance by using only one-step group rewards for policy optimization.

Abstract: Order dispatch is a critical task in ride-sharing systems with Autonomous Vehicles (AVs), directly influencing efficiency and profits. Recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to this problem by decomposing the large state and action spaces among individual agents, effectively addressing the Curse of Dimensionality (CoD) in the transportation market, which is caused by the substantial number of vehicles, passengers, and orders. However, conventional MARL-based approaches heavily rely on accurate estimation of the value function, which becomes problematic in large-scale, highly uncertain environments. To address this issue, we propose two novel methods that bypass value function estimation, leveraging the homogeneous property of AV fleets. First, we draw an analogy between AV fleets and groups in Group Relative Policy Optimization (GRPO), adapting it to the order dispatch task. By replacing the Proximal Policy Optimization (PPO) baseline with the group average reward-to-go, GRPO eliminates critic estimation errors and reduces training bias. Inspired by this baseline replacement, we further propose One-Step Policy Optimization (OSPO), demonstrating that the optimal policy can be trained using only one-step group rewards under a homogeneous fleet. Experiments on a real-world ride-hailing dataset show that both GRPO and OSPO achieve promising performance across all scenarios, efficiently optimizing pickup times and the number of served orders using simple Multilayer Perceptron (MLP) networks. Furthermore, OSPO outperforms GRPO in all scenarios, attributed to its elimination of bias caused by the bounded time horizon of GRPO. Our code, trained models, and processed data are provided at https://github.com/RS2002/OSPO.

[302] Graph-Based Exploration for ARC-AGI-3 Interactive Reasoning Tasks

Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley

Main category: cs.AI

TL;DR: Training-free graph-based approach for ARC-AGI-3 benchmark that combines vision processing with systematic state-space exploration using graph representations, outperforming state-of-the-art LLMs.

DetailsMotivation: Current state-of-the-art LLMs fail to reliably solve interactive reasoning tasks in ARC-AGI-3, which requires hypothesis formation, testing, and tracking discovered mechanics in game-like environments with increasing complexity.

Method: Combines vision-based frame processing (segmenting visual frames into meaningful components) with systematic state-space exploration using graph-structured representations. Maintains directed graph of explored states and transitions, prioritizes actions based on visual salience and shortest path to untested state-action pairs.

Result: Solves median of 30 out of 52 levels across six games on ARC-AGI-3 Preview Challenge, ranking 3rd on private leaderboard, substantially outperforming frontier LLM-based agents.

Conclusion: Explicit graph-structured exploration without learning serves as strong baseline for interactive reasoning, highlighting importance of systematic state tracking and action prioritization in sparse-feedback environments where current LLMs fail.

Abstract: We present a training-free graph-based approach for solving interactive reasoning tasks in the ARC-AGI-3 benchmark. ARC-AGI-3 comprises game-like tasks where agents must infer task mechanics through limited interactions, and adapt to increasing complexity as levels progress. Success requires forming hypotheses, testing them, and tracking discovered mechanics. The benchmark has revealed that state-of-the-art LLMs are currently incapable of reliably solving these tasks. Our method combines vision-based frame processing with systematic state-space exploration using graph-structured representations. It segments visual frames into meaningful components, prioritizes actions based on visual salience, and maintains a directed graph of explored states and transitions. By tracking visited states and tested actions, the agent prioritizes actions that provide the shortest path to untested state-action pairs. On the ARC-AGI-3 Preview Challenge, this structured exploration strategy solves a median of 30 out of 52 levels across six games and ranks 3rd on the private leaderboard, substantially outperforming frontier LLM-based agents. These results demonstrate that explicit graph-structured exploration, even without learning, can serve as a strong baseline for interactive reasoning and underscore the importance of systematic state tracking and action prioritization in sparse-feedback environments where current LLMs fail to capture task dynamics. The code is open source and available at https://github.com/dolphin-in-a-coma/arc-agi-3-just-explore.
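
A minimal sketch of the exploration loop's core data structure, assuming hashable states and a fixed action set; the class name and the fallback behavior are illustrative, not taken from the released code.

```python
from collections import deque

class ExplorationGraph:
    """Directed graph over observed states; steers toward untested state-action pairs."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.edges = {}    # (state, action) -> next_state
        self.tested = {}   # state -> set of actions already tried there

    def record(self, state, action, next_state):
        self.edges[(state, action)] = next_state
        self.tested.setdefault(state, set()).add(action)

    def next_action(self, state):
        """Pick an untested action here; otherwise BFS toward the nearest
        state with untested actions and take the first step of that path."""
        untested = set(self.actions) - self.tested.get(state, set())
        if untested:
            return untested.pop()
        queue, seen = deque([(state, None)]), {state}
        while queue:
            s, first = queue.popleft()
            for a in self.actions:
                nxt = self.edges.get((s, a))
                if nxt is None or nxt in seen:
                    continue
                step = first if first is not None else a
                if set(self.actions) - self.tested.get(nxt, set()):
                    return step
                seen.add(nxt)
                queue.append((nxt, step))
        return self.actions[0]  # everything explored: fall back to any action
```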

[303] Deep Reinforcement Learning for Solving the Fleet Size and Mix Vehicle Routing Problem

Pengfu Wan, Jiawei Chen, Gangyan Xu

Main category: cs.AI

TL;DR: A deep reinforcement learning approach for solving Fleet Size and Mix Vehicle Routing Problem (FSMVRP) that simultaneously optimizes fleet composition and routing decisions using a novel policy network called FRIPN.

DetailsMotivation: FSMVRP is complex and challenging for large-scale, time-constrained real-world applications like vehicle rental and on-demand logistics. Traditional methods struggle with computational efficiency and scalability.

Method: Formulate FSMVRP as Markov Decision Process (MDP) and develop FRIPN policy network that integrates fleet composition and routing decisions. Use specialized input embeddings including remaining graph embedding for vehicle employment decisions.

Result: Method generates near-optimal solutions within seconds, shows computational efficiency and scalability advantages, especially in large-scale and time-constrained scenarios on both random and benchmark datasets.

Conclusion: The DRL-based approach has practical application potential and provides inspiration for extending similar techniques to other VRP variants.

Abstract: The Fleet Size and Mix Vehicle Routing Problem (FSMVRP) is a prominent variant of the Vehicle Routing Problem (VRP), extensively studied in operations research and computational science. FSMVRP requires simultaneous decisions on fleet composition and routing, making it highly applicable to real-world scenarios such as short-term vehicle rental and on-demand logistics. However, these requirements also increase the complexity of FSMVRP, posing significant challenges, particularly in large-scale and time-constrained environments. In this paper, we propose a deep reinforcement learning (DRL)-based approach for solving FSMVRP, capable of generating near-optimal solutions within a few seconds. Specifically, we formulate the problem as a Markov Decision Process (MDP) and develop a novel policy network, termed FRIPN, that seamlessly integrates fleet composition and routing decisions. Our method incorporates specialized input embeddings designed for distinct decision objectives, including a remaining graph embedding to facilitate effective vehicle employment decisions. Comprehensive experiments are conducted on both randomly generated instances and benchmark datasets. The experimental results demonstrate that our method exhibits notable advantages in terms of computational efficiency and scalability, particularly in large-scale and time-constrained scenarios. These strengths highlight the potential of our approach for practical applications and provide valuable inspiration for extending DRL-based techniques to other variants of VRP.

[304] Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework

Varun Kumar, George Em Karniadakis

Main category: cs.AI

TL;DR: A multi-agent AI framework for engineering design that uses specialized agents (Graph Ontologist, Design Engineer, Systems Engineer) to collaboratively generate and refine designs through structured knowledge graphs and iterative feedback loops, demonstrated on NACA airfoil optimization.

DetailsMotivation: Traditional engineering design processes are resource-intensive, inefficient, and require multi-domain expertise, leading to complex collaborations and iterative refinements that need improvement.

Method: A three-agent framework: 1) Graph Ontologist uses LLM to build domain-specific knowledge graphs from literature; 2) Systems Engineer formulates technical requirements from human input; 3) Design Engineer proposes candidates using design knowledge graphs and computational tools; with iterative feedback loops between agents until validation.

Result: Demonstrated successful application to aerodynamic optimization of 4-digit NACA airfoils, showing how collaborative AI agents with structured knowledge representations can enhance the design process.

Conclusion: The multi-agent AI framework enhances efficiency, consistency, and quality in engineering design by enabling collaborative agents with structured knowledge to work together through iterative design-review loops.

Abstract: The engineering design process often demands expertise from multiple domains, leading to complex collaborations and iterative refinements. Traditional methods can be resource-intensive and prone to inefficiencies. To address this, we formalize the engineering design process through a multi-agent AI framework that integrates structured design and review loops. The framework introduces specialized knowledge-driven agents that collaborate to generate and refine design candidates. As an exemplar, we demonstrate its application to the aerodynamic optimization of 4-digit NACA airfoils. The framework consists of three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems Engineer. The Graph Ontologist employs a Large Language Model (LLM) to construct two domain-specific knowledge graphs from airfoil design literature. The Systems Engineer, informed by a human manager, formulates technical requirements that guide design generation and evaluation. The Design Engineer leverages the design knowledge graph and computational tools to propose candidate airfoils meeting these requirements. The Systems Engineer reviews and provides feedback both qualitative and quantitative using its own knowledge graph, forming an iterative feedback loop until a design is validated by the manager. The final design is then optimized to maximize performance metrics such as the lift-to-drag ratio. Overall, this work demonstrates how collaborative AI agents equipped with structured knowledge representations can enhance efficiency, consistency, and quality in the engineering design process.

[305] Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment

Lijun Zhang, Lin Li, Wei Wei, Yajie Qi, Huizhong Song, Jun Wang, Yaodong Yang, Jiye Liang

Main category: cs.AI

TL;DR: RSA is a risk-aware alignment method that uses nested risk measures to control safety risks during language model fine-tuning, addressing limitations of risk-neutral approaches by mitigating excessive model shifts and suppressing rare but catastrophic harmful behaviors.

DetailsMotivation: Existing safety alignment methods like Safe RLHF and SACPO operate under risk-neutral paradigms that are insufficient for controlling risks from policy deviations and lack robustness against rare but potentially catastrophic harmful behaviors.

Method: RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through stepwise alignment with token-level policy updates derived from nested risk measures.

Result: Experimental results show RSA achieves high helpfulness while ensuring strong safety and significantly suppresses tail risks (low-probability yet high-impact unsafe responses).

Conclusion: RSA provides an effective risk-aware alignment approach that explicitly incorporates risk awareness into policy optimization, offering better control over safety risks compared to risk-neutral methods.

Abstract: When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, typically operate under a risk-neutral paradigm that is insufficient to address the risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into the policy optimization process by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure that yields token-level policy updates derived from the nested risk measures. This design offers two key benefits: (1) it mitigates risks induced by excessive model shift away from a reference policy, and (2) it explicitly suppresses low-probability yet high-impact harmful behaviors. Moreover, we provide theoretical analysis on policy optimality under mild assumptions. Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety and significantly suppresses tail risks, namely low-probability yet high-impact unsafe responses.
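
A toy, sample-based illustration of a nested risk measure of the CVaR family applied backward over token-level rewards; the recursion and function names are assumptions made for intuition, not the paper's exact formulation.

```python
import numpy as np

def cvar(samples: np.ndarray, alpha: float) -> float:
    """Conditional value-at-risk: mean of the worst alpha-fraction of samples."""
    k = max(1, int(np.ceil(alpha * len(samples))))
    return float(np.sort(samples)[:k].mean())

def nested_risk_value(rewards: np.ndarray, alpha: float) -> float:
    """Nested (recursive) risk measure over token-level rewards.

    rewards: (n_samples, horizon) rewards for sampled continuations.
    Stepping backward from the final token, the step-t value is
    r_t + CVaR_alpha of the next-step values, so rare catastrophic
    continuations dominate the objective instead of averaging out.
    """
    values = rewards[:, -1].copy()
    for t in range(rewards.shape[1] - 2, -1, -1):
        values = rewards[:, t] + cvar(values, alpha)
    return cvar(values, alpha)
```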

[306] Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents

Seohui Bae, Jeonghye Kim, Youngchul Sung, Woohyung Lim

Main category: cs.AI

TL;DR: Proposes a test-time adaptive agent that refines beliefs through posterior-guided exploration without gradient updates or additional training, using structured belief maintenance and information gain maximization for action selection.

DetailsMotivation: Addresses the challenge of LLM agents operating under partial observability where they need to align with latent world states without relying on computationally expensive gradient-based updates or additional training.

Method: Maintains external structured belief over environment state, iteratively updates via action-conditioned observations, selects actions by maximizing predicted information gain using lightweight LLM-based surrogate, and assesses world alignment through novel consistency reward between posterior belief and ground-truth.

Result: Outperforms inference-time scaling baselines (prompt-augmented or retrieval-enhanced LLMs) in aligning with latent world states, while incurring significantly lower integration overhead.

Conclusion: Demonstrates effective test-time adaptation through posterior-guided belief refinement without requiring gradient updates or additional training, offering a practical approach for LLM agents in partially observable environments.

Abstract: In this paper, we propose a test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement, without relying on gradient-based updates or additional training, for an LLM agent operating under partial observability. Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between the posterior belief and the ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines, such as prompt-augmented or retrieval-enhanced LLMs, in aligning with latent world states, with significantly lower integration overhead.
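
The action-selection criterion can be written as expected information gain over a discrete belief. A minimal sketch, assuming a discrete state space and an observation model that stands in for the paper's LLM-based surrogate:

```python
import math

def entropy(belief):
    """Shannon entropy of a discrete belief {state: probability}."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def expected_info_gain(belief, action, obs_model):
    """EIG(a) = H(belief) - E_o[ H(belief | a, o) ].

    obs_model(state, action) -> {observation: probability}; here it is an
    assumed interface playing the role of the lightweight LLM surrogate."""
    prior_h = entropy(belief)
    p_obs = {}
    for s, p_s in belief.items():                 # marginalize observations
        for o, p_o in obs_model(s, action).items():
            p_obs[o] = p_obs.get(o, 0.0) + p_s * p_o
    posterior_h = 0.0
    for o, p in p_obs.items():                    # Bayes update per outcome
        post = {s: p_s * obs_model(s, action).get(o, 0.0) / p
                for s, p_s in belief.items()}
        posterior_h += p * entropy(post)
    return prior_h - posterior_h

# The agent then acts greedily on the criterion:
# best = max(actions, key=lambda a: expected_info_gain(belief, a, obs_model))
```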

[307] What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

Main category: cs.AI

TL;DR: The paper investigates JEPA-based world models for efficient planning in representation space, analyzing key components and proposing an improved model that outperforms baselines in navigation and manipulation tasks.

DetailsMotivation: To develop AI agents capable of solving diverse physical tasks and generalizing to unseen tasks/environments, with focus on improving planning efficiency through representation-space planning rather than input-space planning.

Method: Characterizes JEPA-WMs (Joint Embedding Predictive Architecture World Models), conducts comprehensive study of model architecture, training objectives, and planning algorithms using simulated environments and real-world robotic data.

Result: Proposed model outperforms established baselines DINO-WM and V-JEPA-2-AC in both navigation and manipulation tasks, with code, data and checkpoints made publicly available.

Conclusion: JEPA-based world models with optimized components enable more efficient planning in representation space, advancing the development of generalizable AI agents for physical tasks.

Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting away irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.
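
A minimal sketch of planning in representation space, assuming encoder/predictor interfaces and a random-shooting scorer; methods in this family typically use CEM-style iterative refinement rather than a single random batch.

```python
import torch

def plan_in_latent_space(encoder, predictor, obs, goal_obs,
                         n_candidates=256, horizon=8, action_dim=4):
    """Score random action sequences by rolling out a latent predictor.

    encoder(obs) -> z and predictor(z, a) -> z_next are assumed interfaces.
    Nothing is ever decoded back to pixels: planning stays in latent space.
    """
    with torch.no_grad():
        z0 = encoder(obs)              # (d,)
        z_goal = encoder(goal_obs)     # (d,)
        actions = torch.randn(n_candidates, horizon, action_dim)
        z = z0.expand(n_candidates, -1)
        for t in range(horizon):
            z = predictor(z, actions[:, t])       # latent rollout
        scores = -torch.norm(z - z_goal, dim=-1)  # closer to goal is better
        best = scores.argmax()
    return actions[best, 0]  # execute first action, then replan (MPC style)
```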

[308] Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments

Zhiwei Wei, Yuxing Liu, Hua Liao, Wenjia Xu

Main category: cs.AI

TL;DR: Interactive evaluation framework for FM agents’ spatial understanding in symbolic map environments, revealing distinct roles of exploration, memory, and reasoning components.

DetailsMotivation: Existing evaluations of spatial ability in foundation models rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding needed for reliable map-based reasoning and applications.

Method: Proposed interactive evaluation framework where agents incrementally explore partially observable grid-based maps (roads, intersections, POIs) with local observations, then evaluated on six spatial tasks. Systematically varied exploration strategies, memory representations, and reasoning schemes across multiple foundation models.

Result: Exploration affects experience acquisition but has limited impact on final reasoning accuracy. Memory representation plays central role in consolidating spatial experience, with structured memories (sequential and graph-based) substantially improving performance on structure-intensive tasks like path planning. Reasoning schemes shape how stored knowledge is used, with advanced prompts supporting more effective multi-step inference. Spatial reasoning performance saturates beyond certain capability threshold.

Conclusion: Improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone, as performance saturates across model versions and scales beyond certain capability threshold.

Abstract: Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated using six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has a limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience, with structured memories, particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.

[309] Evaluating the Reasoning Abilities of LLMs on Underrepresented Mathematics Competition Problems

Samuel Golladay, Majid Bani-Yaghoub

Main category: cs.AI

TL;DR: LLMs struggle with underrepresented math competition problems, especially Geometry, with DeepSeek-V3 performing best but still making computational/logical errors.

DetailsMotivation: Most LLM math reasoning studies use the same benchmark datasets, limiting generalizability and failing to capture diverse mathematical challenges. This study aims to analyze LLM performance on underrepresented mathematics competition problems to better understand their limitations.

Method: Tested three leading LLMs (GPT-4o-mini, Gemini-2.0-Flash, DeepSeek-V3) on Missouri Collegiate Mathematics Competition problems in Calculus, Analytic Geometry, and Discrete Mathematics. Compared LLM responses to known correct solutions and analyzed reasoning patterns to identify error types across problem domains.

Result: DeepSeek-V3 performed best across all three categories in both reasoning and final answers. All LLMs showed notably weak performance in Geometry. DeepSeek-V3 errors were mainly computational/logical, GPT-4o-mini had logical/approach errors, and Gemini struggled with incomplete reasoning and rushed conclusions.

Conclusion: Evaluating LLMs on underrepresented mathematics competition datasets reveals distinct error patterns and highlights ongoing challenges in structured reasoning, particularly in Geometry, suggesting current benchmarks may not fully capture LLMs’ mathematical reasoning limitations.

Abstract: Understanding the limitations of Large Language Models, or LLMs, in mathematical reasoning has been the focus of several recent studies. However, the majority of these studies use the same datasets for benchmarking, which limits the generalizability of their findings and may not fully capture the diverse challenges present in mathematical tasks. The purpose of the present study is to analyze the performance of LLMs on underrepresented mathematics competition problems. We prompted three leading LLMs, namely GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3, with the Missouri Collegiate Mathematics Competition problems in the areas of Calculus, Analytic Geometry, and Discrete Mathematics. The LLMs' responses were then compared to the known correct solutions in order to determine the accuracy of each LLM for each problem domain. We also analyzed the LLMs' reasoning to explore patterns in errors across problem types and models. DeepSeek-V3 had the best performance in all three categories of Calculus, Analytic Geometry, and Discrete Mathematics, both in reasoning and correct final answers. All three LLMs exhibited notably weak performance in Geometry. The majority of errors made by DeepSeek-V3 were attributed to computational and logical mistakes, whereas GPT-4o-mini frequently exhibited logical and approach-related errors. Gemini, on the other hand, tended to struggle with incomplete reasoning and drawing rushed conclusions. In conclusion, evaluating LLMs on underrepresented mathematics competition datasets can provide deeper insights into their distinct error patterns and highlight ongoing challenges in structured reasoning, particularly within the domain of Geometry.

[310] From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning

Amir Tahmasbi, Sadegh Majidi, Kazem Taram, Aniket Bera

Main category: cs.AI

TL;DR: A two-stage approach for spatial reasoning in LLMs: first fine-tune on basic spatial transformations, then train lightweight adapters for multi-step planning in puzzle environments, outperforming baselines with faster convergence.

DetailsMotivation: LLMs have strong general language capabilities but struggle with spatial transformations and multi-step planning in structured environments, which is important for applications in navigation and planning.

Method: Two-stage approach: 1) Supervised fine-tuning on elementary spatial transformations (rotation, translation, scaling) to give model basic spatial physics; 2) Freeze physics-aware model and train lightweight LoRA adapters within GRPO framework to learn policies for composing building blocks in multi-step planning. Uses synthesized ASCII-art dataset and ASCII-based RL environment.

Result: Method consistently outperforms baselines (generic backbone, physics-aware model, end-to-end RL models) in both Dynamic environments with explicit state updates and Static environments requiring internal state tracking. Approach converges faster and exhibits more stable training than end-to-end RL from scratch. Attention pattern analysis shows meaningful improvements in spatial understanding.

Conclusion: The proposed two-stage decomposition of spatial reasoning into atomic building blocks and their composition effectively enhances LLMs’ spatial reasoning capabilities, with better performance, faster convergence, and more stable training compared to existing approaches.

Abstract: Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step planning in puzzle-based environments, in a closed-loop manner. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training compared to end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
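
Stage 2's setup (frozen physics-aware backbone plus trainable LoRA adapters) can be expressed with the peft library roughly as follows; the checkpoint path and target modules are placeholders, and the GRPO training loop itself is omitted.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stage 1 is assumed done: a hypothetical checkpoint fine-tuned on
# elementary transformations (rotation, translation, scaling).
physics_ckpt = "path/to/physics-aware-model"
base = AutoModelForCausalLM.from_pretrained(physics_ckpt)

# Stage 2: freeze the physics-aware backbone; only the LoRA adapters are
# trained, here under a GRPO-style objective as described in the paper.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])  # assumed modules
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # adapters only; the backbone stays frozen
```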

[311] MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use

Wenrui Liu, Zixiang Liu, Elsie Dai, Wenhan Yu, Lei Yu, Tong Yang

Main category: cs.AI

TL;DR: MCPAgentBench: A benchmark for evaluating LLM agents’ tool-use capabilities using real-world MCP definitions and simulated tools in a dynamic sandbox environment with distractor tools.

DetailsMotivation: Current MCP evaluation sets have limitations: they rely on external MCP services and lack difficulty awareness. As LLMs increasingly serve as autonomous agents using tools via MCP, there's a need for better evaluation benchmarks.

Method: Created MCPAgentBench with authentic tasks and simulated MCP tools. Uses dynamic sandbox environment with candidate tool lists containing distractors to test tool selection and discrimination. Introduces comprehensive metrics for task completion rates and execution efficiency.

Result: Experiments on various latest mainstream LLMs reveal significant performance differences in handling complex, multi-step tool invocations. The benchmark effectively evaluates tool-use capabilities.

Conclusion: MCPAgentBench addresses limitations of current MCP evaluation methods and provides a robust benchmark for assessing LLM agents’ tool-use capabilities. The code is open-source for community use.

Abstract: Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real-world MCP definitions designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on a range of the latest mainstream Large Language Models reveal significant performance differences in handling complex, multi-step tool invocations. All code is open-source on GitHub.
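
A minimal sketch of how a distractor-augmented candidate tool list and a simple selection score might look; the field names, penalty weight, and scoring rule are assumptions, not the benchmark's actual metrics.

```python
import random

def build_candidate_tools(task, tool_pool, n_distractors=5, seed=0):
    """Assemble the tool list shown to the agent: the tools the task
    actually needs plus sampled distractors, shuffled together."""
    rng = random.Random(seed)
    required = set(task["required_tools"])
    distractors = rng.sample([t for t in tool_pool if t not in required],
                             n_distractors)
    candidates = list(required) + distractors
    rng.shuffle(candidates)
    return candidates

def tool_selection_score(calls, task):
    """Fraction of required tools invoked, penalized for distractor calls."""
    required = set(task["required_tools"])
    hit = len(required & set(calls)) / len(required)
    noise = len(set(calls) - required)
    return max(0.0, hit - 0.1 * noise)
```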

[312] Recursive Language Models

Alex L. Zhang, Tim Kraska, Omar Khattab

Main category: cs.AI

TL;DR: RLMs enable LLMs to process arbitrarily long prompts via recursive self-calling on prompt snippets, achieving 100x context scaling with better quality and comparable cost.

DetailsMotivation: LLMs have limited context windows that restrict their ability to process arbitrarily long prompts, creating a need for inference-time scaling solutions.

Method: Propose Recursive Language Models (RLMs) - an inference strategy where LLMs treat long prompts as external environments, programmatically examine/decompose them, and recursively call themselves on prompt snippets.

Result: RLMs handle inputs up to 100x beyond model context windows, outperform base LLMs and common long-context scaffolds across four diverse tasks, with comparable or cheaper cost per query.

Conclusion: RLMs provide an effective general inference strategy for processing arbitrarily long prompts, achieving dramatic context scaling with quality improvements and cost efficiency.

Abstract: We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
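
The recursive call pattern, reduced to a sketch: the actual RLM lets the model decompose the prompt programmatically inside an environment, whereas this toy version splits at fixed character boundaries.

```python
def rlm_answer(llm, question, prompt, max_chars=8000):
    """Recursive Language Model inference, minimally sketched.

    llm(text) -> str is any chat-completion call (assumed interface).
    If the prompt fits the context budget, answer directly; otherwise
    split it, recurse on each snippet, and answer from the merged notes.
    """
    if len(prompt) <= max_chars:
        return llm(f"Context:\n{prompt}\n\nQuestion: {question}")
    mid = len(prompt) // 2
    notes = [rlm_answer(llm, question, part, max_chars)
             for part in (prompt[:mid], prompt[mid:])]
    merged = "\n---\n".join(notes)
    return rlm_answer(llm, question, merged, max_chars)
```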

[313] Reinforcement Learning-Augmented LLM Agents for Collaborative Decision Making and Performance Optimization

Dong Qiu, Duo Xu, Limengxi Yue

Main category: cs.AI

TL;DR: RL-augmented LLM framework for multi-agent collaboration using Dec-POMDP formulation with CTDE, achieving 3x speedup in collaborative tasks.

DetailsMotivation: LLMs lack collaborative awareness and struggle with global optimization in multi-agent settings, needing better coordination for complex workflows.

Method: Formulates cooperation as Dec-POMDP with CTDE, introduces Group Relative Policy Optimization (GRPO) for joint policy optimization with global signals, uses simplified joint reward balancing task quality, speed, and coordination cost.

Result: 3x increase in task processing speed over single-agent baselines, 98.7% structural/style consistency in writing, 74.6% test pass rate in coding, consistently outperforms multi-agent LLM baselines.

Conclusion: The framework provides a practical path toward reliable collaboration in complex workflows by enabling effective multi-agent LLM coordination.

Abstract: Large Language Models (LLMs) perform well in language tasks but often lack collaborative awareness and struggle to optimize global performance in multi-agent settings. We present a reinforcement learning-augmented LLM agent framework that formulates cooperation as a decentralized partially observable Markov decision process (Dec-POMDP) and adopts centralized training with decentralized execution (CTDE). We introduce Group Relative Policy Optimization (GRPO) to jointly optimize agent policies with access to global signals during training, together with a simplified joint reward that balances task quality, speed, and coordination cost. On collaborative writing and coding benchmarks, our framework delivers a 3x increase in task processing speed over single-agent baselines, 98.7% structural/style consistency in writing, and a 74.6% test pass rate in coding. The approach consistently outperforms strong multi-agent LLM baselines and provides a practical path toward reliable collaboration in complex workflows.
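
The simplified joint reward reads as a weighted trade-off; a sketch with illustrative placeholder weights (the abstract does not report the actual values):

```python
def joint_reward(quality: float, elapsed: float, coord_cost: float,
                 w_quality=1.0, w_speed=0.3, w_coord=0.2) -> float:
    """Simplified joint reward balancing task quality, speed, and
    coordination cost, shared by all agents under CTDE.

    The weights here are illustrative placeholders, not the paper's values.
    """
    return w_quality * quality - w_speed * elapsed - w_coord * coord_cost
```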

[314] Group Deliberation Oriented Multi-Agent Conversational Model for Complex Reasoning

Zheyu Shi, Dong Qiu, Shanlong Yu

Main category: cs.AI

TL;DR: A multi-agent conversational model with three-level role division (generation, verification, integration) improves complex reasoning through self-game mechanisms, retrieval enhancement, and composite reward training.

DetailsMotivation: To address limitations of single large language models in complex reasoning tasks by leveraging group deliberation dynamics and multi-agent collaboration.

Method: Three-level role architecture: opinion generation agent produces diverse perspectives, evidence verification agent retrieves external knowledge, consistency arbitration agent integrates conclusions. Includes self-game mechanism for multi-path reasoning, retrieval enhancement module, composite reward function (factual consistency + logical coherence), and improved proximal policy optimization for collaborative training.

Result: Improves multi-hop reasoning accuracy by 16.8% on HotpotQA, 14.3% on 2WikiMultihopQA, and 19.2% on MeetingBank, while improving consistency by 21.5%. Achieves higher reasoning efficiency than mainstream multi-agent approaches.

Conclusion: The proposed model provides an effective and stable solution for complex reasoning tasks through group deliberation oriented multi-agent collaboration with specialized role division and training mechanisms.

Abstract: This paper proposes a group deliberation oriented multi-agent conversational model to address the limitations of single large language models in complex reasoning tasks. The model adopts a three-level role division architecture consisting of generation, verification, and integration. An opinion generation agent produces diverse reasoning perspectives, an evidence verification agent retrieves external knowledge and quantifies factual support, and a consistency arbitration agent integrates logically coherent conclusions. A self-game mechanism is introduced to expand multi-path reasoning trajectories, while a retrieval enhancement module dynamically supplements external knowledge. A composite reward function combining factual consistency and logical coherence is designed, and an improved proximal policy optimization strategy is applied for collaborative training. Experimental results show that the proposed model improves multi-hop reasoning accuracy by 16.8 percent on HotpotQA, 14.3 percent on 2WikiMultihopQA, and 19.2 percent on MeetingBank, while improving consistency by 21.5 percent. The model achieves higher reasoning efficiency than mainstream multi-agent approaches, providing an effective and stable solution for complex reasoning tasks.

[315] Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun

Main category: cs.AI

TL;DR: Youtu-Agent is a modular LLM agent framework that automates agent generation and enables continuous evolution through workflow/meta-agent modes and hybrid policy optimization.

DetailsMotivation: Existing LLM agent frameworks have high configuration costs and static capabilities, requiring extensive manual effort for tool integration and prompt engineering, and struggling to adapt to dynamic environments without expensive fine-tuning.

Method: Proposes Youtu-Agent with structured configuration system decoupling execution environments, toolkits, and context management. Features two generation paradigms: Workflow mode for standard tasks and Meta-Agent mode for complex requirements. Establishes hybrid policy optimization with Agent Practice module (in-context optimization) and Agent RL module (distributed reinforcement learning).

Result: Achieves SOTA performance on WebWalkerQA (71.47%) and GAIA (72.8%) with open-weight models. Automated generation achieves 81% tool synthesis success rate. Practice module improves AIME 2024/2025 performance by +2.7% and +5.4%. Agent RL achieves 40% speedup with steady improvement, enhancing coding/reasoning by 35% and searching by 21% on benchmarks.

Conclusion: Youtu-Agent successfully addresses configuration cost and static capability challenges through automated generation and continuous evolution, demonstrating significant performance improvements across multiple benchmarks.

Abstract: Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agent in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over an 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4%, respectively. Moreover, our Agent RL training achieves a 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities by up to 35% and 21%, respectively, on math and general/multi-hop QA benchmarks.

[316] Multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis under unseen working conditions

Pengcheng Xia, Yixiang Huang, Chengjin Qin, Chengliang Liu

Main category: cs.AI

TL;DR: A multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis that addresses domain shift and leverages multi-modal information for better generalization to unseen working conditions.

DetailsMotivation: Existing fault diagnosis methods suffer performance decline under unseen working conditions, domain adaptation relies on target domain samples, and most studies use single-modal signals, missing complementary multi-modal information.

Method: Proposes a dual disentanglement framework to separate modality-invariant/specific features and domain-invariant/specific representations, cross-domain mixed fusion for diversity augmentation, and triple-modal fusion for adaptive multi-modal integration.

Result: Extensive experiments on induction motor fault diagnosis under unseen constant and time-varying conditions show the method consistently outperforms advanced methods, with ablation studies verifying each component’s effectiveness.

Conclusion: The proposed multi-modal cross-domain mixed fusion model with dual disentanglement effectively addresses domain shift and leverages multi-modal information for robust fault diagnosis under unseen working conditions.

Abstract: Intelligent fault diagnosis has become an indispensable technique for ensuring machinery reliability. However, existing methods suffer significant performance decline in real-world scenarios where models are tested under unseen working conditions, while domain adaptation approaches are limited to their reliance on target domain samples. Moreover, most existing studies rely on single-modal sensing signals, overlooking the complementary nature of multi-modal information for improving model generalization. To address these limitations, this paper proposes a multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis. A dual disentanglement framework is developed to decouple modality-invariant and modality-specific features, as well as domain-invariant and domain-specific representations, enabling both comprehensive multi-modal representation learning and robust domain generalization. A cross-domain mixed fusion strategy is designed to randomly mix modality information across domains for modality and domain diversity augmentation. Furthermore, a triple-modal fusion mechanism is introduced to adaptively integrate multi-modal heterogeneous information. Extensive experiments are conducted on induction motor fault diagnosis under both unseen constant and time-varying working conditions. The results demonstrate that the proposed method consistently outperforms advanced methods and comprehensive ablation studies further verify the effectiveness of each proposed component and multi-modal fusion. The code is available at: https://github.com/xiapc1996/MMDG.

[317] BatteryAgent: Synergizing Physics-Informed Interpretation with LLM Reasoning for Intelligent Battery Fault Diagnosis

Songqi Zhou, Ruixue Liu, Boman Su, Jiazhou Wang, Yixing Wang, Benben Jiang

Main category: cs.AI

TL;DR: BatteryAgent: A hierarchical framework combining physical knowledge features with LLM reasoning for interpretable, multi-type battery fault diagnosis with root cause analysis and maintenance recommendations.

DetailsMotivation: Existing deep learning methods for battery fault diagnosis have two key limitations: they lack interpretability due to their "black-box" nature, and they are restricted to binary classification paradigms, preventing root cause analysis and maintenance recommendations.

Method: A three-layer hierarchical framework: (1) Physical Perception Layer extracts 10 mechanism-based features from electrochemical principles; (2) Detection and Attribution Layer uses Gradient Boosting Decision Trees and SHAP for feature contribution quantification; (3) Reasoning and Diagnosis Layer employs LLM as agent core to bridge numerical-semantic gap, combining SHAP attributions with mechanism knowledge base.

Result: Achieves AUROC of 0.986, significantly outperforming state-of-the-art methods, effectively corrects misclassifications on hard boundary samples, and extends binary detection to multi-type interpretable diagnosis.

Conclusion: BatteryAgent represents a paradigm shift from “passive detection” to “intelligent diagnosis” for battery safety management, offering comprehensive fault diagnosis with interpretability, root cause analysis, and maintenance suggestions.

Abstract: Fault diagnosis of lithium-ion batteries is critical for system safety. While existing deep learning methods exhibit superior detection accuracy, their “black-box” nature hinders interpretability. Furthermore, restricted by binary classification paradigms, they struggle to provide root cause analysis and maintenance recommendations. To address these limitations, this paper proposes BatteryAgent, a hierarchical framework that integrates physical knowledge features with the reasoning capabilities of Large Language Models (LLMs). The framework comprises three core modules: (1) A Physical Perception Layer that utilizes 10 mechanism-based features derived from electrochemical principles, balancing dimensionality reduction with physical fidelity; (2) A Detection and Attribution Layer that employs Gradient Boosting Decision Trees and SHAP to quantify feature contributions; and (3) A Reasoning and Diagnosis Layer that leverages an LLM as the agent core. This layer constructs a “numerical-semantic” bridge, combining SHAP attributions with a mechanism knowledge base to generate comprehensive reports containing fault types, root cause analysis, and maintenance suggestions. Experimental results demonstrate that BatteryAgent effectively corrects misclassifications on hard boundary samples, achieving an AUROC of 0.986, which significantly outperforms current state-of-the-art methods. Moreover, the framework extends traditional binary detection to multi-type interpretable diagnosis, offering a new paradigm shift from “passive detection” to “intelligent diagnosis” for battery safety management.
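
Layer (2)'s detection-plus-attribution step maps naturally onto a gradient-boosted classifier with SHAP attributions. A self-contained sketch on synthetic stand-in data (the real pipeline uses the 10 mechanism-based features and a GBDT variant the abstract does not fully specify):

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # stand-in for the 10 physical features
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # stand-in fault labels

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample feature attributions

# In the full framework, the top attributions would be verbalized and
# passed, together with the mechanism knowledge base, to the LLM
# reasoning layer that writes the diagnosis report.
```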

[318] Explaining Why Things Go Where They Go: Interpretable Constructs of Human Organizational Preferences

Emmanuel Fashae, Michael Burke, Leimin Tian, Lingheng Meng, Pamela Carreno-Medrano

Main category: cs.AI

TL;DR: The paper introduces four interpretable constructs for object arrangement preferences and validates them through a questionnaire study, then integrates them into a robot planner.

DetailsMotivation: Current robotic systems use latent preference models from human demonstrations, but these lack interpretability about what factors guide human decisions in object arrangement.

Method: 1) Designed and validated a self-report questionnaire measuring four constructs (spatial practicality, habitual convenience, semantic coherence, commonsense appropriateness) through a 63-participant online study. 2) Integrated these constructs into a Monte Carlo Tree Search (MCTS) planner for robotic rearrangement.

Result: The questionnaire study confirmed the psychological distinctiveness of the four constructs and their explanatory power across kitchen and living room scenarios. The MCTS planner guided by participant-derived preferences generated reasonable arrangements that closely aligned with human-generated arrangements.

Conclusion: The work contributes a compact, interpretable formulation of object arrangement preferences and demonstrates how it can be operationalized for robot planning, moving beyond black-box latent models to more transparent preference modeling.

Abstract: Robotic systems for household object rearrangement often rely on latent preference models inferred from human demonstrations. While effective at prediction, these models offer limited insight into the interpretable factors that guide human decisions. We introduce an explicit formulation of object arrangement preferences along four interpretable constructs: spatial practicality (putting items where they naturally fit best in the space), habitual convenience (making frequently used items easy to reach), semantic coherence (placing items together if they are used for the same task or are contextually related), and commonsense appropriateness (putting things where people would usually expect to find them). To capture these constructs, we designed and validated a self-report questionnaire through a 63-participant online study. Results confirm the psychological distinctiveness of these constructs and their explanatory power across two scenarios (kitchen and living room). We demonstrate the utility of these constructs by integrating them into a Monte Carlo Tree Search (MCTS) planner and show that when guided by participant-derived preferences, our planner can generate reasonable arrangements that closely align with those generated by participants. This work contributes a compact, interpretable formulation of object arrangement preferences and a demonstration of how it can be operationalized for robot planning.
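
A sketch of how the four constructs could enter a planner's evaluation function as a weighted sum; the scorer interface and the weight source are assumptions about, not excerpts from, the authors' MCTS integration.

```python
def arrangement_score(arrangement, user_weights, scorers):
    """Preference score for a candidate arrangement as a weighted sum of
    the four interpretable constructs.

    user_weights would come from the validated questionnaire responses;
    scorers maps each construct name to a function(arrangement) -> [0, 1]
    encoding spatial fit, reach convenience, co-use, and expected location.
    An MCTS planner can use this score as its rollout/leaf value.
    """
    constructs = ["spatial_practicality", "habitual_convenience",
                  "semantic_coherence", "commonsense_appropriateness"]
    return sum(user_weights[c] * scorers[c](arrangement) for c in constructs)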

[319] GenZ: Foundational models as latent variable generators within traditional statistical models

Marko Jojic, Nebojsa Jojic

Main category: cs.AI

TL;DR: GenZ is a hybrid model that combines LLMs with statistical modeling by discovering interpretable semantic features through an iterative contrastive process, outperforming pure LLM baselines on house price prediction and collaborative filtering tasks.

DetailsMotivation: Large language models have broad domain knowledge but often fail to capture dataset-specific patterns critical for prediction tasks. There's a need to bridge foundational models with statistical modeling to leverage both general knowledge and data-specific insights.

Method: Uses an iterative process contrasting groups of items identified via statistical modeling errors to discover semantic feature descriptions. Formulated as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. Prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships.

Result: On house price prediction: 12% median relative error vs. 38% for GPT-5 baseline. On Netflix movie embeddings: predicts collaborative filtering representations with 0.59 cosine similarity from semantic descriptions alone, matching performance that would require ~4000 user ratings through traditional collaborative filtering.

Conclusion: GenZ successfully bridges foundational models and statistical modeling, discovering dataset-specific semantic features that outperform pure LLM approaches. The method reveals domain-specific patterns (architectural details for housing, franchise membership for movies) that diverge from the model’s general domain knowledge alone.

Abstract: We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model’s domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM’s general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions – matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model’s domain knowledge alone.
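
The generalized-EM loop, reduced to a sketch: contrast badly-fit items against well-fit ones to propose a new semantic feature, label all items on the features with the frozen LLM (E-like step), then refit the statistical model (M-like step). The prompting functions and the Ridge choice are stand-ins for the paper's machinery.

```python
from sklearn.linear_model import Ridge

def genz_fit(items, targets, llm_classify, llm_contrast, n_rounds=5):
    """GenZ-style generalized-EM loop, minimally sketched.

    llm_classify(item, feature_desc) -> 0 or 1 is a frozen-LLM judgment;
    llm_contrast(bad_fit_items, good_fit_items) -> str proposes a new
    semantic feature description. Both are assumed interfaces.
    """
    descs, model = [], None
    mean_t = sum(targets) / len(targets)
    errors = [abs(t - mean_t) for t in targets]
    for _ in range(n_rounds):
        # Contrast worst-fit vs. best-fit items to propose a feature that
        # explains the residual structure the current model misses.
        order = sorted(range(len(items)), key=lambda i: errors[i])
        descs.append(llm_contrast([items[i] for i in order[-10:]],
                                  [items[i] for i in order[:10]]))
        # E-like step: frozen LLM labels every item on every feature.
        X = [[llm_classify(it, d) for d in descs] for it in items]
        # M-like step: refit the statistical model on the binary features.
        model = Ridge().fit(X, targets)
        errors = [abs(p - t) for p, t in zip(model.predict(X), targets)]
    return model, descs
```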

[320] A study on constraint extraction and exception exclusion in care worker scheduling

Koki Suenaga, Tomohiro Furuta, Satoshi Ono

Main category: cs.AI

TL;DR: A method using constraint templates to extract facility-specific scheduling constraints from manager interviews for long-term care facilities, with mechanisms to exclude exceptional constraints.

DetailsMotivation: Long-term care facilities have varying conditions requiring facility-specific constraint conditions for scheduling, necessitating interviews with managers who create shift schedules.

Method: Uses constraint templates to extract combinations of various components (shift patterns, staff combinations) by changing parameters like number of days and staff members, with mechanisms to exclude exceptional constraints.

Result: Successfully created schedules satisfying all hard constraints and reduced soft constraint violations by circumventing extraction of exceptional constraints.

Conclusion: The proposed constraint template method effectively extracts facility-specific scheduling constraints while filtering out exceptions, enabling creation of practical schedules for long-term care facilities.

Abstract: Technologies for automatically generating work schedules have been extensively studied; however, in long-term care facilities, conditions vary between facilities, making it essential to interview the managers who create shift schedules in order to design facility-specific constraint conditions. The proposed method utilizes constraint templates to extract combinations of various components, such as shift patterns over consecutive days or staff combinations. The templates can extract a variety of constraints by varying the number of days and staff members in focus, and by switching the extraction target between patterns and frequency. In addition, unlike existing constraint extraction techniques, this study incorporates mechanisms to exclude exceptional constraints. The extracted constraints can be employed by a constraint programming solver to create care worker schedules. Experiments demonstrated that our proposed method successfully created schedules that satisfied all hard constraints and reduced the number of violations of soft constraints by circumventing the extraction of exceptional constraints.
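
One possible instantiation of a pattern template with exception exclusion, assuming two-day windows and symbolic shift codes; the threshold logic is an illustrative guess at how rare occurrences might be filtered out, not the paper's rule.

```python
from collections import Counter

def extract_forbidden_patterns(rosters, window=2, exception_rate=0.01):
    """Template: mine consecutive-day shift patterns from past rosters.

    A pattern that (almost) never occurs becomes a forbidden-pattern
    constraint; occurrence rates below `exception_rate` are treated as
    exceptions rather than as evidence that the pattern is allowed.
    rosters: per-staff shift sequences, e.g. ["D", "D", "N", "-"].
    """
    counts, total = Counter(), 0
    for seq in rosters:
        for i in range(len(seq) - window + 1):
            counts[tuple(seq[i:i + window])] += 1
            total += 1
    shifts = {s for seq in rosters for s in seq}
    candidates = {(a, b) for a in shifts for b in shifts}  # window == 2
    return {p for p in candidates if counts[p] / total <= exception_rate}
```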

[321] Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng

Main category: cs.AI

TL;DR: ALE is a comprehensive ecosystem for developing agentic LLMs, featuring ROLL for weight optimization, ROCK for environment management, and iFlow CLI for context engineering. The authors release ROME, an open-source agent trained on 1M+ trajectories, using novel data composition and IPA algorithm for stable long-horizon training.

DetailsMotivation: The open-source community lacks a principled, end-to-end ecosystem for developing agentic LLMs that can operate in real-world environments over multiple turns, taking actions, observing outcomes, and iteratively refining artifacts.

Method: ALE infrastructure with three components: ROLL (post-training weight optimization), ROCK (sandbox environment manager for trajectory generation), and iFlow CLI (agent framework for context engineering). Includes data composition protocols for synthesizing complex behaviors and IPA (Interaction-based Policy Alignment) algorithm that assigns credit over semantic interaction chunks rather than individual tokens.

Result: Released ROME (ROME is Obviously an Agentic Model), an open-source agent trained on over one million trajectories. Evaluated on Terminal Bench Pro (improved scale and contamination control) and demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench.

Conclusion: ALE provides an effective foundational infrastructure for agent LLM development, with ROME proving the ecosystem’s effectiveness through strong benchmark performance, addressing the need for principled, end-to-end agent development pipelines.

Abstract: Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.

[322] Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing

Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, Jakub Winter

Main category: cs.AI

TL;DR: A semi-automated data annotation pipeline for driving scenarios that combines AI with human expertise to reduce annotation costs and time while ensuring quality.

Motivation: Manual annotation of large-scale, multimodal driving datasets is expensive and time-consuming, especially for Polish driving conditions where such datasets are needed for autonomous vehicle research.

Method: Human-in-the-loop approach using 3D object detection algorithms to generate initial annotations, with iterative model retraining, data anonymization, and domain adaptation techniques.
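
To make the loop concrete, here is a minimal sketch of such a human-in-the-loop cycle; `detector`, `review`, and the retraining interface are hypothetical placeholders, not the DARTS tooling:

```python
# Minimal sketch of a human-in-the-loop annotation cycle (illustrative only;
# the detector, human-review callback, and data types are placeholders).
def annotation_cycle(detector, frames, review, rounds=3):
    """Pre-annotate with a 3D detector, have humans correct the proposals,
    then retrain the detector on the corrected labels and repeat."""
    for _ in range(rounds):
        proposals = [detector.predict(f) for f in frames]              # AI pass
        corrected = [review(f, p) for f, p in zip(frames, proposals)]  # human pass
        detector.fit(frames, corrected)                                # retrain
    return detector
```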

Result: Substantial time savings while maintaining consistent, high-quality annotations across different sensor modalities, accelerating dataset preparation in standardized format.

Conclusion: The developed pipeline effectively supports the DARTS project by creating large annotated datasets for autonomous vehicle research in Poland, strengthening the technological foundation.

Abstract: This report presents the design and implementation of a semi-automated data annotation pipeline developed within the DARTS project, whose goal is to create a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. Manual annotation of such heterogeneous data is both costly and time-consuming. To address this challenge, the proposed solution adopts a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce annotation cost and duration. The system automatically generates initial annotations, enables iterative model retraining, and incorporates data anonymization and domain adaptation techniques. At its core, the tool relies on 3D object detection algorithms to produce preliminary annotations. Overall, the developed tools and methodology result in substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities. The solution directly supports the DARTS project by accelerating the preparation of a large annotated dataset in the project’s standardized format, strengthening the technological base for autonomous vehicle research in Poland.

[323] Iterative Deployment Improves Planning Skills in LLMs

Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal

Main category: cs.AI

TL;DR: Iterative deployment of LLMs with user-curated data from previous deployments leads to significant model property changes, particularly improved planning skills and emergent generalization to longer plans.

Motivation: To understand how iterative deployment of LLMs with user-curated data affects model properties and to explore the implications of this deployment pattern for AI safety and training methodologies.

Method: Testing iterative deployment mechanism on various planning domains, where each new model is fine-tuned on data curated by users from previous models’ deployments, followed by theoretical analysis connecting this process to reinforcement learning.
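
The outer loop the authors study can be sketched in a few lines; `deploy`, `curate`, and `finetune` are hypothetical stand-ins for deployment, user data curation, and fine-tuning:

```python
# Schematic sketch of iterative deployment (illustrative; the three callbacks
# are hypothetical stand-ins, not the paper's experimental code).
def iterative_deployment(model, deploy, curate, finetune, generations=5):
    for _ in range(generations):
        outputs = deploy(model)        # users interact with the current model
        kept = curate(outputs)         # users keep only outputs they judge good
        model = finetune(model, kept)  # next model is trained on curated data
    return model
# User curation acts as an implicit reward: selecting "good" outputs plays the
# role RL's reward signal normally would, which is the paper's RL connection.
```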

Result: Substantial improvements in planning skills with later models showing emergent generalization by discovering much longer plans than initial models. Theoretical analysis reveals iterative deployment effectively implements RL training with implicit reward functions.

Conclusion: Iterative deployment creates an implicit RL training loop with safety implications due to undefined reward functions, and offers an alternative training regime using data curation instead of explicit rewards.

Abstract: We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models’ deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications for the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

[324] AMAP Agentic Planning Technical Report

Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Tucheng Lin, Xin Li, Ning Guo

Main category: cs.AI

TL;DR: STAgent is a specialized LLM agent for spatio-temporal tasks like POI discovery and itinerary planning, featuring tool interaction capabilities while preserving general performance.

Motivation: To create an agentic LLM specifically designed for complex spatio-temporal reasoning tasks that require interaction with specialized tools while maintaining general capabilities.

Method: Three key contributions: 1) stable tool environment with 10+ domain-specific tools for asynchronous training, 2) hierarchical data curation framework filtering high-quality queries at 1:10,000 ratio, 3) cascaded training recipe with seed SFT, second SFT on high-certainty queries, and RL on low-certainty data.
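
A rough sketch of the certainty-based routing between the second SFT stage and the RL stage might look as follows; the pass-rate proxy for certainty and the thresholds are our assumptions, not the paper's exact recipe:

```python
# Hedged sketch of certainty-based data routing between SFT and RL stages;
# the pass-rate proxy for certainty and the thresholds are assumptions.
def route_queries(queries, sample_answers, is_correct, n=8, hi=0.8, lo=0.2):
    """Estimate certainty as the seed model's pass rate over n samples,
    then send high-certainty queries to SFT and low-certainty ones to RL."""
    sft_pool, rl_pool = [], []
    for q in queries:
        passes = sum(is_correct(q, a) for a in sample_answers(q, n))
        certainty = passes / n
        if certainty >= hi:
            sft_pool.append(q)   # reliable supervision: fine-tune directly
        elif certainty >= lo:
            rl_pool.append(q)    # hard but learnable: leave for the RL stage
        # queries below `lo` are dropped as uninformative for either stage
    return sft_pool, rl_pool
```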

Result: STAgent shows promising performance on TravelBench while maintaining general capabilities across various benchmarks, demonstrating effective spatio-temporal reasoning with tool interaction.

Conclusion: The proposed STAgent successfully combines specialized spatio-temporal understanding with preserved general capabilities through innovative training methods and tool integration.

Abstract: We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries with a filter ratio of 1:10,000, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.

[325] Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings

Tianzhi He, Farrokh Jazizadeh

Main category: cs.AI

TL;DR: LLM-based BEMS AI agent framework for smart building energy management via natural language interaction, achieving 49-97% accuracy across different task types.

Motivation: To address limitations in existing energy management systems by leveraging LLMs' autonomous data analytics capabilities for context-aware energy management through natural language interaction.

Method: Proposed a three-module framework (perception, central control, action) forming a closed feedback loop. Evaluated prototype using 120 user queries across four real-world residential energy datasets with metrics including latency, functionality, capability, accuracy, and cost-effectiveness.
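
One step of the closed loop could be sketched like this; the module interfaces are illustrative assumptions, not the paper's API:

```python
# Minimal sketch of one perception -> central control -> action cycle;
# the module interfaces are illustrative assumptions, not the paper's API.
def bems_step(sensors, llm_brain, actuators, user_query):
    readings = sensors.read()                      # perception: capture data
    plan = llm_brain.decide(readings, user_query)  # brain: analyze + interpret
    feedback = actuators.execute(plan)             # action: control appliances
    llm_brain.observe(feedback)                    # close the feedback loop
    return plan
```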

Result: Promising performance with response accuracy: device control (86%), memory tasks (97%), scheduling/automation (74%), energy analysis (77%), but lower accuracy for complex cost estimation (49%). Framework generalizability demonstrated via ANOVA tests.

Conclusion: The study formalizes assessment of LLM-based BEMS AI agents, identifies future research directions, and highlights trade-offs between response accuracy and computational efficiency.

Abstract: This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype’s performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.

[326] Are Biological Systems More Intelligent Than Artificial Intelligence?

Michael Timothy Bennett

Main category: cs.AI

TL;DR: Biological systems are more intelligent than AI because they delegate adaptation across abstraction layers, while AI has static lower layers that limit adaptability.

Motivation: To understand why biological self-organizing systems appear more intelligent than artificial intelligence by examining how they delegate control and adaptation across abstraction layers.

Method: Develops Stack Theory to model systems as abstraction layers, compares delegation patterns in computational, biological, military, governmental and economic systems, and proves The Law of the Stack theorem about adaptability requirements.

Result: Biological systems are more adaptable because they delegate adaptation throughout their stack, while AI’s static lower layers create inflexibility. Cancer-like failures occur when delegation is inadequate, and robust systems require mission command-style delegation.

Conclusion: Building more intelligent systems requires delegating control like biological systems do, with hybrid agent design involving careful constraint of low-level policies to achieve desired collective behavior while preserving identity.

Abstract: Are biological self-organising systems more ‘intelligent’ than artificial intelligence (AI)? If so, why? I explore this through a mathematical lens which frames intelligence in terms of adaptability. I model systems as stacks of abstraction layers (“Stack Theory”) and compare them by how they delegate agentic control down their stacks, illustrating with examples of computational, biological, human military, governmental and economic systems. Contemporary AI rests on a static, human-engineered stack in which lower layers are static during deployment. Put provocatively, static stacks resemble inflexible bureaucracies, adapting only top-down. Biological stacks are more ‘intelligent’ because they delegate adaptation. Formally, I prove a theorem (“The Law of the Stack”) showing adaptability in higher layers requires sufficient adaptability in lower layers. Generalising bio-electric explanations of cancer as isolation from collective informational structures, I explore how cancer-like failures occur in non-biological systems when delegation is inadequate. This helps explain how to build more robust systems, by delegating control like the military doctrine of mission command. It also provides a design perspective on hybrid agents (e.g. organoids, systems involving both humans and AI): hybrid creation is a boundary-condition design problem in which human-imposed constraints prune low-level policy spaces to yield desired collective behaviour while preserving collective identity.

[327] Neurosymbolic Association Rule Mining from Tabular Data

Erkan Karabulut, Paul Groth, Victoria Degeler

Main category: cs.AI

TL;DR: Aerial+ is a neurosymbolic ARM method that uses an under-complete autoencoder to create neural representations of data, extracts rules from these representations, and produces concise, high-quality rule sets with full data coverage.

Motivation: High-dimensional datasets in Association Rule Mining (ARM) lead to rule explosion, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion is a central challenge in ARM research.

Method: Aerial+ uses an under-complete autoencoder to create neural representations capturing feature associations, then extracts rules from these neural representations by exploiting the model’s reconstruction mechanism.
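
To illustrate the reconstruction trick, here is a toy sketch of how rules could be read off an autoencoder's outputs; the probing scheme and threshold are our simplifications, not the paper's exact procedure:

```python
# Toy sketch of reconstruction-based rule extraction in the spirit of Aerial+;
# the probing scheme and the threshold tau are illustrative simplifications.
def extract_rules(reconstruct, n_features, antecedents, tau=0.8):
    """Probe the trained autoencoder with a test vector that marks a candidate
    antecedent; features reconstructed above tau become the rule's consequents."""
    rules = []
    for ante in antecedents:                       # e.g. ante = (2, 5)
        probe = [0.0] * n_features
        for f in ante:
            probe[f] = 1.0                         # activate antecedent features
        recon = reconstruct(probe)                 # autoencoder forward pass
        cons = [f for f, p in enumerate(recon)
                if p >= tau and f not in ante]     # strongly co-activated features
        if cons:
            rules.append((tuple(ante), cons))      # rule: ante -> cons
    return rules
```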

Result: Extensive evaluations on five datasets against seven baselines show Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable ML models, it significantly reduces execution time while maintaining or improving accuracy.

Conclusion: Aerial+ effectively addresses the rule explosion problem in ARM through a neurosymbolic approach, producing compact rule sets that improve both efficiency and performance in downstream applications.

Abstract: Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model’s reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.

[328] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim

Main category: cs.AI

TL;DR: The paper proposes a reinforcement learning framework to teach AI agents contextual integrity - reasoning about appropriate information disclosure based on context - using a small synthetic dataset, achieving reduced inappropriate disclosures while maintaining task performance.

Motivation: As autonomous agents make decisions for users, ensuring contextual integrity (determining appropriate information to share for specific tasks) becomes crucial. Current AI systems need better reasoning about context to avoid inappropriate information disclosure.

Method: First prompts LLMs to reason explicitly about contextual integrity. Then develops a reinforcement learning framework that instills contextual reasoning. Uses a small synthetic dataset (~700 examples) with diverse contexts and information disclosure norms for training.
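
A reward in this spirit could be as simple as the following sketch; the set-based disclosure model and the penalty weight are illustrative assumptions, not the paper's reward:

```python
# Hedged sketch of a contextual-integrity reward; the set-based disclosure
# model and the 0.5 penalty weight are assumptions, not the paper's reward.
def ci_reward(task_done: bool, disclosed: set, allowed: set) -> float:
    """Reward task completion and penalize each attribute shared outside
    the context's disclosure norms."""
    leaked = disclosed - allowed                   # norm-violating disclosures
    return (1.0 if task_done else 0.0) - 0.5 * len(leaked)

# Example: the task succeeds but one out-of-norm attribute is revealed.
print(ci_reward(True, {"name", "ssn"}, {"name"}))  # 0.5
```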

Result: Method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Improvements transfer from synthetic dataset to established CI benchmarks like PrivacyLens with human annotations.

Conclusion: The RL framework effectively teaches AI agents contextual integrity reasoning, demonstrating that even with a small synthetic dataset, models can learn appropriate information disclosure norms that generalize to real-world privacy benchmarks.

Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) – what is the appropriate information to share while carrying out a certain task – becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only ~700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls. Our code is available at: https://github.com/EricGLan/CI-RL

[329] RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Jianhao Chen, Mayi Xu, Haoyang Chen, Xiaohu Li, Xiangyu Zhang, Jianjie Huang, Zheng Wang, Xiaochun Cao, Tieyun Qian

Main category: cs.AI

TL;DR: RAJ attack reveals LRMs can generate harmful reasoning chains despite benign final outputs; PGA framework creates safety datasets from these attacks to improve model safety while preserving reasoning capabilities.

Motivation: Large Reasoning Models have a unique safety vulnerability where their internal reasoning chains may produce harmful content even when final outputs appear safe. Current safety approaches overlook this reasoning-level risk.

Method: 1) Propose Reasoning-Activated Jailbreak (RAJ) attack via concretization - refining malicious prompts to trigger harmful reasoning chains. 2) Develop Principle-Guided Alignment (PGA) framework to construct safety datasets by transforming harmful reasoning traces into safe, constructive responses. 3) Create PGA dataset with 3,989 verified samples.

Result: Fine-tuning LRMs with PGA dataset achieves up to 29.5% improvement in defense success rates across multiple jailbreak benchmarks. The approach effectively defends against reasoning-based attacks while preserving and even enhancing general reasoning capabilities.

Conclusion: The work provides a scalable pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance by targeting reasoning-level vulnerabilities.

Abstract: Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model’s safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Then, we introduce the PGA dataset, a verified alignment dataset containing 3,989 samples using our proposed method. Extensive experiments show that fine-tuning LRMs with PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, even enhances, the model’s general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.

[330] Plan Verification for LLM-Based Embodied Task Completion Agents

Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur

Main category: cs.AI

TL;DR: LLM-based iterative verification framework improves noisy embodied AI task plans by having a Judge LLM critique actions and a Planner LLM apply revisions, achieving high precision/recall while preserving human error-recovery patterns.

Motivation: LLM-based task plans and human demonstrations for embodied AI often contain noise, unnecessary actions, redundant navigation, and logical errors that reduce policy quality, creating a need for automated plan verification and refinement.

Method: Proposes an iterative verification framework where a Judge LLM critiques action sequences and a Planner LLM applies revisions. Uses natural language prompting for broad generalization across error types (irrelevant actions, contradictions, missing steps). Iteratively refines trajectories to be cleaner and more spatially coherent.
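
The critique-and-revise loop can be summarized in a short sketch; `judge` and `planner` stand in for the two prompted LLMs, and the empty-critique stop rule is an assumption:

```python
# Minimal sketch of the Judge/Planner refinement loop (illustrative; the
# prompting details and the empty-critique stop rule are assumptions).
def refine_plan(plan, judge, planner, max_iters=3):
    """Alternate critique and revision until the Judge raises no issues;
    the paper reports 96.5% of sequences converge within three iterations."""
    for _ in range(max_iters):
        critique = judge(plan)           # natural-language list of plan errors
        if not critique:                 # empty critique: plan accepted
            break
        plan = planner(plan, critique)   # apply the Judge's suggested revisions
    return plan
```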

Result: Achieves up to 90% recall and 100% precision on manually annotated TEACh dataset actions across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). 96.5% of sequences require at most three iterations. Improves temporal efficiency and spatial action organization while preserving human error-recovery patterns.

Conclusion: Establishes plan verification as a reliable LLM capability for spatial planning and action refinement, providing a scalable path to higher-quality training data for imitation learning in embodied AI. The method supports future work on robust corrective behavior.

Abstract: Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.

[331] TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM

Peng Kuang, Xiangxiang Wang, Wentao Liu, Jian Dong, Kaidi Xu

Main category: cs.AI

TL;DR: TIM-PRM is a novel agentic framework that transforms multimodal verification from passive scoring to active, tool-augmented investigation to address visual hallucinations and logical inconsistencies in MLLMs.

Motivation: Current MLLMs suffer from visual hallucinations and logical inconsistencies that standard outcome-based supervision fails to address. Existing Process Reward Models (PRMs) have limitations as scalar scorers or generative critics that suffer from sycophancy, blindly validating flawed hypotheses rather than grounding them in visual reality.

Method: TIM-PRM is a Tool-Integrated Multimodal PRM framework that transforms verification into an active, tool-augmented investigation. It trains models to explicitly plan verification strategies and uses Independent Question Asking to query evidence via external tools, decoupling verification from reasoning context to eliminate confirmation bias. The method is instantiated by curating a high-quality dataset of tool-integrated verification trajectories.

Result: Extensive experiments on VisualProcessBench show that the 8B parameter TIM-PRM model surpasses existing open-source multimodal PRMs and significantly outperforms much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.

Conclusion: TIM-PRM successfully bridges the gap in multimodal verification by transforming it from passive classification to active investigation, effectively addressing visual hallucinations and logical inconsistencies in MLLMs while providing interpretable verification insights.

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performances in mathematical reasoning, yet they remain vulnerable to visual hallucinations and logical inconsistencies that standard outcome-based supervision fails to mitigate. While Process Reward Models (PRMs) promise step-by-step verification, current approaches typically operate as scalar scorers or generative critics that suffer from sycophancy, blindly validating the flawed hypotheses rather than grounding them in visual reality. To bridge this gap, we introduce TIM-PRM (Tool-Integrated Multimodal PRM), a novel agentic framework that transforms verification from a passive classification task into an active, tool-augmented investigation. TIM-PRM is trained to explicitly plan verification strategies and utilizes a mechanism of Independent Question Asking to query evidence via external tools, effectively decoupling verification from the reasoning context to eliminate confirmation bias. We instantiate this method by curating a high-quality dataset of tool-integrated verification trajectories. Extensive experiments on VisualProcessBench demonstrate that our 8B parameter model surpasses existing open-source multimodal PRMs, significantly outperforming much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.

[332] On measuring grounding and generalizing grounding problems

Daniel Quigley, Eric Maynard

Main category: cs.AI

TL;DR: The paper reframes the symbol grounding problem as an audit across multiple desiderata (authenticity, preservation, faithfulness, robustness, compositionality) rather than a binary judgment, and applies this framework to analyze different grounding modes and case studies.

Motivation: To move beyond the binary “grounded/not grounded” judgment and provide a more nuanced, systematic framework for evaluating symbol grounding that can be applied across different disciplines and approaches.

Method: Develops a multi-dimensional audit framework with five desiderata: authenticity (internal mechanisms acquired through learning/evolution), preservation (atomic meanings intact), faithfulness (correlational and etiological), robustness (graceful degradation), and compositionality (systematic building from parts). Applies this to four grounding modes and three case studies.

Result: Different approaches exhibit different strengths: model-theoretic semantics achieves exact composition but lacks etiological warrant; LLMs show correlational fit and local robustness for linguistic tasks but lack selection-for-success on world tasks; human language meets all desiderata under strong authenticity through evolutionary/developmental acquisition.

Conclusion: By operationalizing philosophical inquiry about representation, the framework provides a common language and technical tool for philosophers, computer scientists, linguists, and mathematicians to systematically investigate grounding and meaning across different approaches.

Abstract: The symbol grounding problem asks how tokens like cat can be about cats, as opposed to mere shapes manipulated in a calculus. We recast grounding from a binary judgment into an audit across desiderata, each indexed by an evaluation tuple (context, meaning type, threat model, reference distribution): authenticity (mechanisms reside inside the agent and, for strong claims, were acquired through learning or evolution); preservation (atomic meanings remain intact); faithfulness, both correlational (realized meanings match intended ones) and etiological (internal mechanisms causally contribute to success); robustness (graceful degradation under declared perturbations); compositionality (the whole is built systematically from the parts). We apply this framework to four grounding modes (symbolic; referential; vectorial; relational) and three case studies: model-theoretic semantics achieves exact composition but lacks etiological warrant; large language models show correlational fit and local robustness for linguistic tasks, yet lack selection-for-success on world tasks without grounded interaction; human language meets the desiderata under strong authenticity through evolutionary and developmental acquisition. By operationalizing a philosophical inquiry about representation, we equip philosophers of science, computer scientists, linguists, and mathematicians with a common language and technical framework for systematic investigation of grounding and meaning.

[333] DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization

Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin

Main category: cs.AI

TL;DR: DaGRPO improves GRPO by addressing training instability through sequence-level gradient rectification and off-policy data augmentation, achieving SOTA on reasoning benchmarks.

Motivation: GRPO suffers from training instability and poor sample efficiency due to lack of distinctiveness in on-policy rollouts: homogeneous samples cause gradient conflicts for routine queries, while scarce positive samples hinder optimization for hard queries.

Method: DaGRPO introduces two mechanisms: 1) Sequence-level Gradient Rectification uses fine-grained scoring to dynamically mask low-distinctiveness sample pairs to eliminate gradient conflicts; 2) Off-policy Data Augmentation introduces high-quality anchors to recover training signals for challenging tasks.
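
To illustrate mechanism (1), here is a toy sketch of masking low-distinctiveness rollouts inside a GRPO-style group advantage; the scoring and threshold are illustrative, and DaGRPO's actual scoring is finer-grained:

```python
# Toy sketch of masking low-distinctiveness samples in a GRPO-style update;
# the distinctiveness scores and threshold are illustrative assumptions.
from statistics import mean, pstdev

def masked_group_advantages(rewards, distinctiveness, min_d=0.1):
    """Standard group-normalized advantage, but zero out samples whose
    distinctiveness is too low to carry a useful (non-conflicting) gradient."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0               # guard against zero variance
    return [
        (r - mu) / sigma if d >= min_d else 0.0  # mask conflicting gradients
        for r, d in zip(rewards, distinctiveness)
    ]

print(masked_group_advantages([1.0, 1.0, 0.0, 1.0], [0.05, 0.4, 0.9, 0.3]))
```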

Result: Extensive experiments across 9 mathematical reasoning and OOD generalization benchmarks show DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving +4.7% average accuracy gain on math benchmarks and new SOTA performance.

Conclusion: DaGRPO effectively mitigates gradient explosion and accelerates emergence of long-chain reasoning capabilities by addressing the distinctiveness problem in GRPO training.

Abstract: The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness, thereby eradicating gradient conflicts at the source; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. Extensive experiments across 9 mathematical reasoning and out-of-distribution (OOD) generalization benchmarks demonstrate that DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving new state-of-the-art performance (e.g., a +4.7% average accuracy gain on math benchmarks). Furthermore, in-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.

[334] HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition

Wang Lu, Yao Zhu, Jindong Wang

Main category: cs.AI

TL;DR: HAROOD is a comprehensive benchmark for human activity recognition in out-of-distribution settings, covering 4 OOD scenarios, 6 datasets, 16 methods, and two model selection protocols to evaluate OOD algorithms for HAR.

Motivation: Current HAR research lacks comprehensive evaluation of out-of-distribution algorithms across realistic distribution shifts (cross-person, device, environment, time). Existing approaches only test in limited scenarios, leaving gaps in understanding OOD algorithm effectiveness for HAR.

Method: Proposed HAROOD benchmark with 4 defined OOD scenarios (cross-person, cross-position, cross-dataset, cross-time), built testbed covering 6 datasets, implemented 16 comparative methods using CNN-based and Transformer-based architectures, and established two model selection protocols.

Result: Extensive experiments revealed that no single OOD method consistently outperforms others across all scenarios, highlighting substantial opportunity for advancement in OOD-based HAR research.

Conclusion: HAROOD provides a comprehensive, modular benchmark to facilitate OOD-based HAR research, with findings showing current limitations and opportunities for improvement in handling distribution shifts.

Abstract: Sensor-based human activity recognition (HAR) mines activity patterns from the time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary to HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.

[335] A Geometric Theory of Cognition

Laha Ale

Main category: cs.AI

TL;DR: A unified geometric framework where diverse cognitive processes emerge from Riemannian gradient flow of a cognitive potential on a learned metric manifold.

Motivation: Current cognitive science explains different cognitive capacities through distinct computational theories, lacking a unified mathematical framework that can account for perception, memory, reasoning, and social inference within a single principled approach.

Method: Represent cognitive state as a point on a differentiable manifold with learned Riemannian metric encoding representational constraints, costs, and structural relations. Define scalar cognitive potential combining accuracy, parsimony, utility, and normative requirements. Cognition unfolds as Riemannian gradient flow of this potential.
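
In standard differential-geometric notation, such a flow takes the coordinate form below (our rendering of the generic construction; the paper's notation may differ), where x is the cognitive state, G the learned Riemannian metric, and Phi the scalar cognitive potential:

```latex
% Generic coordinate form of a Riemannian gradient flow (our notation):
% the Riemannian gradient of \Phi under metric G is G^{-1}\nabla\Phi.
\dot{x} \;=\; -\,G(x)^{-1}\,\nabla \Phi(x)
```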

Result: Dual-process effects (intuitive vs deliberative reasoning) emerge naturally from metric-induced anisotropies creating time-scale separations and geometric phase transitions, without needing modular architectures. Analytical conditions for different regimes derived and demonstrated through simulations of canonical cognitive tasks.

Conclusion: Establishes geometric foundation for cognition that unifies diverse psychological phenomena under single mathematical principle, suggesting guiding principles for developing more general and human-like AI systems.

Abstract: Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects–rapid intuitive responses and slower deliberative reasoning–emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.

[336] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai

Main category: cs.AI

TL;DR: SciEvalKit is a unified benchmarking toolkit for evaluating AI models across scientific disciplines, focusing on core scientific intelligence capabilities like multimodal reasoning, symbolic reasoning, and hypothesis generation.

Motivation: Current evaluation platforms are too general-purpose and lack focus on scientific intelligence. There’s a need for specialized benchmarking that captures authentic scientific challenges across diverse disciplines to properly assess AI models for science applications.

Method: The toolkit builds expert-grade scientific benchmarks from real-world, domain-specific datasets across six major scientific domains. It features a flexible, extensible evaluation pipeline supporting batch evaluation, custom model/dataset integration, and ensures transparent, reproducible results.

Result: SciEvalKit provides a standardized yet customizable infrastructure for benchmarking scientific foundation models and intelligent agents. It bridges capability-based evaluation with disciplinary diversity and is open-sourced for community-driven development.

Conclusion: SciEvalKit offers a comprehensive solution for evaluating AI models in scientific contexts, addressing the gap between general AI evaluation and specialized scientific intelligence assessment, with potential to accelerate progress in AI4Science.

Abstract: We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.

[337] Multimodal Fact-Checking: An Agent-based Approach

Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

Main category: cs.AI

TL;DR: RW-Post dataset provides real-world multimodal misinformation with annotated reasoning and evidence; AgentFact framework uses specialized agents to emulate human verification workflow for improved fact-checking accuracy and interpretability.

Motivation: Existing multimodal fact-checking systems have limitations in reasoning and evidence utilization due to the lack of comprehensive datasets with annotated reasoning processes and verifiable evidence for real-world misinformation.

Method: Introduces RW-Post dataset with real-world multimodal claims aligned with original social media posts, plus detailed reasoning and evidence extracted via LLM-assisted pipeline. Proposes AgentFact framework with five specialized agents for strategy planning, evidence retrieval, visual analysis, reasoning, and explanation generation in iterative workflow.
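
The iterative orchestration could be sketched as follows; the agent roles follow the description above, while the control flow and interfaces are our assumptions:

```python
# Hedged sketch of AgentFact-style orchestration; agent roles follow the
# paper's description, but the control flow and interfaces are assumptions.
def agentfact_verify(claim, planner, retriever, vision, reasoner, explainer,
                     max_rounds=3):
    """Alternate evidence search with task-aware filtering and reasoning."""
    evidence = []
    for _ in range(max_rounds):
        strategy = planner(claim, evidence)           # plan the next step
        evidence += retriever(claim, strategy)        # fetch candidate evidence
        visual = vision(claim)                        # analyze the claim's image
        verdict, done = reasoner(claim, evidence, visual)  # filter + reason
        if done:                                      # enough evidence gathered
            break
    return verdict, explainer(claim, evidence, verdict)
```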

Result: Extensive experiments show synergy between RW-Post dataset and AgentFact framework substantially improves both accuracy and interpretability of multimodal fact-checking compared to existing approaches.

Conclusion: The combination of high-quality explainable dataset (RW-Post) and agent-based framework (AgentFact) addresses key bottlenecks in multimodal fact-checking by providing comprehensive verification capabilities and human-like reasoning processes.

Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.

[338] InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

Yu Li, Tian Lan, Zhengling Qi

Main category: cs.AI

TL;DR: InSPO addresses DPO’s limitations by enabling LLMs to condition on alternative responses during training, unlocking self-reflection for better alignment without inference overhead.

Motivation: DPO and its variants have limitations: (1) optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), leading to parameterization artifacts rather than true preferences; (2) treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped.

Method: Proposes Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy that conditions on both context and alternative responses. This formulation is proven superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead.

Result: Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.

Conclusion: InSPO addresses fundamental limitations of DPO by enabling LLMs to leverage comparative information through self-reflection, resulting in more robust alignment without additional inference costs.

Abstract: Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior reflecting parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model’s capacity for intrinsic self-reflection untapped. To address these limitations, we propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioning on both context and alternative responses. We prove this formulation superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.

[339] CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations

Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang

Main category: cs.AI

TL;DR: CubeBench is a Rubik’s Cube-based benchmark that reveals LLM agents’ critical failure in spatial reasoning, long-horizon planning, and active exploration needed for physical-world deployment.

Motivation: LLM agents excel in digital domains but struggle with physical-world deployment due to inability to form robust spatial mental models, specifically lacking spatial reasoning, long-horizon state tracking, and active exploration capabilities.

Method: Introduces CubeBench, a generative benchmark using Rubik’s Cube with three-tiered diagnostic framework: 1) foundational state tracking with full symbolic info, 2) mental simulation, 3) active exploration with partial visual data. Also provides external solver tools to isolate cognitive bottlenecks.

Result: Leading LLMs show critical limitations with 0.00% pass rate on all long-horizon tasks, revealing fundamental failure in long-term planning. The diagnostic framework successfully isolates cognitive bottlenecks in spatial reasoning and mental simulation.

Conclusion: CubeBench exposes LLM agents’ severe limitations in physical-world cognitive capabilities. The analysis provides key insights for developing more physically-grounded intelligent agents by addressing spatial reasoning, long-horizon planning, and active exploration challenges.

Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.

[340] Physics-Informed Neural Networks for Device and Circuit Modeling: A Case Study of NeuroSPICE

Chien-Ting Tung, Chenming Hu

Main category: cs.AI

TL;DR: NeuroSPICE is a PINN-based circuit simulator that solves DAEs using neural networks instead of traditional numerical methods, offering advantages for design optimization and emerging device simulation.

Motivation: To overcome limitations of conventional SPICE's time-discretized numerical solvers and enable simulation of emerging devices like ferroelectric memories that have highly nonlinear characteristics.

Method: Uses physics-informed neural networks (PINNs) to solve circuit differential-algebraic equations by minimizing equation residuals through backpropagation, modeling waveforms with analytical equations in time domain with exact temporal derivatives.
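
For intuition, a PINN residual loss for a toy RC circuit, dv/dt = (V_in - v)/(R*C), can be written as below; this is a generic PyTorch sketch under our own assumptions, not NeuroSPICE's code:

```python
# Hedged PyTorch sketch of a PINN residual loss for a toy RC circuit,
# dv/dt = (V_in - v) / (R*C); generic illustration, not NeuroSPICE's code.
import torch

R, C, V_in = 1e3, 1e-6, 1.0                  # 1 kOhm, 1 uF, 1 V step input
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))

def residual_loss(t):
    t = t.requires_grad_(True)
    v = net(t)                               # candidate waveform v(t)
    dv_dt = torch.autograd.grad(v.sum(), t, create_graph=True)[0]  # exact derivative
    ode = dv_dt - (V_in - v) / (R * C)       # circuit-equation residual
    ic = net(torch.zeros(1, 1))              # initial condition v(0) = 0
    return (ode ** 2).mean() + (ic ** 2).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):                        # minimize residual via backprop
    opt.zero_grad()
    loss = residual_loss(torch.rand(64, 1) * 5 * R * C)  # sample t in [0, 5*tau]
    loss.backward()
    opt.step()
```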

Result: PINNs don’t outperform SPICE in speed or accuracy during training, but provide unique advantages for surrogate modeling, design optimization, and inverse problems, with flexibility for simulating emerging nonlinear devices.

Conclusion: NeuroSPICE offers a flexible PINN-based alternative to conventional SPICE that enables simulation of complex emerging devices and supports design optimization workflows, despite not matching SPICE’s raw performance metrics.

Abstract: We present NeuroSPICE, a physics-informed neural network (PINN) framework for device and circuit simulation. Unlike conventional SPICE, which relies on time-discretized numerical solvers, NeuroSPICE leverages PINNs to solve circuit differential-algebraic equations (DAEs) by minimizing the residual of the equations through backpropagation. It models device and circuit waveforms using analytical equations in time domain with exact temporal derivatives. While PINNs do not outperform SPICE in speed or accuracy during training, they offer unique advantages such as surrogate models for design optimization and inverse problems. NeuroSPICE’s flexibility enables the simulation of emerging devices, including highly nonlinear systems such as ferroelectric memories.

cs.SD

[341] Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack

Roee Ziv, Raz Lapid, Moshe Sipper

Main category: cs.SD

TL;DR: Universal targeted latent space attack manipulates audio encoder representations to control downstream language model outputs without accessing the LM, achieving high success with minimal distortion.

Motivation: Audio-language models introduce new security vulnerabilities, particularly at the encoder level where adversarial attacks can manipulate latent representations to control downstream language generation outputs.

Method: Proposes a universal targeted latent space attack that learns a single perturbation applicable across different inputs and speakers, operating at the encoder level without requiring access to the language model.
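
The optimization can be sketched generically as follows; the encoder interface, the L2 latent objective, and the clamped perturbation budget are our assumptions, not the paper's implementation:

```python
# Illustrative PyTorch sketch of learning a universal targeted perturbation
# against an audio encoder; interfaces and the L2 objective are assumptions.
import torch

def learn_universal_delta(encoder, waveforms, target_latent,
                          eps=0.01, steps=500, lr=1e-3):
    """Optimize one perturbation that steers every input's latent toward the
    attacker's target, without ever querying the downstream language model."""
    delta = torch.zeros_like(waveforms[0], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x = torch.stack([w + delta for w in waveforms])    # same delta for all
        loss = (encoder(x) - target_latent).pow(2).mean()  # latent-space target
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                        # keep it imperceptible
    return delta.detach()
```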

Result: Experiments on Qwen2-Audio-7B-Instruct show consistently high attack success rates with minimal perceptual distortion, demonstrating effectiveness across various inputs and speakers.

Conclusion: Reveals a critical and previously underexplored attack surface at the encoder level of multimodal audio-language systems, highlighting security vulnerabilities in current architectures.

Abstract: Audio-language models combine audio encoders with large language models to enable multimodal reasoning, but they also introduce new security vulnerabilities. We propose a universal targeted latent space attack, an encoder-level adversarial attack that manipulates audio latent representations to induce attacker-specified outputs in downstream language generation. Unlike prior waveform-level or input-specific attacks, our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model. Experiments on Qwen2-Audio-7B-Instruct demonstrate consistently high attack success rates with minimal perceptual distortion, revealing a critical and previously underexplored attack surface at the encoder level of multimodal systems.

[342] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Guanjie Huang, Pengfei Zhang, Kai Jiang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: PhyAVBench is a new benchmark for evaluating text-to-audio-video models’ understanding of audio physics principles, featuring 1,000 paired prompts across 6 physics dimensions and 4 scenarios.

Motivation: Existing T2AV models lack physical plausibility in generated sounds due to limited understanding of physical principles. Current benchmarks focus mainly on audio-video synchronization rather than physics grounding.

Method: Created PhyAVBench with 1,000 groups of paired text prompts controlling physical variables that implicitly induce sound variations. Covers 6 audio physics dimensions, 4 daily scenarios, and 50 fine-grained test points. Each prompt grounded by at least 20 newly recorded real-world videos to prevent data leakage.

Result: Developed a comprehensive benchmark (Audio-Physics Sensitivity Test) that systematically evaluates models’ sensitivity to changes in acoustic conditions, ranging from basic sound diffraction to complex phenomena like Helmholtz resonance.

Conclusion: Only models with genuine understanding of audio physics can generate physically consistent audio-visual content. PhyAVBench aims to stimulate progress in this largely unexplored domain of physically plausible T2AV generation.

Abstract: Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models’ sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models’ understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.

[343] AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Yanxi Chen, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Xin Li, Peijie Qiu, Hao Wang, Xuanzhao Dong, Yujian Xiong, Anderson Schneider, Yuriy Nevmyvaka, Yalin Wang

Main category: cs.SD

TL;DR: The paper introduces AHA framework to address hallucinations in Large Audio-Language Models by creating a preference dataset through counterfactual hard negative mining, resulting in Qwen-Audio-AHA model that shows significant improvements on both diagnostic and public benchmarks.

DetailsMotivation: Large Audio-Language Models suffer from hallucinations where they generate text not grounded in audio input, with specific failure types including Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error.

Method: Introduces AHA (Audio Hallucination Alignment) framework using counterfactual hard negative mining to construct high-quality preference dataset that forces models to distinguish acoustic evidence from linguistic fabrications. Also establishes AHA-Eval diagnostic benchmark for testing temporal reasoning capabilities.

Result: Qwen-Audio-AHA achieves a 13.7% improvement on the AHA-Eval diagnostic benchmark and shows substantial gains on public benchmarks: 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods.

Conclusion: The AHA framework effectively addresses audio grounding failures in LALMs through targeted preference alignment, with improvements that generalize beyond the diagnostic set to public benchmarks, demonstrating the value of fine-grained temporal reasoning training.

Abstract: Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g., generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods.
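
The abstract does not spell out the alignment objective applied to the preference pairs; direct preference optimization (DPO) is shown below as one representative choice, an assumption on our part. The "chosen" caption is the acoustically grounded one, the "rejected" caption the counterfactual hard negative.

```python
import torch
import torch.nn.functional as F

# Representative preference loss (DPO-style) over counterfactual hard
# negatives; the paper's actual objective is not given in the abstract.
def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """logp_*: summed token log-probs of the grounded (chosen) and
    counterfactual (rejected) captions under the policy / reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```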

[344] Environmental Sound Deepfake Detection Challenge: An Overview

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Main category: cs.SD

TL;DR: The paper introduces EnvSDD, a large-scale dataset for environmental sound deepfake detection, and presents results from the ESDD Challenge at ICASSP 2026.

DetailsMotivation: While audio generation models create realistic soundscapes for entertainment applications, they also enable misuse through deceptive audio in fabricated videos and misinformation. Current environmental sound deepfake detection datasets are limited in scale and diversity, creating a need for better detection methods.

Method: The authors created EnvSDD, the first large-scale curated dataset specifically for environmental sound deepfake detection. They then organized the ESDD Challenge as an ICASSP 2026 Grand Challenge to benchmark detection methods using this dataset.

Result: The paper presents an overview of the ESDD Challenge and provides detailed analysis of the challenge results, though specific performance metrics are not detailed in the abstract.

Conclusion: The EnvSDD dataset and associated challenge address critical gaps in environmental sound deepfake detection research, providing a foundation for developing more effective detection methods against audio misuse.

Abstract: Recent progress in audio generation models has made it possible to create highly realistic and immersive soundscapes, which are now widely used in film and virtual-reality-related applications. However, these audio generators also raise concerns about potential misuse, such as producing deceptive audio for fabricated videos or spreading misleading information. Therefore, it is essential to develop effective methods for detecting fake environmental sounds. Existing datasets for environmental sound deepfake detection (ESDD) remain limited in both scale and the diversity of sound categories they cover. To address this gap, we introduced EnvSDD, the first large-scale curated dataset designed for ESDD. Based on EnvSDD, we launched the ESDD Challenge, recognized as one of the ICASSP 2026 Grand Challenges. This paper presents an overview of the ESDD Challenge, including a detailed analysis of the challenge results.

[345] Structuring Concept Space with the Musical Circle of Fifths by Utilizing Music Grammar Based Activations

Tofara Moyo, Panashe Chiurunge

Main category: cs.SD

TL;DR: A neural coding framework called harmonic toroidal codes that implements cognitive operations using dynamical activity on manifolds derived from music theory structures.

DetailsMotivation: To create a novel neural coding framework that bridges music theory with cognitive neuroscience, allowing abstract cognitive operations to be implemented through structured dynamical systems inspired by musical structures.

Method: Proposes harmonic toroidal codes - a neural coding framework where cognitive operations are implemented through dynamical activity on manifolds derived from music theoretic structures. Uses toroidal geometry and harmonic relationships from music theory to create structured neural representations.

Result: A theoretical framework that shows how music-theoretic structures (like harmonic relationships and toroidal geometries) can be used to implement cognitive operations through dynamical systems on manifolds.

Conclusion: Music theory provides rich mathematical structures that can be leveraged to create novel neural coding frameworks for implementing cognitive operations through dynamical systems on manifolds, potentially offering new insights into both neuroscience and artificial intelligence.

Abstract: We propose a neural coding framework, harmonic toroidal codes, in which abstract cognitive operations are implemented through dynamical activity on manifolds derived from music-theoretic structures.

[346] AI-Driven Acoustic Voice Biomarker-Based Hierarchical Classification of Benign Laryngeal Voice Disorders from Sustained Vowels

Mohsen Annabestani, Samira Aghadoost, Anais Rameau, Olivier Elemento, Gloria Chia-Yi Chiang

Main category: cs.SD

TL;DR: Hierarchical ML framework for classifying 8 benign voice disorders using acoustic features from vowel phonations, outperforming flat classifiers and pre-trained models.

DetailsMotivation: Benign laryngeal voice disorders affect 1 in 5 people and serve as non-invasive indicators of broader physiological dysfunction, requiring automated classification tools for clinical screening and monitoring.

Method: Three-stage hierarchical framework: 1) binary pathological vs non-pathological classification using CNN mel-spectrogram features + 21 acoustic biomarkers; 2) stratification into Healthy, Functional/Psychogenic, Structural/Inflammatory groups using cubic SVM; 3) fine-grained classification incorporating probabilistic outputs from prior stages.

Result: Outperformed flat multi-class classifiers and pre-trained models (META HuBERT, Google HeAR) on 15,132 recordings from 1,261 speakers, improving discrimination of structural/inflammatory vs functional disorders.

Conclusion: Combining deep spectral representations with interpretable acoustic features enhances transparency and clinical alignment, highlighting voice biomarkers as scalable, non-invasive tools for early screening, diagnostic triage, and vocal health monitoring.

Abstract: Benign laryngeal voice disorders affect nearly one in five individuals and often manifest as dysphonia, while also serving as non-invasive indicators of broader physiological dysfunction. We introduce a clinically inspired hierarchical machine learning framework for automated classification of eight benign voice disorders alongside healthy controls, using acoustic features extracted from short, sustained vowel phonations. Experiments utilized 15,132 recordings from 1,261 speakers in the Saarbruecken Voice Database, covering vowels /a/, /i/, and /u/ at neutral, high, low, and gliding pitches. Mirroring clinical triage workflows, the framework operates in three sequential stages: Stage 1 performs binary screening of pathological versus non-pathological voices by integrating convolutional neural network-derived mel-spectrogram features with 21 interpretable acoustic biomarkers; Stage 2 stratifies voices into Healthy, Functional or Psychogenic, and Structural or Inflammatory groups using a cubic support vector machine; Stage 3 achieves fine-grained classification by incorporating probabilistic outputs from prior stages, improving discrimination of structural and inflammatory disorders relative to functional conditions. The proposed system consistently outperformed flat multi-class classifiers and pre-trained self-supervised models, including META HuBERT and Google HeAR, whose generic objectives are not optimized for sustained clinical phonation. By combining deep spectral representations with interpretable acoustic features, the framework enhances transparency and clinical alignment. These results highlight the potential of quantitative voice biomarkers as scalable, non-invasive tools for early screening, diagnostic triage, and longitudinal monitoring of vocal health.
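
A compact sketch of the three-stage triage follows. It assumes the fused features (CNN mel-spectrogram embeddings plus the 21 acoustic biomarkers) are already extracted into a matrix X; the stage-1 and stage-3 classifier choices are illustrative, while the cubic SVM in stage 2 matches the description above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Three-stage hierarchical classification mirroring clinical triage.
# Stage-1/stage-3 model choices are illustrative assumptions.
def fit_hierarchy(X, y_binary, y_group, y_fine):
    s1 = LogisticRegression(max_iter=1000).fit(X, y_binary)  # pathological vs not
    s2 = SVC(kernel="poly", degree=3, probability=True).fit(X, y_group)  # cubic SVM
    # Stage 3 consumes the features plus probabilistic outputs of stages 1-2.
    X3 = np.hstack([X, s1.predict_proba(X), s2.predict_proba(X)])
    s3 = LogisticRegression(max_iter=1000).fit(X3, y_fine)
    return s1, s2, s3

def predict_fine(s1, s2, s3, X):
    X3 = np.hstack([X, s1.predict_proba(X), s2.predict_proba(X)])
    return s3.predict(X3)
```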

[347] AudioFab: Building A General and Intelligent Audio Factory through Tool Learning

Cheng Zhu, Jing Han, Qianshuai Xue, Kehan Wang, Huan Zhao, Zixing Zhang

Main category: cs.SD

TL;DR: AudioFab is an open-source agent framework that creates a unified ecosystem for audio AI tools, solving fragmentation and configuration issues through modular design and intelligent tool collaboration.

DetailsMotivation: Current audio AI tools are fragmented with complex configurations and inefficient tool collaboration, lacking a unified framework to unlock their full potential.

Method: Modular design resolves dependency conflicts, intelligent tool selection with few-shot learning, and user-friendly natural language interface for non-experts.

Result: Provides a stable, extensible platform that simplifies tool integration, improves efficiency and accuracy in complex audio tasks, and enables easier audio AI development.

Conclusion: AudioFab establishes an open, intelligent audio-processing ecosystem as a foundational framework for future audio and multimodal AI research and development.

Abstract: Currently, artificial intelligence is profoundly transforming the audio domain; however, numerous advanced algorithms and tools remain fragmented, lacking a unified and efficient framework to unlock their full potential. Existing audio agent frameworks often suffer from complex environment configurations and inefficient tool collaboration. To address these limitations, we introduce AudioFab, an open-source agent framework aimed at establishing an open and intelligent audio-processing ecosystem. Compared to existing solutions, AudioFab’s modular design resolves dependency conflicts, simplifying tool integration and extension. It also optimizes tool learning through intelligent selection and few-shot learning, improving efficiency and accuracy in complex audio tasks. Furthermore, AudioFab provides a user-friendly natural language interface tailored for non-expert users. As a foundational framework, AudioFab’s core contribution lies in offering a stable and extensible platform for future research and development in audio and multimodal AI. The code is available at https://github.com/SmileHnu/AudioFab.

[348] Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

Euiyeon Kim, Yong-Hoon Choi

Main category: cs.SD

TL;DR: New Mamba2-based model for vocal isolation outperforms Transformers by better capturing long-range temporal dependencies and intermittent vocals, achieving state-of-the-art cSDR of 11.03 dB.

DetailsMotivation: Transformer-based models often fail to capture intermittently occurring vocals in music source separation. There's a need for models that can better handle long-range temporal dependencies in audio processing.

Method: Uses Mamba2 (state space model) instead of Transformers, combined with band-splitting strategy and dual-path architecture to efficiently handle long input sequences and capture temporal dependencies.

Result: Achieves state-of-the-art performance with cSDR of 11.03 dB (best reported), substantial gains in uSDR, and stable performance across varying input lengths and vocal occurrence patterns.

Conclusion: Mamba-based models are effective for high-resolution audio processing, demonstrating superiority over Transformers for vocal isolation and opening new directions for broader audio research applications.

Abstract: We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
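
The band-split plus dual-path idea can be sketched independently of the sequence model: alternate processing along time within each frequency band, then across bands at each time step. In the sketch below a GRU stands in for the Mamba2 block (availability of mamba-ssm varies), and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Dual-path block over a band-split representation x: (batch, bands, time, dim).
# A GRU substitutes for the paper's Mamba2 layer; sizes are illustrative.
class DualPathBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.time_rnn = nn.GRU(dim, dim, batch_first=True)
        self.band_rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.band_proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        b, k, t, d = x.shape
        # Path 1: long-range temporal structure within each band.
        h, _ = self.time_rnn(x.reshape(b * k, t, d))
        x = x + h.reshape(b, k, t, d)
        # Path 2: interactions across bands at each time step.
        xb = x.permute(0, 2, 1, 3).reshape(b * t, k, d)
        h, _ = self.band_rnn(xb)
        x = x + self.band_proj(h).reshape(b, t, k, d).permute(0, 2, 1, 3)
        return x
```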

[349] SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models

Yuan-Kuei Wu, Yang Liu, Yiteng Huang, Zhaojun Yang, Haibin Wu, Ruizhe Huang, Yi-Te Hsu, Shuyu Kong, Ming Sun, Florian Metze, Li Wan

Main category: cs.SD

TL;DR: First test-time adaptation framework for generative spoken language models that updates a small subset of parameters during inference using only incoming utterances, improving robustness to acoustic variations without degrading core task performance.

DetailsMotivation: Spoken Language Models degrade under real-world acoustic shifts (noise, reverberation, microphone variation), and existing domain adaptation solutions are post-hoc, data-intensive, and slow, limiting practical deployment.

Method: Test-time adaptation framework that updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. It stabilizes token distributions and improves robustness to acoustic variability.

Result: Consistent performance gains across automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench under diverse corruptions. Adaptation is compute- and memory-efficient since only a small fraction of weights are updated.

Conclusion: The framework enhances robustness and adaptability of generative SLMs for real-world speech-driven applications, supporting deployment on resource-constrained platforms through efficient test-time adaptation.

Abstract: Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but performance degrades under acoustic shift: real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-intensive, and slow. We introduce the first test-time adaptation (TTA) framework for generative SLMs that process interleaved audio-text prompts. Our method updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. This stabilizes token distributions and improves robustness to acoustic variability without degrading core task accuracy. Evaluated on automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench, our approach yields consistent gains under diverse corruptions. Because adaptation touches only a small fraction of weights, it is both compute- and memory-efficient, supporting deployment on resource-constrained platforms. This work enhances the robustness and adaptability of generative SLMs for real-world speech-driven applications.
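
The abstract does not name the adapted parameters or the loss. The sketch below is a TENT-style stand-in: update only LayerNorm affine weights on the incoming utterance with an entropy objective, assuming a HuggingFace-style model interface. All of these choices are assumptions for illustration.

```python
import torch

# Single-utterance test-time adaptation sketch (TENT-style assumptions).
def adapt_and_decode(model, inputs, steps=3, lr=1e-4):
    model.requires_grad_(False)
    params = [p for n, p in model.named_parameters()
              if "layer_norm" in n or ".ln" in n]   # small, targeted subset
    for p in params:
        p.requires_grad_(True)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):                          # uses only this utterance
        logits = model(**inputs).logits             # (batch, seq, vocab)
        probs = logits.softmax(-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        opt.zero_grad(); entropy.backward(); opt.step()  # stabilize token dists
    with torch.no_grad():
        return model.generate(**inputs)             # decode with adapted weights
```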

[350] STSR: High-Fidelity Speech Super-Resolution via Spectral-Transient Context Modeling

Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv

Main category: cs.SD

TL;DR: STSR is a speech super-resolution framework that operates in the MDCT domain, using spectral-contextual attention and sparse-aware regularization to achieve high-fidelity wideband speech reconstruction with real-time capability.

DetailsMotivation: Existing approaches face trade-offs: diffusion models offer high fidelity but are computationally expensive, while efficient time-domain architectures lack explicit frequency representations needed for capturing long-range spectral dependencies and harmonic alignment.

Method: STSR operates in the MDCT domain with two key components: 1) Spectral-Contextual Attention using hierarchical windowing to adaptively aggregate non-local spectral context for harmonic reconstruction up to 48 kHz, and 2) sparse-aware regularization to preserve transient components that are typically suppressed in compressed spectral representations.

Result: STSR consistently outperforms state-of-the-art baselines in both perceptual fidelity and zero-shot generalization, providing a robust, real-time paradigm for high-quality speech restoration.

Conclusion: STSR offers a unified end-to-end framework that reconciles computational efficiency with high-quality harmonic reconstruction and transient preservation, enabling practical deployment of speech super-resolution systems.

Abstract: Speech super-resolution (SR) reconstructs high-fidelity wideband speech from low-resolution inputs, a task that necessitates reconciling global harmonic coherence with local transient sharpness. While diffusion-based generative models yield impressive fidelity, their practical deployment is often stymied by prohibitive computational demands. Conversely, efficient time-domain architectures lack the explicit frequency representations essential for capturing long-range spectral dependencies and ensuring precise harmonic alignment. We introduce STSR, a unified end-to-end framework formulated in the MDCT domain to circumvent these limitations. STSR employs a Spectral-Contextual Attention mechanism that harnesses hierarchical windowing to adaptively aggregate non-local spectral context, enabling consistent harmonic reconstruction up to 48 kHz. Concurrently, a sparse-aware regularization strategy is employed to mitigate the suppression of transient components inherent in compressed spectral representations. STSR consistently outperforms state-of-the-art baselines in both perceptual fidelity and zero-shot generalization, providing a robust, real-time paradigm for high-quality speech restoration.
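
For readers unfamiliar with the transform domain STSR operates in, here is a minimal reference MDCT. The sine window and framing details are generic textbook choices, not STSR's exact front end.

```python
import numpy as np

# Minimal forward MDCT: 50%-overlapped frames of length 2N -> N coefficients.
# Window and framing are generic assumptions, not STSR's implementation.
def mdct(x, N=512):
    win = np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))  # sine window
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    frames = [x[i:i + 2 * N] * win
              for i in range(0, len(x) - 2 * N + 1, N)]
    return np.stack([basis @ f for f in frames])              # (num_frames, N)
```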

[351] Audio Super-Resolution with Latent Bridge Models

Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu

Main category: cs.SD

TL;DR: Latent Bridge Models (LBMs) for audio super-resolution achieve state-of-the-art quality by compressing audio to latent space and enabling any-to-any upsampling with frequency-aware training and cascaded architectures.

DetailsMotivation: Previous audio SR methods using diffusion and bridge models suffer from sub-optimal quality due to uninformative generation priors. There's a need for better exploitation of LR waveform information and support for higher sampling rates beyond 48kHz.

Method: Compress audio to continuous latent space, design Latent Bridge Models (LBMs) for latent-to-latent generation matching LR-to-HR process. Introduce frequency-aware LBMs that take prior and target frequencies as input for any-to-any upsampling. Design cascaded LBMs with prior augmentation strategies for seamless SR beyond 48kHz.

Result: Achieves state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music on VCTK, ESC-50, Song-Describer benchmarks. Sets first record for any-to-192kHz audio SR.

Conclusion: LBMs effectively exploit LR waveform information through latent-to-latent generation, with frequency-aware training enabling flexible any-to-any upsampling. Cascaded architectures with prior augmentation unlock audio SR beyond 48kHz, providing higher flexibility for audio post-production.

Abstract: Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.

[352] Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation

Shuyang Liu, Yuan Jin, Rui Lin, Shizhe Chen, Junyu Dai, Tao Jiang

Main category: cs.SD

TL;DR: HEAR is a music aesthetic evaluation framework that uses multi-scale representations, hierarchical augmentation, and hybrid training to achieve state-of-the-art performance on the ICASSP 2026 SongEval benchmark.

DetailsMotivation: Music aesthetic evaluation is challenging due to the multidimensional nature of musical perception and the scarcity of labeled data, creating a need for robust evaluation frameworks.

Method: HEAR combines: (1) multi-source multi-scale representations for segment- and track-level features, (2) hierarchical augmentation to prevent overfitting, and (3) hybrid training with regression and ranking losses for accurate scoring and top-tier song identification.

Result: HEAR consistently outperforms baselines across all metrics on both tracks of the ICASSP 2026 SongEval benchmark.

Conclusion: HEAR provides an effective framework for music aesthetic evaluation, with code and trained models publicly available for further research and application.

Abstract: Evaluating song aesthetics is challenging due to the multidimensional nature of musical perception and the scarcity of labeled data. We propose HEAR, a robust music aesthetic evaluation framework that combines: (1) a multi-source multi-scale representations module to obtain complementary segment- and track-level features, (2) a hierarchical augmentation strategy to mitigate overfitting, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-tier song identification. Experiments demonstrate that HEAR consistently outperforms the baseline across all metrics on both tracks of the ICASSP 2026 SongEval benchmark. The code and trained model weights are available at https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.
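
The hybrid objective can be illustrated directly: a regression term for accurate scores plus a pairwise ranking term for reliable top-tier identification. The margin, weighting, and pair construction below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hybrid regression + pairwise ranking loss over a batch of song scores.
# margin and lam are illustrative hyperparameters.
def hybrid_loss(pred, target, margin=0.1, lam=0.5):
    reg = F.mse_loss(pred, target)
    # All ordered pairs (i, j) where song i is truly rated above song j.
    diff_true = target[:, None] - target[None, :]
    diff_pred = pred[:, None] - pred[None, :]
    mask = diff_true > 0
    rank = (F.relu(margin - diff_pred[mask]).mean()
            if mask.any() else pred.new_zeros(()))
    return reg + lam * rank
```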

[353] AUDRON: A Deep Learning Framework with Fused Acoustic Signatures for Drone Type Recognition

Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee, Deepanjali Mishra

Main category: cs.SD

TL;DR: AUDRON is a hybrid deep learning framework that uses acoustic sensing with MFCC, STFT spectrograms, CNNs, recurrent layers, and autoencoders to detect drones from their distinctive propeller sounds, achieving over 97% accuracy in classification tasks.

DetailsMotivation: As UAVs become more prevalent across various domains, their misuse raises safety and security concerns. Acoustic sensing provides a low-cost, non-intrusive alternative to vision or radar-based detection since drone propellers generate unique sound patterns that can be detected even when visual or radar sensing is limited.

Method: AUDRON combines multiple acoustic feature representations: Mel-Frequency Cepstral Coefficients (MFCC) and Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs) for spatial feature extraction, recurrent layers for temporal modeling, and autoencoder-based representations. The framework uses feature-level fusion to integrate complementary information before final classification.

Result: AUDRON achieves 98.51% accuracy in binary classification and 97.11% accuracy in multiclass classification, effectively differentiating drone acoustic signatures from background noise while maintaining generalizability across varying conditions.

Conclusion: The hybrid approach combining multiple feature representations with deep learning provides reliable acoustic drone detection, demonstrating potential for deployment in security and surveillance applications where traditional sensing methods may be limited.

Abstract: Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework’s potential for deployment in security and surveillance applications where visual or radar sensing may be limited.
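
The front end can be sketched with standard tooling: MFCC and STFT spectrogram features fused at the feature level by concatenation. Dimensions and time-pooling are illustrative; the CNN, recurrent, and autoencoder branches described above are omitted here.

```python
import numpy as np
import librosa

# MFCC + log-STFT feature extraction with feature-level fusion.
# n_mfcc, n_fft, and mean-pooling are illustrative assumptions.
def extract_fused_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)          # (20, frames)
    stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))  # (513, frames)
    log_stft = librosa.amplitude_to_db(stft)
    # Pool over time so clips of different lengths give fixed-size vectors.
    return np.concatenate([mfcc.mean(axis=1), log_stft.mean(axis=1)])
```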

[354] Distilled HuBERT for Mobile Speech Emotion Recognition: A Cross-Corpus Validation Study

Saifelden M. Ismail

Main category: cs.SD

TL;DR: Mobile-efficient Speech Emotion Recognition using distilled and quantized DistilHuBERT achieves 92% parameter reduction vs Wav2Vec 2.0 while maintaining competitive accuracy, enabling practical deployment on resource-constrained devices.

DetailsMotivation: Transformer-based SER models have high computational demands that constrain mobile deployment. There's a need for efficient models that maintain accuracy while being suitable for resource-constrained mobile devices.

Method: Uses DistilHuBERT (distilled and 8-bit quantized transformer) with 5-fold LOSO cross-validation on IEMOCAP for speaker independence, augmented with cross-corpus training on CREMA-D for generalization enhancement.

Result: Achieves 61.4% Unweighted Accuracy with only 23MB model footprint (91% of full-scale baseline accuracy). Cross-corpus training improves WA by 1.2%, Macro F1 by 1.4%, reduces variance by 32%. RAVDESS evaluation shows 46.64% accuracy but robust arousal detection.

Conclusion: Demonstrates Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on mobile devices. Theatrical nature of acted emotions causes predictions to cluster by arousal level rather than specific categories.

Abstract: Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of the Unweighted Accuracy of a full-scale baseline. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than by specific emotion categories: happiness predictions systematically bleed into anger predictions, and sadness predictions bleed into neutral predictions, due to acoustic saturation when actors prioritize clarity over subtlety. Despite this theatricality effect reducing overall RAVDESS accuracy to 46.64%, the model maintains robust arousal detection with 99% recall for anger, 55% recall for neutral, and 27% recall for sadness. These findings demonstrate a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
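
The compression recipe (distilled backbone plus 8-bit quantization) maps to a short PyTorch sketch. The checkpoint id below is the commonly used public DistilHuBERT release and the dynamic-quantization scheme is our assumption; the paper's exact setup may differ.

```python
import torch
from transformers import AutoModel

# Distilled backbone + 8-bit dynamic quantization of its linear layers.
# Checkpoint id and quantization scheme are assumptions for illustration.
backbone = AutoModel.from_pretrained("ntu-spml/distilhubert")
quantized = torch.ao.quantization.quantize_dynamic(
    backbone, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "distilhubert_int8.pt")  # ~4x smaller linears
```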

cs.LG

[355] Zero-Trust Agentic Federated Learning for Secure IIoT Defense Systems

Samaresh Kumar Singh, Joyjit Roy, Martin So

Main category: cs.LG

TL;DR: ZTA-FL is a zero-trust federated learning framework for IIoT security that combines TPM-based attestation, explainable Byzantine detection, and privacy-preserving adversarial training to achieve high detection accuracy and robustness against attacks.

DetailsMotivation: Recent attacks on critical infrastructure (2021 Oldsmar water treatment, 2023 Danish energy sector) reveal urgent security gaps in Industrial IoT deployments. Existing federated learning frameworks for intrusion detection are vulnerable to Byzantine poisoning attacks and lack robust agent authentication.

Method: Proposes Zero-Trust Agentic Federated Learning (ZTA-FL) with three key components: 1) TPM-based cryptographic attestation with extremely low false acceptance rate, 2) SHAP-weighted aggregation algorithm for explainable Byzantine detection under non-IID conditions, and 3) privacy-preserving on-device adversarial training.

Result: Achieves 97.8% detection accuracy, 93.2% accuracy under 30% Byzantine attacks (outperforming FLAME by 3.1%, p<0.01), 89.3% adversarial robustness, and reduces communication overhead by 34% across three IDS benchmarks (Edge-IIoTset, CIC-IDS2017, UNSW-NB15).

Conclusion: ZTA-FL provides a comprehensive defense-in-depth framework for IIoT security with strong theoretical guarantees, practical performance improvements, and includes code release for reproducibility. The framework addresses critical vulnerabilities in existing FL-based intrusion detection systems.

Abstract: Recent attacks on critical infrastructure, including the 2021 Oldsmar water treatment breach and 2023 Danish energy sector compromises, highlight urgent security gaps in Industrial IoT (IIoT) deployments. While Federated Learning (FL) enables privacy-preserving collaborative intrusion detection, existing frameworks remain vulnerable to Byzantine poisoning attacks and lack robust agent authentication. We propose Zero-Trust Agentic Federated Learning (ZTA-FL), a defense-in-depth framework combining: (1) TPM-based cryptographic attestation achieving a false acceptance rate below 0.0000001, (2) a novel SHAP-weighted aggregation algorithm providing explainable Byzantine detection under non-IID conditions with theoretical guarantees, and (3) privacy-preserving on-device adversarial training. Comprehensive experiments across three IDS benchmarks (Edge-IIoTset, CIC-IDS2017, UNSW-NB15) demonstrate that ZTA-FL achieves 97.8 percent detection accuracy, 93.2 percent accuracy under 30 percent Byzantine attacks (outperforming FLAME by 3.1 percent, p less than 0.01), and 89.3 percent adversarial robustness while reducing communication overhead by 34 percent. We provide theoretical analysis, failure mode characterization, and release code for reproducibility.
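
The abstract does not specify how SHAP values translate into aggregation weights. The sketch below is one plausible reading, stated as an assumption: down-weight clients whose SHAP attribution profiles on a shared validation set deviate from the median, then do trust-weighted averaging.

```python
import numpy as np

# Byzantine-robust weighted aggregation in the spirit of a SHAP-weighted
# rule. The deviation-from-median trust score is an assumption, not ZTA-FL's
# published algorithm.
def aggregate(updates, shap_profiles, temp=1.0):
    """updates: (n_clients, n_params) model deltas; shap_profiles:
    (n_clients, n_features) per-client mean |SHAP| attribution vectors."""
    median = np.median(shap_profiles, axis=0)
    dev = np.linalg.norm(shap_profiles - median, axis=1)  # outlier-ness
    w = np.exp(-dev / temp)
    w /= w.sum()
    return (w[:, None] * updates).sum(axis=0)             # trust-weighted FedAvg
```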

[356] Network Traffic Analysis with Process Mining: The UPSIDE Case Study

Francesco Vitale, Paolo Palmiero, Massimiliano Rak, Nicola Mazzocca

Main category: cs.LG

TL;DR: Process mining method analyzes gaming network traffic to characterize states, encode them as Petri nets, and classify different video games being played.

DetailsMotivation: Online gaming generates large market revenue and complex network traffic, requiring methods to evaluate bandwidth consumption, predict loads, and detect malicious activity. Process mining offers data-driven analysis with model-based insights.

Method: Proposes a process mining-based method that: 1) performs unsupervised characterization of different states from gaming network data, 2) encodes states through process mining into interpretable Petri nets, and 3) classifies gaming network traffic to identify different video games.

Result: Applied to UPSIDE case study with Clash Royale and Rocket League data. Achieved 94.02% inter-device similarity (coherence), 174.99% inter-state separation (specificity), and 73.84% AUC for classifying the two video games.

Conclusion: Gaming network behavior can be effectively and interpretably modeled through states represented as Petri nets with good coherence, specificity, and classification accuracy for identifying different video games.

Abstract: Online gaming is a popular activity involving the adoption of complex systems and network infrastructures. The relevance of gaming, which generates large amounts of market revenue, drove research in modeling network devices’ behavior to evaluate bandwidth consumption, predict and sustain high loads, and detect malicious activity. In this context, process mining appears promising due to its ability to combine data-driven analyses with model-based insights. In this paper, we propose a process mining-based method that analyzes gaming network traffic, allowing: unsupervised characterization of different states from gaming network data; encoding such states through process mining into interpretable Petri nets; and classification of gaming network traffic data to identify different video games being played. We apply the method to the UPSIDE case study, involving gaming network data of several devices interacting with two video games: Clash Royale and Rocket League. Results demonstrate that the gaming network behavior can be effectively and interpretably modeled through states represented as Petri nets with sufficient coherence (94.02% inter-device similarity) and specificity (174.99% inter-state separation) while maintaining a good classification accuracy of the two different video games (73.84% AUC).

[357] A Comprehensive Study of Deep Learning Model Fixing Approaches

Hanmo You, Zan Wang, Zishuo Dong, Luanqi Mo, Jianjun Zhao, Junjie Chen

Main category: cs.LG

TL;DR: Large-scale empirical study of 16 state-of-the-art DL model fixing approaches across model, layer, and neuron levels, evaluating fixing effectiveness and impacts on robustness, fairness, and backward compatibility.

DetailsMotivation: DL systems are prone to faults like traditional software, and their malfunctioning poses significant risks to users in critical domains like autonomous driving and healthcare. Many fixing approaches exist but need comprehensive evaluation.

Method: Conducted large-scale empirical study of 16 DL model fixing approaches categorized as model-level, layer-level, and neuron-level. Used diverse datasets, model architectures, and application domains within uniform experimental setup to evaluate fixing effectiveness and impacts on robustness, fairness, and backward compatibility.

Result: Model-level approaches show superior fixing effectiveness compared to others. No single approach achieves best fixing performance while improving accuracy and maintaining all other properties simultaneously.

Conclusion: Academia should prioritize research on mitigating side effects of DL model fixing approaches. The findings highlight promising directions for future exploration in DL model repair, emphasizing the need for approaches that balance effectiveness with maintaining other critical properties.

Abstract: Deep Learning (DL) has been widely adopted in diverse industrial domains, including autonomous driving, intelligent healthcare, and aided programming. Like traditional software, DL systems are also prone to faults, whose malfunctioning may expose users to significant risks. Consequently, numerous approaches have been proposed to address these issues. In this paper, we conduct a large-scale empirical study on 16 state-of-the-art DL model fixing approaches, spanning model-level, layer-level, and neuron-level categories, to comprehensively evaluate their performance. We assess not only their fixing effectiveness (their primary purpose) but also their impact on other critical properties, such as robustness, fairness, and backward compatibility. To ensure comprehensive and fair evaluation, we employ a diverse set of datasets, model architectures, and application domains within a uniform experimental setup for experimentation. We summarize several key findings with implications for both industry and academia. For example, model-level approaches demonstrate superior fixing effectiveness compared to others. No single approach can achieve the best fixing performance while improving accuracy and maintaining all other properties. Thus, academia should prioritize research on mitigating these side effects. These insights highlight promising directions for future exploration in this field.

[358] A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios

Haley Rosso, Talea Mayo

Main category: cs.LG

TL;DR: This paper reviews diffusion models for simulation-based inference (SBI), covering mathematical foundations, comparison with normalizing flows, robustness in scientific applications, and open problems.

DetailsMotivation: Many scientific problems have intractable likelihood functions, making classical inference methods infeasible. Simulation-based inference (SBI) methods bypass explicit likelihoods by using simulator samples, and diffusion models offer a promising flexible framework for SBI tasks.

Method: The review examines diffusion models from first principles: forward noising processes, reverse-time SDE/ODE, probability flow, and denoising score matching. It explains how conditional scores enable likelihood-free posterior sampling and compares diffusion models with normalizing flows for neural posterior/likelihood estimation.

Result: The review synthesizes methods including Schrodinger-bridge formulations, conditional/sequential posterior samplers, amortized architectures for unstructured data, and inference-time prior adaptation. It emphasizes diffusion-based SBI’s robustness in non-ideal scientific conditions like model misspecification, unstructured/infinite-dimensional observations, and missing data.

Conclusion: Diffusion models provide a robust framework for simulation-based inference, particularly for scientific applications with challenging data conditions. The review identifies open problems and suggests applications in uncertainty quantification for probabilistic geophysical models that could benefit from diffusion-based SBI approaches.

Abstract: For complex simulation problems, inferring parameters of scientific interest often precludes the use of classical likelihood-based techniques due to intractable likelihood functions. Simulation-based inference (SBI) methods forego the need for explicit likelihoods by directly utilizing samples from the simulator to learn posterior distributions over parameters $\boldsymbol{\theta}$ given observed data $\mathbf{x}_{\text{o}}$. Recent work has brought attention to diffusion models – a type of generative model rooted in score matching and reverse-time stochastic dynamics – as a flexible framework for SBI tasks. This article reviews diffusion-based SBI from first principles to applications in practice. We first recall the mathematical foundations of diffusion modeling (forward noising, reverse-time SDE/ODE, probability flow, and denoising score matching) and explain how conditional scores enable likelihood-free posterior sampling. We then examine where diffusion models address pain points of normalizing flows in neural posterior/likelihood estimation and where they introduce new trade-offs (e.g., iterative sampling costs). The key theme of this review is robustness of diffusion-based SBI in non-ideal conditions common to scientific data: misspecification (mismatch between simulated training data and reality), unstructured or infinite-dimensional observations, and missingness. We synthesize methods spanning foundations drawing from Schrodinger-bridge formulations, conditional and sequential posterior samplers, amortized architectures for unstructured data, and inference-time prior adaptation. Throughout, we adopt consistent notation and emphasize conditions and caveats required for accurate posteriors. The review closes with a discussion of open problems with an eye toward applications of uncertainty quantification for probabilistic geophysical models that may benefit from diffusion-based SBI.
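
The denoising score matching objective the review recalls, specialized to SBI (diffuse the parameters, condition the score network on the data), can be written in a few lines. The noise schedule and the score_net interface are generic assumptions, not specific to any one surveyed method.

```python
import torch

# Conditional denoising score matching for SBI: the network learns
# s(theta_t | x, t) ≈ -eps / sigma_t, the score of the perturbation kernel.
def dsm_loss(score_net, theta, x):
    """theta: simulator parameters; x: corresponding simulated data."""
    t = torch.rand(theta.shape[0], device=theta.device)  # diffusion time in (0, 1)
    sigma = (0.01 ** (1 - t)).view(-1, 1)                # VE-style schedule, an assumption
    eps = torch.randn_like(theta)
    theta_t = theta + sigma * eps
    score = score_net(theta_t, t, x)                     # conditioning on x gives the posterior score
    return ((sigma * score + eps) ** 2).mean()
```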

[359] Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang

Main category: cs.LG

TL;DR: Skip-BART is an end-to-end generative model that predicts human-like stage lighting from audio music, treating automatic stage lighting control as a generation task rather than classification.

DetailsMotivation: Existing ASLC solutions are limited to classifying music into categories and mapping to predefined light patterns, resulting in formulaic outcomes. There's a need for more rational, vivid lighting that learns from professional lighting engineers.

Method: Adapts BART model to take audio music as input and output light hue/intensity, incorporates novel skip connections to enhance music-light relationships, creates first stage lighting dataset, and uses pre-training/transfer learning techniques for limited data.

Result: Skip-BART outperforms conventional rule-based methods across all metrics, shows limited gap compared to real lighting engineers, and the dataset/code/model are publicly available.

Conclusion: This work successfully frames ASLC as a generative task, demonstrates the effectiveness of Skip-BART for human-like stage lighting prediction, and provides valuable resources for future research in this domain.

Abstract: Stage lighting is a vital component in live music performances, shaping an engaging experience for both musicians and audiences. In recent years, Automatic Stage Lighting Control (ASLC) has attracted growing interest due to the high costs of hiring or training professional lighting engineers. However, most existing ASLC solutions only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this gap, this paper presents Skip-BART, an end-to-end model that directly learns from experienced lighting engineers and predicts vivid, human-like stage lighting. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method adapts the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame grid. To address the lack of available datasets, we create the first stage lighting dataset and introduce several pre-training and transfer learning techniques to improve model training with limited data. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. To support further research, we have made our self-collected dataset, code, and trained model parameters available at https://github.com/RS2002/Skip-BART .

[360] Coordinate Matrix Machine: A Human-level Concept Learning to Classify Very Similar Documents

Amin Sadri, M Maruf Hossain

Main category: cs.LG

TL;DR: CM² is a Green AI model that achieves human-level one-shot learning by focusing on document structure rather than semantics, outperforming traditional methods with minimal data and compute.

DetailsMotivation: Humans learn concepts from single examples while ML requires hundreds of samples. Current AI trends rely on massive pre-training and energy-intensive infrastructure, creating a need for efficient, human-like learning approaches.

Method: Coordinate Matrix Machine (CM²) focuses on document structural coordinates rather than semantic vectors. It identifies important structural features that humans would consider, enabling one-shot learning by analyzing document organization patterns.

Result: CM² outperforms traditional vectorizers and complex deep learning models, achieving high accuracy with minimal data (one sample per class). It works efficiently on CPU-only environments with low latency and inherent explainability.

Conclusion: CM² demonstrates that human-level concept learning is achievable through structural intelligence rather than massive data/compute, offering a sustainable Green AI alternative to current Red AI trends.

Abstract: Human-level concept learning argues that humans typically learn new concepts from a single example, whereas machine learning algorithms typically require hundreds of samples to learn a single concept. Our brain subconsciously identifies important features and learns more effectively.

Contribution: In this paper, we present the Coordinate Matrix Machine (CM$^2$). This purpose-built small model augments human intelligence by learning document structures and using this information to classify documents. While modern “Red AI” trends rely on massive pre-training and energy-intensive GPU infrastructure, CM$^2$ is designed as a Green AI solution. It achieves human-level concept learning by identifying only the structural “important features” a human would consider, allowing it to classify very similar documents using only one sample per class.

Advantage: Our algorithm outperforms traditional vectorizers and complex deep learning models that require larger datasets and significant compute. By focusing on structural coordinates rather than exhaustive semantic vectors, CM$^2$ offers:

1. High accuracy with minimal data (one-shot learning)
2. Geometric and structural intelligence
3. Green AI and environmental sustainability
4. Optimized for CPU-only environments
5. Inherent explainability (glass-box model)
6. Faster computation and low latency
7. Robustness against unbalanced classes
8. Economic viability
9. Generic, expandable, and extendable

[361] Geometric Scaling of Bayesian Inference in LLMs

Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Modern language models preserve geometric structures enabling Bayesian inference, with value representations organizing along an entropy-aligned axis that correlates with predictive uncertainty.

DetailsMotivation: To investigate whether geometric signatures observed in small "wind-tunnel" transformers (low-dimensional value manifolds and orthogonal keys that encode posterior structure) persist in production-grade language models.

Method: Analyzed Pythia, Phi-2, Llama-3, and Mistral families to examine value representations; performed targeted interventions on entropy-aligned axis in Pythia-410M during in-context learning to probe geometry’s role.

Result: Found that last-layer value representations organize along a single dominant axis correlating with predictive entropy; domain-restricted prompts collapse this structure into low-dimensional manifolds; interventions disrupting the entropy-aligned axis affect the uncertainty geometry but do not proportionally degrade Bayesian behavior.

Conclusion: Modern language models preserve geometric substrate enabling Bayesian inference, organizing approximate Bayesian updates along this substrate, with geometry serving as privileged readout of uncertainty rather than singular computational bottleneck.

Abstract: Recent work has shown that small transformers trained in controlled “wind-tunnel” settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate – low-dimensional value manifolds and progressively orthogonal keys – that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

[362] Generalized Regularized Evidential Deep Learning Models: Theory and Comprehensive Evaluation

Deep Shankar Pandey, Hyomin Choi, Qi Yu

Main category: cs.LG

TL;DR: The paper analyzes learning-freeze behavior in Evidential Deep Learning (EDL) models due to activation function constraints, proposes a general family of activation functions with regularizers to address this, and validates the approach on multiple classification and restoration tasks.

DetailsMotivation: EDL models provide efficient uncertainty quantification but are constrained by Subjective Logic's requirement for non-negative evidence. This leads to activation-dependent learning-freeze behavior where gradients become extremely small for samples in low-evidence regions, hindering learning dynamics.

Method: The authors theoretically characterize the learning-freeze behavior and analyze how different evidential activations influence learning dynamics. They then design a general family of activation functions and corresponding evidential regularizers that enable consistent evidence updates across activation regimes.

Result: Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet), two few-shot classification problems, and a blind face restoration problem validate the theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.

Conclusion: The proposed approach addresses the learning-freeze problem in EDL models through carefully designed activation functions and regularizers, enabling more stable and effective uncertainty-aware learning across diverse applications.

Abstract: Evidential deep learning (EDL) models, based on Subjective Logic, introduce a principled and computationally efficient way to make deterministic neural networks uncertainty-aware. The resulting evidential models can quantify fine-grained uncertainty using learned evidence. However, the Subjective-Logic framework constrains evidence to be non-negative, requiring specific activation functions whose geometric properties can induce activation-dependent learning-freeze behavior: a regime where gradients become extremely small for samples mapped into low-evidence regions. We theoretically characterize this behavior and analyze how different evidential activations influence learning dynamics. Building on this analysis, we design a general family of activation functions and corresponding evidential regularizers that provide an alternative pathway for consistent evidence updates across activation regimes. Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet), two few-shot classification problems, and a blind face restoration problem empirically validate the developed theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.
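
For context, here is the standard EDL head the paper generalizes: a non-negative activation (softplus, shown here) produces evidence, which parameterizes a Dirichlet distribution; softplus saturation in low-evidence regions is exactly where the learning-freeze behavior described above arises. This baseline omits the paper's proposed activations and regularizers.

```python
import torch
import torch.nn.functional as F

# Baseline evidential classifier head (Subjective Logic / Dirichlet form).
# Softplus keeps evidence non-negative; its saturation illustrates the
# learning-freeze regime the paper analyzes.
def evidential_outputs(logits):
    evidence = F.softplus(logits)              # e_k >= 0
    alpha = evidence + 1.0                     # Dirichlet concentrations
    strength = alpha.sum(-1, keepdim=True)
    prob = alpha / strength                    # expected class probabilities
    uncertainty = logits.shape[-1] / strength  # vacuity: K / sum(alpha)
    return prob, uncertainty
```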

[363] HINTS: Extraction of Human Insights from Time-Series Without External Sources

Sheo Yon Jhin, Noseong Park

Main category: cs.LG

TL;DR: HINTS is a self-supervised framework that extracts latent human factors from time series residuals without external data, using opinion dynamics as inductive bias to improve forecasting accuracy.

DetailsMotivation: Human decision-making, emotions, and collective psychology shape financial/economic systems, but current approaches rely on expensive external data sources (news, social media) with high financial, computational, and practical costs.

Method: HINTS uses self-supervised learning to extract latent human factors endogenously from time series residuals. It leverages the Friedkin-Johnsen opinion dynamics model as structural inductive bias to model evolving social influence, memory, and bias patterns, integrating extracted factors into a state-of-the-art backbone model as an attention map.

Result: Experiments on nine real-world and benchmark datasets show HINTS consistently improves forecasting accuracy. Case studies and ablation studies validate interpretability, demonstrating strong semantic alignment between extracted factors and real-world events.

Conclusion: HINTS provides a practical, cost-effective alternative to external data-dependent approaches by endogenously capturing human factors from time series residuals, improving both forecasting accuracy and interpretability.

Abstract: Human decision-making, emotions, and collective psychology are complex factors that shape the temporal dynamics observed in financial and economic systems. Many recent time series forecasting models leverage external sources (e.g., news and social media) to capture human factors, but these approaches incur high data dependency costs in terms of financial, computational, and practical implications. In this study, we propose HINTS, a self-supervised learning framework that extracts these latent factors endogenously from time series residuals without external data. HINTS leverages the Friedkin-Johnsen (FJ) opinion dynamics model as a structural inductive bias to model evolving social influence, memory, and bias patterns. The extracted human factors are integrated into a state-of-the-art backbone model as an attention map. Experimental results using nine real-world and benchmark datasets demonstrate that HINTS consistently improves forecasting accuracy. Furthermore, multiple case studies and ablation studies validate the interpretability of HINTS, demonstrating strong semantic alignment between the extracted factors and real-world events and underscoring the framework's practical utility.
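
For readers unfamiliar with the structural prior, here is a minimal sketch of one Friedkin-Johnsen iteration; the matrix shapes and values are illustrative, not the paper's parameterization.

```python
# One Friedkin-Johnsen update: agents mix social influence with innate opinions.
import numpy as np

def fj_step(x, x0, W, lam):
    """x: current opinions (n,); x0: innate opinions (n,);
    W: row-stochastic influence matrix (n, n); lam: susceptibility in [0, 1] (n,)."""
    return lam * (W @ x) + (1.0 - lam) * x0

rng = np.random.default_rng(0)
n = 5
W = rng.random((n, n)); W /= W.sum(axis=1, keepdims=True)  # row-stochastic
x0 = rng.standard_normal(n)
lam = np.full(n, 0.8)
x = x0.copy()
for _ in range(50):                # iterate toward the FJ equilibrium
    x = fj_step(x, x0, W, lam)
```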

[364] Learning Coupled System Dynamics under Incomplete Physical Constraints and Missing Data

Esha Saha, Hao Wang

Main category: cs.LG

TL;DR: MUSIC is a sparsity-induced multitask neural network framework that integrates partial physical constraints with data-driven learning to recover full-dimensional solutions of coupled systems when physics-constrained and data-informed variables are mutually exclusive.

DetailsMotivation: Complex systems are often described by coupled variables, but governing equations are typically available for only one variable while others can only be accessed through data. This mismatch between known physics and observed data poses a fundamental challenge for existing physics-informed machine learning approaches that assume either complete governing equations or full data availability across all variables.

Method: MUSIC uses sparsity-induced multitask neural networks that integrate partial physical constraints with data-driven learning. It employs mesh-free random sampling of training data and sparsity regularization to yield highly compressed models with improved training and evaluation efficiency.

Result: MUSIC accurately learns solutions to complex coupled systems (shock wave solutions, discontinuous solutions, pattern formation solutions) under data-scarce and noisy conditions, consistently outperforming non-sparse formulations.

Conclusion: MUSIC is a flexible and effective approach for modeling partially observed systems with incomplete physical knowledge, addressing the fundamental challenge of mismatched physics and data availability in coupled systems.

Abstract: Advances in data acquisition and computational methods have accelerated the use of differential equation based modelling for complex systems. Such systems are often described by two or more coupled variables, yet a governing equation is typically available for only one variable, while the remaining variables can be accessed only through data. This mismatch between known physics and observed data poses a fundamental challenge for existing physics-informed machine learning approaches, which generally assume either complete knowledge of the governing equations or full data availability across all variables. In this paper, we introduce MUSIC (Multitask Learning Under Sparse and Incomplete Constraints), a sparsity induced multitask neural network framework that integrates partial physical constraints with data-driven learning to recover full-dimensional solutions of coupled systems when physics-constrained and data-informed variables are mutually exclusive. MUSIC employs mesh-free (random) sampling of training data and sparsity regularization, yielding highly compressed models with improved training and evaluation efficiency. We demonstrate that MUSIC accurately learns solutions (shock wave solutions, discontinuous solutions, pattern formation solutions) to complex coupled systems under data-scarce and noisy conditions, consistently outperforming non-sparse formulations. These results highlight MUSIC as a flexible and effective approach for modeling partially observed systems with incomplete physical knowledge.
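
The sketch below illustrates the general shape of such an objective under toy assumptions: one network predicts both variables, the physics-constrained variable is fit to a made-up residual (du/dt = -v), the data-informed variable is fit to observations, and an L1 penalty induces sparsity. It is not the paper's implementation.

```python
# Toy MUSIC-style objective: physics residual on u, data loss on v, L1 sparsity.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))

def music_loss(t_colloc, t_obs, v_obs, l1_weight=1e-4):
    t_colloc = t_colloc.requires_grad_(True)
    out = net(t_colloc)
    u, v = out[:, :1], out[:, 1:]                 # physics- vs. data-informed variable
    du_dt = torch.autograd.grad(u.sum(), t_colloc, create_graph=True)[0]
    physics = ((du_dt + v) ** 2).mean()           # residual of toy relation du/dt = -v
    data = ((net(t_obs)[:, 1:] - v_obs) ** 2).mean()
    sparsity = sum(p.abs().sum() for p in net.parameters())
    return physics + data + l1_weight * sparsity

t_c = torch.rand(256, 1)                          # mesh-free random collocation points
t_o = torch.rand(32, 1); v_o = torch.sin(t_o)     # synthetic observations of v only
loss = music_loss(t_c, t_o, v_o)
loss.backward()
```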

[365] Federated Multi-Task Clustering

Suyan Dai, Gan Sun, Fazeng Li, Xu Tang, Qianqian Wang, Yang Cong

Main category: cs.LG

TL;DR: FMTC is a federated multi-task clustering framework that learns personalized models for heterogeneous clients while capturing shared knowledge via tensor low-rank regularization, eliminating unreliable pseudo-labels.

DetailsMotivation: Existing spectral clustering models are centralized and inapplicable to decentralized settings. Current federated learning approaches suffer from poor generalization due to unreliable pseudo-labels and fail to capture correlations among heterogeneous clients.

Method: Two-component framework: 1) Client-side personalized clustering module learns parameterized mapping for robust out-of-sample inference without pseudo-labels; 2) Server-side tensorial correlation module organizes client models into a tensor with low-rank regularization to discover common subspace. Solved via ADMM-based privacy-preserving distributed algorithm.

Result: Extensive experiments on multiple real-world datasets show FMTC significantly outperforms baseline and state-of-the-art federated clustering algorithms.

Conclusion: FMTC effectively addresses federated clustering challenges by learning personalized models while capturing shared structure, achieving superior performance without relying on unreliable pseudo-labels.

Abstract: Spectral clustering has emerged as one of the most effective clustering algorithms due to its superior performance. However, most existing models are designed for centralized settings, rendering them inapplicable in modern decentralized environments. Moreover, current federated learning approaches often suffer from poor generalization performance due to reliance on unreliable pseudo-labels, and fail to capture the latent correlations amongst heterogeneous clients. To tackle these limitations, this paper proposes a novel framework named Federated Multi-Task Clustering (i.e., FMTC), which intends to learn personalized clustering models for heterogeneous clients while collaboratively leveraging their shared underlying structure in a privacy-preserving manner. More specifically, the FMTC framework is composed of two main components: a client-side personalized clustering module, which learns a parameterized mapping model to support robust out-of-sample inference, bypassing the need for unreliable pseudo-labels; and a server-side tensorial correlation module, which explicitly captures the shared knowledge across all clients. This is achieved by organizing all client models into a unified tensor and applying a low-rank regularization to discover their common subspace. To solve this joint optimization problem, we derive an efficient, privacy-preserving distributed algorithm based on the Alternating Direction Method of Multipliers, which decomposes the global problem into parallel local updates on clients and an aggregation step on the server. Finally, extensive experiments on multiple real-world datasets demonstrate that our proposed FMTC framework significantly outperforms various baseline and state-of-the-art federated clustering algorithms.
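
A minimal sketch of the server-side idea, assuming each client's model is summarized by a single matrix: stack the matrices into a 3-way tensor and penalize the nuclear norms of its unfoldings, a standard surrogate for tensor low-rankness. The paper's exact regularizer and ADMM solver are not reproduced here.

```python
# Tensor low-rank surrogate: sum of nuclear norms over mode unfoldings.
import torch

def tensor_low_rank_penalty(client_mats):
    T = torch.stack(client_mats, dim=0)            # shape (clients, d, k)
    penalty = torch.zeros(())
    for mode in range(3):
        unfold = T.movedim(mode, 0).reshape(T.shape[mode], -1)
        penalty = penalty + torch.linalg.matrix_norm(unfold, ord='nuc')
    return penalty

mats = [torch.randn(16, 4) for _ in range(8)]      # one mapping matrix per client
reg = tensor_low_rank_penalty(mats)                # add to the joint objective
```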

[366] Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?

Zijian Zhao, Sen Li

Main category: cs.LG

TL;DR: Triple-BERT: A centralized single-agent reinforcement learning method using BERT-based networks for large-scale ride-sharing order dispatching, achieving 11.95% improvement over SOTA with better order service and reduced pickup times.

DetailsMotivation: Ride-sharing platforms face complex real-time order dispatching with large observation spaces. Existing MARL approaches either lack global coordination (independent MARL) or suffer from dimensionality issues (CTDE MARL), necessitating a scalable centralized solution.

Method: Centralized single-agent RL based on TD3 variant with action decomposition strategy to handle large action space, and BERT-based network with parameter reuse and attention mechanisms to manage extensive observation space and capture complex driver-order relationships.

Result: 11.95% improvement over state-of-the-art methods, with 4.26% increase in served orders and 22.25% reduction in pickup times on real-world Manhattan ride-hailing dataset.

Conclusion: Triple-BERT effectively addresses scalability challenges in ride-sharing order dispatching through centralized RL with BERT-based architecture, demonstrating superior performance over existing MARL approaches.

Abstract: On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers, each with distinct origins and destinations, to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized single-agent reinforcement learning method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a TD3 variant, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of drivers and orders. We validate our method using a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at the repository https://github.com/RS2002/Triple-BERT .

[367] Drift-Based Dataset Stability Benchmark

Dominik Soukup, Richard Plný, Daniel Vašata, Tomáš Čejka

Main category: cs.LG

TL;DR: A framework for evaluating dataset stability in network traffic classification using concept drift detection enhanced with ML feature weights, demonstrated on CESNET-TLS-Year22 dataset.

DetailsMotivation: ML models for network traffic classification degrade quickly due to data/concept drift from evolving networks and protocols. Current practice involves complete retraining without investigating root causes, assuming good dataset quality, which is not always valid.

Method: Proposes a novel methodology for evaluating dataset stability and a benchmark workflow. The framework uses concept drift detection enhanced with ML feature weights to boost detection performance.

Result: Demonstrated on CESNET-TLS-Year22 dataset, providing initial dataset stability benchmark to describe stability and identify weak points for optimization. Shows optimization impact on created dataset variants using the proposed benchmarking methodology.

Conclusion: The proposed framework enables systematic evaluation of dataset stability in network traffic classification, helping identify root causes of model degradation and guiding dataset optimization efforts.

Abstract: Machine learning (ML) represents an efficient and popular approach for network traffic classification. However, network traffic classification is a challenging domain, and trained models may degrade soon after deployment due to obsolete datasets and the quick evolution of computer networks as new or updated protocols appear. Moreover, a significant change in the behavior of a traffic type (and, therefore, in the underlying features representing the traffic) can produce a large and sudden performance drop in the deployed model, known as data or concept drift. In most cases, complete retraining is performed, often without further investigation of root causes, as good dataset quality is assumed. However, this is not always the case, and further investigation must be performed. This paper proposes a novel methodology for evaluating the stability of datasets and a benchmark workflow that can be used to compare datasets. The proposed framework is based on a concept drift detection method that also uses ML feature weights to boost detection performance. The benefits of this work are demonstrated on the CESNET-TLS-Year22 dataset. We provide an initial dataset stability benchmark that describes the dataset's stability and weak points, identifying next steps for optimization. Lastly, using the proposed benchmarking methodology, we show the impact of optimization on the created dataset variants.
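
A minimal sketch of feature-weighted drift scoring in this spirit, assuming per-feature Kolmogorov-Smirnov statistics and externally supplied feature importances; the paper's actual detector and weighting scheme are not reproduced.

```python
# Feature-weighted drift score: KS statistic per feature, weighted by importance.
import numpy as np
from scipy.stats import ks_2samp

def weighted_drift_score(ref, cur, importances):
    """ref, cur: arrays of shape (n_samples, n_features)."""
    stats = np.array([ks_2samp(ref[:, j], cur[:, j]).statistic
                      for j in range(ref.shape[1])])
    w = importances / importances.sum()
    return float((w * stats).sum())                # in [0, 1]; higher = more drift

rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 5))
cur = rng.normal(loc=[0, 0, 0.5, 0, 0], size=(1000, 5))   # drift in feature 2
score = weighted_drift_score(ref, cur, np.array([1, 1, 3, 1, 1.0]))
```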

[368] Neural Optimal Design of Experiment for Inverse Problems

John E. Darges, Babak Maboudi Afkham, Matthias Chung

Main category: cs.LG

TL;DR: NODE is a learning-based framework for optimal experimental design that jointly trains neural reconstruction models with continuous design variables, avoiding traditional bilevel optimization and sparsity regularization.

DetailsMotivation: To overcome limitations of classical optimal experimental design approaches that use bilevel optimization and indirect sparsity regularization, which require complex tuning and have high computational complexity.

Method: Jointly trains a neural reconstruction model and continuous design variables (sensor locations, sampling times, measurement angles) in a single optimization loop, optimizing measurement locations directly rather than weighting dense candidate grids.

Result: NODE outperforms baseline approaches across three cases: analytical exponential growth benchmark, MNIST image sampling, and real-world sparse view X-ray CT, demonstrating improved reconstruction accuracy and task-specific performance.

Conclusion: The proposed Neural Optimal Design of Experiments framework provides an effective alternative to classical approaches by enforcing sparsity by design, eliminating l1 tuning, and reducing computational complexity while improving performance.

Abstract: We introduce Neural Optimal Design of Experiments (NODE), a learning-based framework for optimal experimental design in inverse problems that avoids classical bilevel optimization and indirect sparsity regularization. NODE jointly trains a neural reconstruction model and a fixed-budget set of continuous design variables representing sensor locations, sampling times, or measurement angles, within a single optimization loop. By optimizing measurement locations directly rather than weighting a dense grid of candidates, the proposed approach enforces sparsity by design, eliminates the need for l1 tuning, and substantially reduces computational complexity. We validate NODE on an analytically tractable exponential-growth benchmark and on MNIST image sampling, and illustrate its effectiveness on a real-world sparse-view X-ray CT example. In all cases, NODE outperforms baseline approaches, demonstrating improved reconstruction accuracy and task-specific performance.
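
A minimal sketch of the joint design-and-reconstruction idea on the exponential-growth benchmark: measurement times are trainable parameters optimized together with a reconstruction network. The shapes, noise level, and training loop are illustrative assumptions.

```python
# Jointly learn where to measure and how to reconstruct (toy exponential growth).
import torch
import torch.nn as nn

n_sensors = 4
times = nn.Parameter(torch.linspace(0.1, 0.9, n_sensors))  # continuous design variables
recon = nn.Sequential(nn.Linear(n_sensors, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam([times, *recon.parameters()], lr=1e-2)

for _ in range(200):
    theta = torch.rand(128, 1) * 2.0               # sample unknown growth rates
    y = torch.exp(theta * times)                   # measurements at the learned times
    y = y + 0.01 * torch.randn_like(y)             # measurement noise
    loss = ((recon(y) - theta) ** 2).mean()        # inverse-problem reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()   # updates both net and design
    # (box constraints keeping `times` in [0, 1] are omitted for brevity)
```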

[369] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

Main category: cs.LG

TL;DR: KernelEvolve is an agentic kernel coding framework that automates kernel generation and optimization for DLRM across heterogeneous hardware, reducing development time from weeks to hours while achieving substantial performance improvements.

DetailsMotivation: Deep learning recommendation models face three key system challenges: model architecture diversity, kernel primitive diversity, and hardware heterogeneity. These challenges make DLRM training and inference optimization difficult across different hardware platforms.

Method: KernelEvolve uses a graph-based search approach with selection policy, universal operator, fitness function, and termination rule. It operates at multiple programming abstractions (Triton, CuTe DSL to low-level hardware-agnostic languages) and dynamically adapts through retrieval-augmented prompt synthesis.

Result: Achieved 100% pass rate on all 250 KernelBench problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms with 100% correctness. Reduced development time from weeks to hours and achieved substantial performance improvements over PyTorch baselines.

Conclusion: KernelEvolve effectively addresses hardware heterogeneity challenges for DLRM, significantly reduces development time, improves performance, and mitigates programmability barriers for new AI hardware through automated kernel generation.

Abstract: Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.

[370] Exploring Cumulative Effects in Survival Data Using Deep Learning Networks

Kang-Chung Yang, Shinsheng Yuan

Main category: cs.LG

TL;DR: CENNSurv is a deep learning method for survival analysis that models cumulative effects of time-dependent exposures with better scalability and interpretability than existing approaches.

DetailsMotivation: Existing methods have limitations: conventional spline-based approaches require repeated data transformation and struggle with large datasets, while neural network methods focus on accuracy but lack interpretability of cumulative exposure patterns.

Method: CENNSurv is a novel deep learning approach that captures dynamic risk relationships from time-dependent data, designed to model complex temporal patterns while maintaining interpretability.

Result: On two real-world datasets, CENNSurv revealed a multi-year lagged association between chronic environmental exposure and survival outcomes, and identified critical short-term behavioral shifts prior to subscription lapse.

Conclusion: CENNSurv provides researchers with a practical tool for studying cumulative effects that offers improved scalability and interpretable insights into complex temporal patterns.

Abstract: In epidemiological research, modeling the cumulative effects of time-dependent exposures on survival outcomes presents a challenge due to their intricate temporal dynamics. Conventional spline-based statistical methods, though effective, require the data to be re-transformed whenever spline parameters are tuned, and their survival analysis computations rely on the entire dataset, posing difficulties for large datasets. Meanwhile, existing neural network-based survival analysis methods focus on accuracy but often overlook the interpretability of cumulative exposure patterns. To bridge this gap, we introduce CENNSurv, a novel deep learning approach that captures dynamic risk relationships from time-dependent data. Evaluated on two diverse real-world datasets, CENNSurv revealed a multi-year lagged association between chronic environmental exposure and a critical survival outcome, as well as a critical short-term behavioral shift prior to subscription lapse. This demonstrates CENNSurv’s ability to model complex temporal patterns with improved scalability. CENNSurv provides researchers studying cumulative effects a practical tool with interpretable insights.

[371] A Granular Grassmannian Clustering Framework via the Schubert Variety of Best Fit

Karim Salta, Michael Kirby, Chris Peterson

Main category: cs.LG

TL;DR: SVBF-LBG: A subspace clustering algorithm using Schubert Variety prototypes instead of subspace means, improving cluster purity while maintaining geometric structure.

DetailsMotivation: Traditional subspace clustering uses geometric representatives like means/medians on Grassmann/flag manifolds, but these may not optimally represent clusters. Need better prototypes that preserve mathematical structure while improving clustering performance.

Method: Introduces Schubert Variety of Best Fit (SVBF) as trainable prototype - a subspace that aims to intersect each cluster member in at least one fixed direction. Integrates SVBF into Linde-Buzo-Grey (LBG) pipeline for subspace clustering.

Result: SVBF-LBG achieves improved cluster purity on synthetic, image, spectral, and video action datasets compared to traditional methods, while retaining mathematical structure needed for downstream analysis.

Conclusion: SVBF prototypes provide superior geometric representatives for subspace clustering, offering both performance improvements and mathematical interpretability for subsequent analysis tasks.

Abstract: In many classification and clustering tasks, it is useful to compute a geometric representative for a dataset or a cluster, such as a mean or median. When datasets are represented by subspaces, these representatives become points on the Grassmann or flag manifold, with distances induced by their geometry, often via principal angles. We introduce a subspace clustering algorithm that replaces subspace means with a trainable prototype defined as a Schubert Variety of Best Fit (SVBF) - a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated in the Linde-Buzo-Grey (LBG) pipeline, this SVBF-LBG scheme yields improved cluster purity on synthetic, image, spectral, and video action data, while retaining the mathematical structure required for downstream analysis.
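
For context, the geometric primitive behind such subspace distances, principal angles between subspaces, can be computed as below; the SVBF prototype itself is the paper's contribution and is not reproduced here.

```python
# Principal angles between two subspaces via SVD of the basis product.
import numpy as np

def principal_angles(A, B):
    """A, B: matrices whose columns span the two subspaces."""
    Qa, _ = np.linalg.qr(A)                        # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)                        # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False) # cosines of principal angles
    return np.arccos(np.clip(s, -1.0, 1.0))        # angles in [0, pi/2]

rng = np.random.default_rng(0)
angles = principal_angles(rng.standard_normal((10, 3)), rng.standard_normal((10, 3)))
dist = np.linalg.norm(angles)                      # a Grassmannian-style distance
```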

[372] Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics

Bin Xu, Ayan Banerjee, Sandeep Gupta

Main category: cs.LG

TL;DR: MERINDA is an FPGA-accelerated framework for model recovery that achieves 114× lower energy, 28× smaller memory, and 1.68× faster training than GPU implementations while maintaining accuracy.

DetailsMotivation: Physical AI at the edge requires hardware-efficient learning for autonomous systems to understand real-world dynamics in real time. Current model recovery methods rely on Neural ODE formulations that are computationally expensive and difficult to accelerate on edge hardware with strict latency, compute, and power constraints.

Method: MERINDA replaces expensive Neural ODE components with a hardware-friendly formulation combining: (1) GRU-based discretized dynamics, (2) dense inverse-ODE layers, (3) sparsity-driven dropout, and (4) lightweight ODE solvers. The computation is structured for streaming parallelism to enable full parallelization of critical kernels on FPGAs.

Result: Across four benchmark nonlinear dynamical systems, MERINDA delivers 114× lower energy (434J vs. 49,375J), 28× smaller memory footprint (214MB vs. 6,118MB), and 1.68× faster training compared to GPU implementations, while matching state-of-the-art model recovery accuracy.

Conclusion: MERINDA demonstrates that accurate, explainable model recovery can be brought to the edge for real-time monitoring of autonomous systems, making physical AI practical on resource-constrained devices.

Abstract: Physical AI at the edge – enabling autonomous systems to understand and predict real-world dynamics in real time – requires hardware-efficient learning and inference. Model recovery (MR), which identifies governing equations from sensor data, is a key primitive for safe and explainable monitoring in mission-critical autonomous systems operating under strict latency, compute, and power constraints. However, state-of-the-art MR methods (e.g., EMILY and PINN+SR) rely on Neural ODE formulations that require iterative solvers and are difficult to accelerate efficiently on edge hardware. We present \textbf{MERINDA} (Model Recovery in Reconfigurable Dynamic Architecture), an FPGA-accelerated MR framework designed to make physical AI practical on resource-constrained devices. MERINDA replaces expensive Neural ODE components with a hardware-friendly formulation that combines (i) GRU-based discretized dynamics, (ii) dense inverse-ODE layers, (iii) sparsity-driven dropout, and (iv) lightweight ODE solvers. The resulting computation is structured for streaming parallelism, enabling critical kernels to be fully parallelized on the FPGA. Across four benchmark nonlinear dynamical systems, MERINDA delivers substantial gains over GPU implementations: \textbf{114$\times$ lower energy} (434J vs.\ 49{,}375J), \textbf{28$\times$ smaller memory footprint} (214MB vs.\ 6{,}118MB), and \textbf{1.68$\times$ faster training}, while matching state-of-the-art model-recovery accuracy. These results demonstrate that MERINDA can bring accurate, explainable MR to the edge for real-time monitoring of autonomous systems.

[373] Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions

Ankit Kanwar, Dominik Wagner, Luke Ong

Main category: cs.LG

TL;DR: SB-TRPO is a trust-region RL algorithm that balances reward maximization with strict safety constraints by adaptively biasing policy updates toward constraint satisfaction while maintaining reward improvement.

DetailsMotivation: Current RL methods for safety-critical domains struggle to achieve both near-zero safety violations and good reward performance when dealing with hard constraints. Lagrangian methods often fail to ensure safety, while projection-based methods sacrifice reward performance.

Method: SB-TRPO performs trust-region updates using a convex combination of natural policy gradients for both cost (safety) and reward. This ensures a fixed fraction of optimal cost reduction at each step while still seeking reward improvement.

Result: Experiments on Safety Gymnasium tasks show SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.

Conclusion: SB-TRPO provides a theoretically-grounded approach for hard-constrained RL that effectively balances safety and performance, with theoretical guarantees of local progress toward safety and reward improvement when gradients are aligned.

Abstract: Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints. Existing approaches, such as Lagrangian and projection-based methods, often either fail to ensure near-zero safety violations or sacrifice reward performance in the face of hard constraints. We propose Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL. SB-TRPO adaptively biases policy updates towards constraint satisfaction while still seeking reward improvement. Concretely, it performs trust-region updates using a convex combination of the natural policy gradients of cost and reward, ensuring a fixed fraction of optimal cost reduction at each step. We provide a theoretical guarantee of local progress towards safety, with reward improvement when gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
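
A schematic of the update direction only, with plain gradients standing in for the natural policy gradients the paper uses; the trust-region machinery and bias schedule are simplified away.

```python
# Safety-biased direction: convex combination of cost-descent and reward-ascent.
import torch

def safety_biased_direction(g_reward, g_cost, beta, radius):
    """beta in [0, 1] biases the step toward constraint satisfaction."""
    d = beta * (-g_cost) + (1.0 - beta) * g_reward   # combined ascent direction
    return radius * d / (d.norm() + 1e-8)            # rescale to trust-region radius

g_r = torch.randn(100)             # policy gradient of expected reward
g_c = torch.randn(100)             # policy gradient of expected cost
step = safety_biased_direction(g_r, g_c, beta=0.7, radius=0.01)
```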

[374] FineFT: Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading

Molei Qin, Xinyu Cai, Yewen Li, Haochong Xia, Chuqiao Zong, Shuo Sun, Xinrun Wang, Bo An

Main category: cs.LG

TL;DR: FineFT is a three-stage ensemble RL framework for crypto futures trading that addresses high leverage challenges through selective updates, profitability filtering, and VAE-based risk management to handle market volatility and black swan events.

DetailsMotivation: Existing RL methods for quantitative trading focus on spot markets and fail to address two key challenges in futures trading: 1) high leverage amplifies reward fluctuations, making training unstable and difficult to converge, and 2) lack of self-awareness about capability boundaries exposes systems to significant losses during black swan events like COVID-19.

Method: FineFT uses a three-stage ensemble RL framework: Stage I - ensemble Q learners are selectively updated using ensemble TD errors to improve convergence; Stage II - filters Q-learners based on profitability and trains VAEs on market states to identify capability boundaries; Stage III - chooses between filtered ensemble and conservative policy using trained VAEs to maintain profitability while mitigating risk with new market states.

Result: FineFT outperforms 12 state-of-the-art baselines on crypto futures in high-frequency trading with 5x leverage across 6 financial metrics, reducing risk by more than 40% while achieving superior profitability compared to the runner-up. Visualization shows agents specialize in distinct market dynamics, and ablation studies confirm VAE routing reduces maximum drawdown while selective update improves convergence and performance.

Conclusion: The proposed FineFT framework successfully addresses the challenges of futures trading with high leverage through stable training and proper risk management, demonstrating effective handling of market volatility and black swan events while maintaining profitability.

Abstract: Futures are contracts obligating the exchange of an asset at a predetermined date and price; they are notable for their high leverage and liquidity and therefore thrive in the crypto market. RL has been widely applied to various quantitative tasks. However, most methods focus on the spot market and cannot be directly applied to the futures market with high leverage because of two challenges. First, high leverage amplifies reward fluctuations, making training stochastic and difficult to converge. Second, prior works lacked self-awareness of their capability boundaries, exposing them to the risk of significant loss when encountering a new market state (e.g., a black swan event like COVID-19). To tackle these challenges, we propose Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading (FineFT), a novel three-stage ensemble RL framework with stable training and proper risk management. In stage I, ensemble Q-learners are selectively updated by ensemble TD errors to improve convergence. In stage II, we filter the Q-learners based on their profitability and train VAEs on market states to identify the capability boundaries of the learners. In stage III, we choose between the filtered ensemble and a conservative policy, guided by the trained VAEs, to maintain profitability and mitigate risk under new market states. Through extensive experiments on crypto futures in a high-fidelity, high-frequency trading environment with 5x leverage, we demonstrate that FineFT outperforms 12 SOTA baselines on 6 financial metrics, reducing risk by more than 40% while achieving superior profitability compared to the runner-up. Visualization of the selective update mechanism shows that different agents specialize in distinct market dynamics, and ablation studies certify that routing with VAEs reduces maximum drawdown effectively and that selective updates improve convergence and performance.

[375] A Survey on Graph Neural Networks for Fraud Detection in Ride Hailing Platforms

Kanishka Hewageegana, Janani Harischandra, Nipuna Senanayake, Gihan Danansuriya, Kavindu Hapuarachchi, Pooja Illangarathne

Main category: cs.LG

TL;DR: This paper reviews GNN-based fraud detection in ride-hailing platforms, comparing models, addressing class imbalance and fraud camouflage, and identifying research gaps for future improvements.

DetailsMotivation: The motivation is to enhance fraud detection in ride-hailing platforms by leveraging Graph Neural Networks, as fraud incidents are significant problems in the rapidly evolving ride-hailing industry that require advanced detection methods.

Method: The paper conducts a comparative analysis of various GNN models for fraud detection, examines approaches to address class imbalance and fraudulent camouflage, and provides a structured overview of GNN architectures and methodologies applied to anomaly detection in ride-hailing contexts.

Result: The research identifies significant methodological progress in GNN-based fraud detection, compares the effectiveness of different models, and highlights existing gaps in current approaches, particularly in handling real-world challenges like class imbalance and fraud camouflage.

Conclusion: The paper concludes that while GNNs show promise for ride-hailing fraud detection, further exploration is needed for real-world applicability and technical improvements to enhance fraud detection strategies in this rapidly evolving industry.

Abstract: This study investigates fraud detection in ride-hailing platforms through Graph Neural Networks (GNNs), focusing on the effectiveness of various models. By analyzing prevalent fraudulent activities, the research highlights and compares existing work on fraud detection that can be useful when addressing fraudulent incidents within online ride-hailing platforms. The paper also highlights approaches for addressing class imbalance and fraudulent camouflage. It then outlines a structured overview of GNN architectures and methodologies applied to anomaly detection, identifying significant methodological progress and gaps. The paper calls for further exploration into real-world applicability and technical improvements to enhance fraud detection strategies in the rapidly evolving ride-hailing industry.

[376] TabMixNN: A Unified Deep Learning Framework for Structural Mixed Effects Modeling on Tabular Data

Deniz Akdemir

Main category: cs.LG

TL;DR: TabMixNN is a PyTorch framework that combines classical mixed-effects models with neural networks for tabular data analysis, supporting hierarchical structures and diverse outcome types.

DetailsMotivation: There's a growing need for methods that can handle hierarchical data structures while supporting diverse outcome types (regression, classification, multitask learning) and maintaining interpretability of classical mixed-effects models.

Method: Three-stage modular architecture: (1) mixed-effects encoder with variational random effects and flexible covariance structures, (2) backbone architectures including GSEM and spatial-temporal manifold networks, (3) outcome-specific prediction heads. Features R-style formula interface, DAG constraints for causal learning, SPDE kernels for spatial modeling, and interpretability tools.

Result: Demonstrated flexibility through applications to longitudinal data analysis, genomic prediction, and spatial-temporal modeling.

Conclusion: TabMixNN provides a unified interface for researchers to leverage deep learning while maintaining the interpretability and theoretical grounding of classical mixed-effects models.

Abstract: We present TabMixNN, a flexible PyTorch-based deep learning framework that synthesizes classical mixed-effects modeling with modern neural network architectures for tabular data analysis. TabMixNN addresses the growing need for methods that can handle hierarchical data structures while supporting diverse outcome types including regression, classification, and multitask learning. The framework implements a modular three-stage architecture: (1) a mixed-effects encoder with variational random effects and flexible covariance structures, (2) backbone architectures including Generalized Structural Equation Models (GSEM) and spatial-temporal manifold networks, and (3) outcome-specific prediction heads supporting multiple outcome families. Key innovations include an R-style formula interface for accessibility, support for directed acyclic graph (DAG) constraints for causal structure learning, Stochastic Partial Differential Equation (SPDE) kernels for spatial modeling, and comprehensive interpretability tools including SHAP values and variance decomposition. We demonstrate the framework’s flexibility through applications to longitudinal data analysis, genomic prediction, and spatial-temporal modeling. TabMixNN provides a unified interface for researchers to leverage deep learning while maintaining the interpretability and theoretical grounding of classical mixed-effects models.

[377] Improved Bounds for Private and Robust Alignment

Wenqian Weng, Yi He, Xingyu Zhou

Main category: cs.LG

TL;DR: Theoretical analysis of private and robust alignment of language models with bounds on suboptimality gaps in offline/online settings under privacy constraints and adversarial corruption.

DetailsMotivation: To establish theoretical foundations for aligning language models under both privacy constraints (protecting sensitive preference data) and adversarial corruption (robustness to manipulated labels), analyzing the interplay between these two challenges.

Method: Theoretical analysis using uniform convergence guarantees for log loss and square loss under privacy and corruption. Examines two interplays: privacy-first and corruption-first. Uses MLE-style algorithm with log loss for privacy-only setting, and analyzes existing offline algorithms for joint privacy-and-corruption setting.

Result: For privacy-only setting: log loss with MLE achieves near-optimal rates. For joint setting: existing offline algorithms provide stronger guarantees than previously known. First results for private and robust online alignment. New uniform convergence guarantees for log loss and square loss under privacy/corruption.

Conclusion: The paper establishes theoretical bounds for private and robust alignment of language models, showing that log loss with MLE can achieve near-optimal rates under privacy constraints, and that existing algorithms provide stronger joint guarantees. The new uniform convergence results have broad applicability in learning theory and statistics.

Abstract: In this paper, we study the private and robust alignment of language models from a theoretical perspective by establishing upper bounds on the suboptimality gap in both offline and online settings. We consider preference labels subject to privacy constraints and/or adversarial corruption, and analyze two distinct interplays between them: privacy-first and corruption-first. For the privacy-only setting, we show that log loss with an MLE-style algorithm achieves near-optimal rates, in contrast to conventional wisdom. For the joint privacy-and-corruption setting, we first demonstrate that existing offline algorithms in fact provide stronger guarantees – simultaneously in terms of corruption level and privacy parameters – than previously known, which further yields improved bounds in the corruption-only regime. In addition, we also present the first set of results for private and robust online alignment. Our results are enabled by new uniform convergence guarantees for log loss and square loss under privacy and corruption, which we believe have broad applicability across learning theory and statistics.

[378] MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling

Mahdi Karami, Ali Behrouz, Peilin Zhong, Razvan Pascanu, Vahab Mirrokni

Main category: cs.LG

TL;DR: MS-SSM introduces a multi-scale state-space model framework that captures both fine-grained and coarse patterns across multiple resolutions, improving memory efficiency and long-range modeling with dynamic scale mixing.

DetailsMotivation: Traditional SSMs have limited effective memory and struggle with multi-scale dependencies needed for complex structures in time series, images, and natural language. They require larger state sizes for better recall but still miss hierarchical patterns.

Method: Proposes a multi-scale SSM framework that represents sequence dynamics across multiple resolutions, with specialized state-space dynamics for each resolution. Includes an input-dependent scale-mixer for dynamic information fusion across resolutions.

Result: MS-SSM consistently outperforms prior SSM-based models on benchmarks including Long Range Arena, hierarchical reasoning, time series classification, and image recognition, while maintaining computational efficiency.

Conclusion: Multi-resolution processing in state-space architectures significantly improves sequence modeling, particularly for long-range and hierarchical tasks, demonstrating the benefits of capturing both fine-grained and global patterns.

Abstract: State-space models (SSMs) have recently attracted attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. They rely on linear recurrences to integrate information over time, enabling fast inference, parallelizable training, and control over recurrence stability. However, traditional SSMs often suffer from limited effective memory, requiring larger state sizes for improved recall. Moreover, existing SSMs struggle to capture multi-scale dependencies, which are essential for modeling complex structures in time series, images, and natural language. This paper introduces a multi-scale SSM framework that addresses these limitations by representing sequence dynamics across multiple resolutions and processing each resolution with specialized state-space dynamics. By capturing both fine-grained, high-frequency patterns and coarse, global trends, MS-SSM enhances memory efficiency and long-range modeling. We further introduce an input-dependent scale-mixer, enabling dynamic information fusion across resolutions. The proposed approach significantly improves sequence modeling, particularly in long-range and hierarchical tasks, while maintaining computational efficiency. Extensive experiments on benchmarks, including Long Range Arena, hierarchical reasoning, time series classification, and image recognition, demonstrate that MS-SSM consistently outperforms prior SSM-based models, highlighting the benefits of multi-resolution processing in state-space architectures.
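
A toy sketch of multi-resolution sequence processing with an input-dependent scale mixer; a simple diagonal recurrence stands in for the paper's per-scale state-space dynamics, and all hyperparameters are illustrative.

```python
# Toy multi-scale recurrence: pool to each scale, recur, upsample, mix per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiScale(nn.Module):
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.decay = nn.Parameter(torch.rand(len(scales), dim))  # per-scale state decay
        self.mixer = nn.Linear(dim, len(scales))                 # input-dependent weights

    def forward(self, x):                           # x: (batch, length, dim)
        outs = []
        for i, s in enumerate(self.scales):
            xs = F.avg_pool1d(x.transpose(1, 2), s, stride=s).transpose(1, 2)
            a = torch.sigmoid(self.decay[i])
            h, hs = torch.zeros_like(xs[:, 0]), []
            for t in range(xs.shape[1]):            # linear recurrence h = a*h + x_t
                h = a * h + xs[:, t]
                hs.append(h)
            ys = torch.stack(hs, dim=1)
            outs.append(F.interpolate(ys.transpose(1, 2), size=x.shape[1]).transpose(1, 2))
        w = F.softmax(self.mixer(x), dim=-1)        # (batch, length, n_scales)
        return sum(w[..., i:i+1] * outs[i] for i in range(len(self.scales)))

y = ToyMultiScale(dim=16)(torch.randn(2, 32, 16))
```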

[379] Exploiting the Prior of Generative Time Series Imputation

YuYang Miao, Chang Li, Zehua Chen

Main category: cs.LG

TL;DR: Bridge-TS introduces a data-to-data generation approach for time series imputation using expert priors and compositional priors from pretrained models to improve accuracy over previous generative methods.

DetailsMotivation: Previous generative models for time series imputation (like diffusion models and Schrodinger bridges) use uninformative priors (Gaussian noise or linear interpolation), which increases generation burden and limits imputation accuracy. The paper aims to improve prior design for better generative imputation.

Method: Bridge-TS builds a data-to-data generation process with two novel designs: 1) Expert prior - uses a pretrained transformer-based module to fill missing values with deterministic estimation as prior; 2) Compositional priors - combines multiple pretrained models’ estimations in the data-to-data generation process for compositional priors-to-target imputation.

Result: Experiments on benchmark datasets (ETT, Exchange, Weather) show Bridge-TS achieves new state-of-the-art imputation accuracy in terms of mean square error (MSE) and mean absolute error (MAE), demonstrating superiority of improved prior design.

Conclusion: Improving prior design with expert and compositional priors significantly enhances generative time series imputation accuracy, establishing Bridge-TS as a superior approach for data-to-data generation in time series imputation tasks.

Abstract: Time series imputation, i.e., filling the missing values of a time recording, finds various applications in electricity, finance, and weather modelling. Previous methods have introduced generative models such as diffusion probabilistic models and Schrodinger bridge models to conditionally generate the missing values from Gaussian noise or directly from linear interpolation results. However, as their prior is not informative about the ground-truth target, their generation process inevitably suffers an increased burden and limited imputation accuracy. In this work, we present Bridge-TS, building a data-to-data generation process for generative time series imputation and exploiting the design of the prior with two novel components. First, we propose an expert prior, leveraging a pretrained transformer-based module as an expert to fill the missing values with a deterministic estimation, and then taking the results as the prior for the ground-truth target. Second, we explore compositional priors, utilizing several pretrained models to provide different estimation results, and then combining them in the data-to-data generation process to achieve a compositional priors-to-target imputation process. Experiments conducted on several benchmark datasets such as ETT, Exchange, and Weather show that Bridge-TS sets a new record for imputation accuracy in terms of mean square error and mean absolute error, demonstrating the benefit of improving the prior for generative time series imputation.

[380] Trellis: Learning to Compress Key-Value Memory in Attention Models

Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni

Main category: cs.LG

TL;DR: Trellis is a novel Transformer architecture with bounded memory that dynamically compresses KV cache at test time using a two-pass recurrent compression mechanism, outperforming baselines especially on long sequences.

DetailsMotivation: Transformers suffer from quadratic computational complexity and ever-growing KV cache memory requirements, which limits their efficiency and scalability for long-context applications.

Method: Replaces standard KV cache with fixed-size memory and trains a two-pass recurrent compression mechanism that uses online gradient descent with a forget gate to dynamically compress and update key-value memory at test time.

Result: Extensive experiments show Trellis outperforms strong baselines on language modeling, common-sense reasoning, recall-intensive tasks, and time series, with performance gains increasing as sequence length grows.

Conclusion: Trellis demonstrates potential for long-context applications by providing bounded memory Transformer architecture with dynamic KV cache compression that maintains performance while reducing memory requirements.

Abstract: Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
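
A minimal sketch of the core write/read mechanics, assuming a single-step gradient write and omitting the feature map, the two-pass structure, and the learned gating of the actual architecture.

```python
# Fixed-size associative memory updated by online gradient descent + forget gate.
import torch
import torch.nn.functional as F

def write(M, k, v, lr=1.0, forget=0.99):
    """One compressive write: gradient step on || k @ M - v ||^2."""
    pred = k @ M                                   # k: (1, d_k), M: (d_k, d_v)
    grad = k.T @ (pred - v)                        # gradient of the squared error
    return forget * M - lr * grad                  # decay old content, store new

def read(M, q):
    return q @ M                                   # retrieve value for query q

d_k, d_v = 8, 8
M = torch.zeros(d_k, d_v)
k = F.normalize(torch.randn(1, d_k))               # unit-norm key for this example
v = torch.randn(1, d_v)
M = write(M, k, v)
recalled = read(M, k)                              # equals v here: k is unit-norm, M was empty
```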

[381] Flow Matching Neural Processes

Hussen Abu Hamad, Dan Rosenbaum

Main category: cs.LG

TL;DR: A new Neural Process model using flow matching for conditional distribution prediction, offering simple implementation, ODE-based sampling without auxiliary conditioning, and controllable accuracy-speed tradeoff.

DetailsMotivation: To develop a simpler and more effective Neural Process model that can perform conditional sampling without complex auxiliary conditioning methods, while providing better performance and flexible tradeoffs between accuracy and computational cost.

Method: Proposes a Neural Process model based on flow matching, a generative modeling approach. The model provides amortized predictions of conditional distributions over arbitrary points and uses an ODE solver for sampling from conditional distributions without requiring auxiliary conditioning methods.

Result: Outperforms previous state-of-the-art neural process methods on multiple benchmarks including synthetic 1D Gaussian processes data, 2D images, and real-world weather data.

Conclusion: The flow matching-based Neural Process model offers a simpler implementation, effective conditional sampling via ODE solver, controllable accuracy-speed tradeoff, and superior performance across diverse benchmarks compared to existing methods.

Abstract: Neural processes (NPs) are a class of models that learn stochastic processes directly from data and can be used for inference, sampling, and conditional sampling. We introduce a new NP model based on flow matching, a generative modeling paradigm that has demonstrated strong performance on various data modalities. Following the NP training framework, the model provides amortized predictions of conditional distributions over arbitrary points in the data. Compared to previous NP models, our model is simple to implement and can be used to sample from conditional distributions using an ODE solver, without requiring auxiliary conditioning methods. In addition, the model provides a controllable tradeoff between accuracy and running time via the number of steps in the ODE solver. We show that our model outperforms previous state-of-the-art neural process methods on various benchmarks including synthetic 1D Gaussian process data, 2D images, and real-world weather data.
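
A minimal sketch of a conditional flow-matching training objective of the kind such a model builds on; the context embedding and velocity-field architecture here are illustrative assumptions, not the paper's design.

```python
# Conditional flow matching: regress a context-conditioned velocity field onto
# the straight-line target x1 - x0.
import torch
import torch.nn as nn

ctx_dim, y_dim = 32, 1
vel = nn.Sequential(nn.Linear(y_dim + ctx_dim + 1, 64), nn.SiLU(), nn.Linear(64, y_dim))

def fm_loss(context_emb, y_target):
    x0 = torch.randn_like(y_target)                # noise endpoint of the path
    t = torch.rand(y_target.shape[0], 1)
    xt = (1 - t) * x0 + t * y_target               # linear interpolation path
    v_pred = vel(torch.cat([xt, context_emb, t], dim=-1))
    return ((v_pred - (y_target - x0)) ** 2).mean()

loss = fm_loss(torch.randn(16, ctx_dim), torch.randn(16, y_dim))
loss.backward()
```

Sampling then integrates dx/dt = v(x, context, t) from Gaussian noise with an ODE solver, where the number of solver steps gives the accuracy-speed tradeoff mentioned above.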

[382] Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

Yue Guan, Changming Yu, Shihan Fang, Weiming Hu, Zaifeng Pan, Zheng Wang, Zihan Liu, Yangjie Zhou, Yufei Ding, Minyi Guo, Jingwen Leng

Main category: cs.LG

TL;DR: Yggdrasil is a co-designed speculative decoding system that achieves latency-optimal LLM inference through context-aware tree drafting and compiler-friendly execution, delivering up to 3.98× speedup over state-of-the-art baselines.

DetailsMotivation: Existing speculative decoding systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions, limiting their effectiveness in parallel token generation and verification.

Method: Yggdrasil introduces three key innovations: 1) equal-growth tree structure for static graph compatibility, 2) latency-aware optimization objective for draft selection, and 3) stage-based scheduling to reduce overhead. The system supports unmodified LLMs through context-aware tree drafting and compiler-friendly execution.

Result: Yggdrasil achieves up to 3.98× speedup over state-of-the-art baselines across multiple hardware setups, demonstrating significant performance improvements in LLM inference through latency-optimal speculative decoding.

Conclusion: Yggdrasil successfully addresses the performance limitations of existing speculative decoding systems by co-designing the algorithm and runtime, enabling efficient parallel token generation and verification while maintaining compatibility with unmodified LLMs.

Abstract: Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to 3.98× speedup over state-of-the-art baselines across multiple hardware setups.

[383] Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining

Ruizhe Huang, Kexuan Zhang, Yihao Fang, Baifeng Yu

Main category: cs.LG

TL;DR: Infini-attention enables small language models to handle long contexts by compressing past segments into memory, improving retrieval accuracy by up to 31% despite some degradation with repeated compressions.

DetailsMotivation: To make small language models more accessible and cost-effective in low-resource settings by enabling efficient long-context processing despite limited parameters and computational resources.

Method: Empirical study using 300M-parameter LLaMA models pretrained with Infini-attention, which builds compressed memory from past segments while maintaining local attention mechanisms.

Result: Infini-attention models show training stability and outperform baselines in long-context retrieval, achieving up to 31% higher accuracy despite some performance degradation at 16,384-token contexts due to repeated memory compressions.

Conclusion: Architectural memory mechanisms like Infini-attention are beneficial for achieving robust long-context capabilities in small language models, effectively compensating for their parameter limitations.

Abstract: This study investigates small-scale pretraining for Small Language Models (SLMs) to enable efficient use of limited data and compute, improve accessibility in low-resource settings, and reduce costs. To enhance long-context extrapolation in compact models, we focus on Infini-attention, which builds a compressed memory from past segments while preserving local attention. In our work, we conduct an empirical study using 300M-parameter LLaMA models pretrained with Infini-attention. The model demonstrates training stability and outperforms the baseline in long-context retrieval. We identify the balance factor as a key driver of model performance, and we find that retrieval accuracy drops with repeated memory compressions over long sequences. Even so, Infini-attention still effectively compensates for the SLM’s limited parameters. In particular, despite performance degradation at a 16,384-token context, the Infini-attention model achieves up to 31% higher accuracy than the baseline. Our findings suggest that achieving robust long-context capability in SLMs benefits from architectural memory like Infini-attention.
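
For readers unfamiliar with the mechanism, below is a minimal sketch of an Infini-attention-style segment update for a single head, following the original Infini-attention formulation (a linear-attention memory plus a learned gate, plausibly the "balance factor" the abstract refers to); this paper's exact configuration may differ.

```python
# Minimal sketch of an Infini-attention-style segment update for one head:
# read from the compressed memory, mix with local causal attention via a
# learned gate, then compress the current segment into memory.
import torch
import torch.nn.functional as F

def feature_map(x):                      # sigma(x) = ELU(x) + 1, kept positive
    return F.elu(x) + 1.0

def infini_segment(Q, K, V, M, z, beta):
    """Q, K, V: (seq, d); M: (d, d) compressed memory; z: (d,) normalizer;
    beta: scalar gate logit balancing memory vs. local attention."""
    sQ = feature_map(Q)
    A_mem = (sQ @ M) / (sQ @ z).clamp_min(1e-6).unsqueeze(-1)   # memory read
    d = Q.shape[-1]
    scores = (Q @ K.T) / d**0.5
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    A_loc = torch.softmax(scores.masked_fill(causal, float("-inf")), -1) @ V
    g = torch.sigmoid(beta)
    A = g * A_mem + (1 - g) * A_loc      # gated mix of memory and local paths
    sK = feature_map(K)
    M_new = M + sK.T @ V                 # compress this segment into memory
    z_new = z + sK.sum(dim=0)
    return A, M_new, z_new
```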

[384] Max-Entropy Reinforcement Learning with Flow Matching and A Case Study on LQR

Yuyang Zhang, Yang Hu, Bo Dai, Na Li

Main category: cs.LG

TL;DR: A variant of the SAC algorithm that parameterizes the policy with flow-based models and introduces an online flow matching method (ISFM), which enables policy updates using samples from a user-specified distribution rather than the unknown target distribution.

DetailsMotivation: Standard SAC implementations often use simple policy classes for efficiency, sacrificing expressiveness and robustness. The authors aim to leverage the rich expressiveness of flow-based models while maintaining practical efficiency.

Method: Proposes a SAC variant with flow-based policy parameterization, using instantaneous change-of-variable technique for policy evaluation and an online flow matching method called Importance Sampling Flow Matching (ISFM) that enables updates with samples from user-specified distributions rather than unknown target distributions.

Result: Developed theoretical analysis of ISFM showing how different sampling distributions affect learning efficiency. Demonstrated on max-entropy linear quadratic regulator problems that the algorithm learns the optimal action distribution.

Conclusion: The proposed flow-based SAC variant successfully combines the expressiveness of flow models with practical efficiency through the ISFM method, enabling effective learning of optimal action distributions in max-entropy RL problems.

Abstract: Soft actor-critic (SAC) is a popular algorithm for max-entropy reinforcement learning. In practice, the energy-based policies in SAC are often approximated using simple policy classes for efficiency, sacrificing the expressiveness and robustness. In this paper, we propose a variant of the SAC algorithm that parameterizes the policy with flow-based models, leveraging their rich expressiveness. In the algorithm, we evaluate the flow-based policy utilizing the instantaneous change-of-variable technique and update the policy with an online variant of flow matching developed in this paper. This online variant, termed importance sampling flow matching (ISFM), enables policy update with only samples from a user-specified sampling distribution rather than the unknown target distribution. We develop a theoretical analysis of ISFM, characterizing how different choices of sampling distributions affect the learning efficiency. Finally, we conduct a case study of our algorithm on the max-entropy linear quadratic regulator problems, demonstrating that the proposed algorithm learns the optimal action distribution.
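
A heavily simplified sketch of the importance-weighting idea follows: the standard conditional flow-matching loss is reweighted so that samples drawn from a user-chosen distribution q stand in for the unknown target, known only up to normalization (e.g., proportional to exp(Q/alpha) in the max-entropy setting). The names `v_theta` and `log_p_star_unnorm` and the self-normalized estimator are illustrative, not the paper's exact ISFM objective.

```python
# Hedged sketch of an importance-sampling flow-matching step: the standard
# conditional flow-matching loss, reweighted so samples from a user-chosen
# distribution q stand in for the unknown target p*.
import torch

def isfm_loss(v_theta, x1, log_p_star_unnorm, log_q):
    """x1: (B, d) actions sampled from q; log_p_star_unnorm: unnormalized
    target log-density at x1 (e.g. Q(s, a)/alpha in max-entropy RL);
    log_q: sampling log-density at x1."""
    B = x1.shape[0]
    x0 = torch.randn_like(x1)                 # base noise samples
    t = torch.rand(B, 1)                      # random interpolation times
    x_t = (1 - t) * x0 + t * x1               # linear probability path
    target_v = x1 - x0                        # conditional target velocity
    logw = log_p_star_unnorm - log_q          # importance weights q -> p*
    w = torch.softmax(logw, dim=0) * B        # self-normalized, mean one
    per_sample = ((v_theta(x_t, t) - target_v) ** 2).sum(dim=-1)
    return (w.detach() * per_sample).mean()
```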

[385] Efficient Deep Learning for Short-Term Solar Irradiance Time Series Forecasting: A Benchmark Study in Ho Chi Minh City

Tin Hoang

Main category: cs.LG

TL;DR: Benchmark of 10 deep learning architectures for 1-hour ahead solar irradiance forecasting shows Transformer achieves best accuracy (R²=0.9696), with Knowledge Distillation enabling 23.5% model compression while improving performance.

DetailsMotivation: Reliable solar irradiance forecasting is crucial for managing solar energy variability in power grids, requiring accurate short-term predictions to ensure grid stability and efficient energy management.

Method: Comprehensive benchmark of 10 deep learning architectures (LSTM, TCN, Transformer, Informer, iTransformer, TSMixer, Mamba, etc.) using 10 years of high-resolution NSRDB satellite data for Ho Chi Minh City. Used SHAP analysis for temporal reasoning comparison and Knowledge Distillation for model compression.

Result: Transformer achieved the highest predictive accuracy (R²=0.9696). SHAP analysis revealed that Transformers exhibit a “recency bias” while Mamba leverages 24-hour periodic dependencies. Knowledge Distillation compressed the Transformer by 23.5% while reducing error (MAE: 23.78 W/m²).

Conclusion: Transformer is optimal for solar irradiance forecasting, with Knowledge Distillation providing effective compression for edge deployment. Different architectures exhibit distinct temporal reasoning patterns, offering insights for architecture selection based on forecasting needs.

Abstract: Reliable forecasting of Global Horizontal Irradiance (GHI) is essential for mitigating the variability of solar energy in power grids. This study presents a comprehensive benchmark of ten deep learning architectures for short-term (1-hour ahead) GHI time series forecasting in Ho Chi Minh City, leveraging high-resolution NSRDB satellite data (2011-2020) to compare established baselines (e.g. LSTM, TCN) against emerging state-of-the-art architectures, including Transformer, Informer, iTransformer, TSMixer, and Mamba. Experimental results identify the Transformer as the superior architecture, achieving the highest predictive accuracy with an R^2 of 0.9696. The study further utilizes SHAP analysis to contrast the temporal reasoning of these architectures, revealing that Transformers exhibit a strong “recency bias” focused on immediate atmospheric conditions, whereas Mamba explicitly leverages 24-hour periodic dependencies to inform predictions. Furthermore, we demonstrate that Knowledge Distillation can compress the high-performance Transformer by 23.5% while surprisingly reducing error (MAE: 23.78 W/m^2), offering a proven pathway for deploying sophisticated, low-latency forecasting on resource-constrained edge devices.
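
As a rough illustration of the compression step, distillation for a regression forecaster is commonly implemented as a blend of a ground-truth loss and a teacher-matching loss; the alpha weighting below is a generic stand-in, not the paper's recipe.

```python
# Hedged sketch of knowledge distillation for a regression forecaster:
# the student fits a blend of the ground-truth loss and a teacher-matching
# loss, letting a small model inherit the large Transformer's behavior.
import torch
import torch.nn.functional as F

def kd_regression_loss(student_pred, teacher_pred, target, alpha=0.5):
    """student_pred, teacher_pred, target: (B,) GHI forecasts in W/m^2."""
    hard = F.mse_loss(student_pred, target)                  # fit the data
    soft = F.mse_loss(student_pred, teacher_pred.detach())   # mimic the teacher
    return alpha * hard + (1 - alpha) * soft
```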

[386] Rethinking Dense Linear Transformations: Stagewise Pairwise Mixing (SPM) for Near-Linear Training in Neural Networks

Peter Farag

Main category: cs.LG

TL;DR: SPM replaces dense linear layers with stagewise pairwise mixers, achieving O(nL) complexity instead of O(n²) while maintaining competitive performance.

DetailsMotivation: Dense linear layers are computationally expensive (quadratic complexity) and often misaligned with the compositional structure of learned representations, creating need for more efficient structured alternatives.

Method: SPM replaces dense matrices with composition of sparse pairwise-mixing stages, implementing global linear transformation in O(nL) time/parameters. Two parameterizations: orthogonal norm-preserving rotation-based variant and fully general 2×2 mixing variant.

Result: Proof-of-concept experiments show substantial reductions in wall-clock cost and improved accuracy on structured learning problems, while maintaining competitive performance on real-world benchmarks.

Conclusion: SPM provides efficient drop-in replacement for dense linear layers with explicit compositional inductive bias that improves generalization when aligned with task structure.

Abstract: Dense linear layers are a dominant source of computational and parametric cost in modern machine learning models, owing to their quadratic complexity, and are often misaligned with the compositional structure of learned representations. We introduce Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces dense matrices with a composition of sparse pairwise-mixing stages. An SPM layer implements a global linear transformation in $O(nL)$ time with $O(nL)$ parameters, where $L$ is typically constant or $\log_2 n$, and admits exact closed-form forward and backward computations. SPM is designed as a drop-in replacement for dense linear layers in feedforward networks, recurrent architectures, attention mechanisms, etc. We derive complete forward and backward expressions for two parameterizations: an orthogonal norm-preserving rotation-based variant and a fully general $2 \times 2$ mixing variant. Beyond computational savings, the stagewise structure of SPM induces an explicit compositional inductive bias that constrains model capacity and improves generalization when aligned with task structure. We present proof-of-concept experiments demonstrating substantial reductions in wall-clock cost and improved accuracy on structured learning problems, while retaining competitive performance on real-world benchmarks.
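
A minimal sketch of the stagewise structure, assuming butterfly-style pairings (stage l mixes index i with i XOR 2^l) and the orthogonal 2x2-rotation variant; the paper's exact pairing scheme may differ. Each stage touches every element exactly once through a 2x2 mix, which is where the O(nL) cost comes from.

```python
# Sketch of a stagewise pairwise mixer with butterfly-style pairings and
# one 2x2 rotation per pair per stage (the orthogonal variant).
import numpy as np

def spm_forward(x, thetas):
    """x: (n,) with n a power of two; thetas: (L, n//2) rotation angles.
    Total work is O(nL) versus O(n^2) for a dense matrix."""
    n, L = x.shape[0], thetas.shape[0]
    y = x.astype(float).copy()
    for l in range(L):
        stride = 1 << (l % int(np.log2(n)))   # partner distance; cycles if L > log2(n)
        idx = np.arange(n)
        top = idx[(idx & stride) == 0]        # first element of each pair
        bot = top ^ stride                    # its partner
        c, s = np.cos(thetas[l]), np.sin(thetas[l])
        a, b = y[top].copy(), y[bot].copy()
        y[top] = c * a - s * b                # apply the 2x2 rotation pairwise
        y[bot] = s * a + c * b
    return y
```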

[387] Constraint Breeds Generalization: Temporal Dynamics as an Inductive Bias

Xia Chen

Main category: cs.LG

TL;DR: Physical constraints in biological systems serve as temporal inductive bias that improves generalization in neural networks through dissipative dynamics that compress phase space and abstract invariant features.

DetailsMotivation: Conventional deep learning focuses on unconstrained optimization, but biological systems operate under strict metabolic constraints. The paper proposes that these physical constraints aren't limitations but actually shape dynamics to function as temporal inductive bias that promotes generalization.

Method: Phase-space analysis of signal propagation reveals that proper dissipative dynamics compress phase space, aligning with spectral bias to abstract invariant features. This can be imposed externally via input encoding or intrinsically through network temporal dynamics. The approach requires architectures capable of temporal integration and proper constraints to decode induced invariants.

Result: Comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning demonstrate that a critical “transition” regime maximizes generalization capability. The approach shows that dynamical constraints serve as effective inductive bias.

Conclusion: Dynamical constraints represent a distinct class of inductive bias. Robust AI development requires not just scaling and removing limitations, but computationally mastering temporal characteristics that naturally promote generalization, similar to biological systems.

Abstract: Conventional deep learning prioritizes unconstrained optimization, yet biological systems operate under strict metabolic constraints. We propose that these physical constraints shape dynamics to function not as limitations, but as a temporal inductive bias that breeds generalization. Through a phase-space analysis of signal propagation, we reveal a fundamental asymmetry: expansive dynamics amplify noise, whereas proper dissipative dynamics compress phase space that aligns with the network’s spectral bias, compelling the abstraction of invariant features. This condition can be imposed externally via input encoding, or intrinsically through the network’s own temporal dynamics. Both pathways require architectures capable of temporal integration and proper constraints to decode induced invariants, whereas static architectures fail to capitalize on temporal structure. Through comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning, we demonstrate that a critical “transition” regime maximizes generalization capability. These findings establish dynamical constraints as a distinct class of inductive bias, suggesting that robust AI development requires not only scaling and removing limitations, but computationally mastering the temporal characteristics that naturally promote generalization.

[388] Interactive Machine Learning: From Theory to Scale

Yinglun Zhu

Main category: cs.LG

TL;DR: This dissertation advances interactive machine learning theory with efficient algorithms for active learning, contextual bandits, and model selection, achieving statistical optimality and computational efficiency.

DetailsMotivation: Traditional ML methods require large labeled datasets or extensive trial-and-error, which is expensive, time-consuming, or risky in large-scale/high-stakes settings. Interactive learning can reduce these costs by actively guiding information collection.

Method: Develops new algorithmic principles for interactive learning across three dimensions: 1) active learning with noisy data and rich model classes, 2) sequential decision making with large action spaces (contextual bandits), and 3) model selection under partial feedback.

Result: First computationally efficient active learning algorithms achieving exponential label savings without low-noise assumptions; first efficient general-purpose contextual bandit algorithms with guarantees independent of action space size; first tight characterizations of fundamental cost of model selection in sequential decision making.

Conclusion: Advances theoretical foundations of interactive learning with statistically optimal and computationally efficient algorithms, providing principled guidance for deploying interactive learning in large-scale real-world settings.

Abstract: Machine learning has achieved remarkable success across a wide range of applications, yet many of its most effective methods rely on access to large amounts of labeled data or extensive online interaction. In practice, acquiring high-quality labels and making decisions through trial-and-error can be expensive, time-consuming, or risky, particularly in large-scale or high-stakes settings. This dissertation studies interactive machine learning, in which the learner actively influences how information is collected or which actions are taken, using past observations to guide future interactions. We develop new algorithmic principles and establish fundamental limits for interactive learning along three dimensions: active learning with noisy data and rich model classes, sequential decision making with large action spaces, and model selection under partial feedback. Our results include the first computationally efficient active learning algorithms achieving exponential label savings without low-noise assumptions; the first efficient, general-purpose contextual bandit algorithms whose guarantees are independent of the size of the action space; and the first tight characterizations of the fundamental cost of model selection in sequential decision making. Overall, this dissertation advances the theoretical foundations of interactive learning by developing algorithms that are statistically optimal and computationally efficient, while also providing principled guidance for deploying interactive learning methods in large-scale, real-world settings.

[389] Improved Balanced Classification with Theoretically Grounded Loss Functions

Corinna Cortes, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: This paper introduces two advanced surrogate loss families (GLA and GCA) for imbalanced classification, showing GCA has stronger theoretical guarantees with better scaling (1/√p_min vs 1/p_min) and both outperform existing methods empirically.

DetailsMotivation: Balanced loss is important for fairness in imbalanced classification but is intractable to minimize directly, creating a need for effective surrogate losses with strong theoretical guarantees.

Method: Introduces two loss families: Generalized Logit-Adjusted (GLA) losses that extend Logit-Adjusted methods to general cross-entropy, and Generalized Class-Aware weighted (GCA) losses that extend class-weighted losses with class-dependent confidence margins.

Result: Theoretical analysis shows GLA losses are Bayes-consistent but only H-consistent for unbounded hypothesis sets with bounds scaling as 1/p_min. GCA losses are H-consistent for any hypothesis set with better scaling as 1/√p_min. Empirically, both outperform existing methods, with GLA slightly better in common benchmarks and GCA better in highly imbalanced settings.

Conclusion: GCA losses offer stronger theoretical guarantees for imbalanced classification with more favorable scaling properties, while both GLA and GCA provide practical improvements over existing surrogate loss methods, making them valuable tools for fairness in class-imbalanced scenarios.

Abstract: The balanced loss is a widely adopted objective for multi-class classification under class imbalance. By assigning equal importance to all classes, regardless of their frequency, it promotes fairness and ensures that minority classes are not overlooked. However, directly minimizing the balanced classification loss is typically intractable, which makes the design of effective surrogate losses a central question. This paper introduces and studies two advanced surrogate loss families: Generalized Logit-Adjusted (GLA) loss functions and Generalized Class-Aware weighted (GCA) losses. GLA losses generalize Logit-Adjusted losses, which shift logits based on class priors, to the broader general cross-entropy loss family. GCA loss functions extend the standard class-weighted losses, which scale losses inversely by class frequency, by incorporating class-dependent confidence margins and extending them to the general cross-entropy family. We present a comprehensive theoretical analysis of consistency for both loss families. We show that GLA losses are Bayes-consistent, but only $H$-consistent for complete (i.e., unbounded) hypothesis sets. Moreover, their $H$-consistency bounds depend inversely on the minimum class probability, scaling at least as $1/\mathsf p_{\min}$. In contrast, GCA losses are $H$-consistent for any hypothesis set that is bounded or complete, with $H$-consistency bounds that scale more favorably as $1/\sqrt{\mathsf p_{\min}}$, offering significantly stronger theoretical guarantees in imbalanced settings. We report the results of experiments demonstrating that, empirically, both the GCA losses with calibrated class-dependent confidence margins and GLA losses can greatly outperform straightforward class-weighted losses as well as the LA losses. GLA generally performs slightly better in common benchmarks, whereas GCA exhibits a slight edge in highly imbalanced settings.
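
For context, the base logit-adjusted (LA) loss that GLA generalizes shifts logits by a scaled log class prior before the softmax, enforcing larger margins on rare classes; a minimal sketch follows (the general cross-entropy family and the GCA class-dependent margins are not shown).

```python
# Sketch of the base logit-adjusted cross-entropy: adding tau * log(prior)
# to the logits makes frequent classes harder to predict, so the minimizer
# of the surrogate tracks the balanced rather than the standard error.
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_priors, tau=1.0):
    """logits: (B, C); targets: (B,); class_priors: (C,) class frequencies."""
    adjusted = logits + tau * torch.log(class_priors).unsqueeze(0)
    return F.cross_entropy(adjusted, targets)
```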

[390] DivQAT: Enhancing Robustness of Quantized Convolutional Neural Networks against Model Extraction Attacks

Kacem Khaled, Felipe Gohring de Magalhães, Gabriela Nicolescu

Main category: cs.LG

TL;DR: DivQAT is a novel Quantization Aware Training algorithm that enhances robustness of quantized CNNs against model extraction attacks by modifying the quantization process to integrate defense mechanisms during training.

DetailsMotivation: Quantized CNNs are vulnerable to extraction attacks (IP theft), but their robustness is understudied compared to large models. Existing defenses are limited as they are added post-training, computationally expensive, have unrealistic assumptions, and don't work well for quantized models on edge devices.

Method: DivQAT modifies the quantization process in Quantization Aware Training to integrate extraction defense mechanisms directly into the training process, making it the first technique to incorporate defense during quantization-aware training.

Result: Empirical validation on benchmark vision datasets shows DivQAT effectively defends against model extraction attacks without compromising model accuracy. When combined with other defense mechanisms, it improves their effectiveness compared to traditional QAT.

Conclusion: DivQAT provides an effective approach to enhance robustness of quantized CNNs against extraction attacks by integrating defense mechanisms directly into the quantization-aware training process, addressing limitations of existing post-training defenses.

Abstract: Convolutional Neural Networks (CNNs) and their quantized counterparts are vulnerable to extraction attacks, posing a significant threat of IP theft. Yet, the robustness of quantized models against these attacks is little studied compared to large models. Previous defenses propose to inject calculated noise into the prediction probabilities. However, these defenses are limited since they are not incorporated during the model design and are only added as an afterthought after training. Additionally, most defense techniques are computationally expensive and often have unrealistic assumptions about the victim model that are not feasible in edge device implementations and do not apply to quantized models. In this paper, we propose DivQAT, a novel algorithm to train quantized CNNs based on Quantization Aware Training (QAT) aiming to enhance their robustness against extraction attacks. To the best of our knowledge, our technique is the first to modify the quantization process to integrate a model extraction defense into the training process. Through empirical validation on benchmark vision datasets, we demonstrate the efficacy of our technique in defending against model extraction attacks without compromising model accuracy. Furthermore, combining our quantization technique with other defense mechanisms improves their effectiveness compared to traditional QAT.

[391] Physics-informed Graph Neural Networks for Operational Flood Modeling

Carlo Malapad Acosta, Herath Mudiyanselage Viraj Vidura Herath, Jia Yu Lim, Abhishek Saha, Sanka Rasnayaka, Lucy Marshall

Main category: cs.LG

TL;DR: DUALFloodGNN is a novel graph neural network architecture for flood modeling that embeds physical constraints at global and local scales, jointly predicts water volume and flow, and uses multi-step training with curriculum learning for improved autoregressive inference.

DetailsMotivation: Physics-based numerical flood models are accurate but computationally expensive, limiting their use in operational settings where rapid predictions are essential. GNNs offer speed and accuracy while handling unstructured spatial domains, and can be enhanced with physics-informed techniques for better interpretability.

Method: DUALFloodGNN embeds physical constraints through explicit loss terms at both global and local scales. It uses a shared message-passing framework to jointly predict water volume at nodes and flow along edges. Training employs multi-step loss enhanced with dynamic curriculum learning to improve autoregressive inference performance.

Result: Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables while maintaining high computational efficiency.

Conclusion: DUALFloodGNN provides an effective solution for rapid flood prediction by combining physics-informed constraints with GNN efficiency, making it suitable for operational disaster management. The model is open-sourced for community use.

Abstract: Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given its flexible input and architecture, GNNs can be leveraged alongside physics-informed techniques with ease, significantly improving interpretability. This study introduces a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables while maintaining high computational efficiency. The model is open-sourced at https://github.com/acostacos/dual_flood_gnn.
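
A hedged sketch of the multi-step training idea: the GNN is unrolled autoregressively for a number of steps against ground truth, with the horizon grown over training by a curriculum. The `model` callable, the MSE choice, and the linear schedule are illustrative stand-ins for the paper's dynamic curriculum.

```python
# Hedged sketch of multi-step rollout training with a horizon curriculum:
# the model predicts each next state from its own previous prediction, so
# the loss directly penalizes autoregressive error accumulation.
import torch

def multistep_loss(model, graph, states, horizon):
    """states: list of ground-truth node-state tensors over time."""
    pred, loss = states[0], 0.0
    for k in range(1, horizon + 1):
        pred = model(graph, pred)                      # autoregressive rollout
        loss = loss + torch.mean((pred - states[k]) ** 2)
    return loss / horizon

def curriculum_horizon(step, total_steps, max_horizon):
    # Grow the supervised rollout length linearly over training
    return 1 + int((max_horizon - 1) * step / total_steps)
```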

[392] Causify DataFlow: A Framework For High-performance Machine Learning Stream Computing

Giacinto Paolo Saggese, Paul Smith

Main category: cs.LG

TL;DR: DataFlow is a unified framework for building ML systems on unbounded time-series data that ensures identical execution in batch and streaming modes without code changes, solving causality and reproducibility issues.

DetailsMotivation: Traditional ML workflows assume finite datasets and require substantial reimplementation when moving from batch prototypes to streaming production, leading to causality violations, batch boundary artifacts, and poor reproducibility of real-time failures.

Method: Uses a unified execution model based on DAGs with point-in-time idempotency: outputs at any time t depend only on a fixed-length context window preceding t. Enforces strict causality by automatically tracking knowledge time across transformations, supports flexible tiling across temporal and feature dimensions, and provides fit/predict semantics for online learning with caching, incremental computation, and automatic parallelization.

Result: Demonstrated effectiveness across domains including financial trading, IoT, fraud detection, and real-time analytics. Models developed in batch mode execute identically in streaming production without code changes.

Conclusion: DataFlow resolves the gap between batch ML development and streaming production by providing a unified framework that ensures identical execution in both modes, eliminating causality violations and improving reproducibility while maintaining flexibility across different application domains.

Abstract: We present DataFlow, a computational framework for building, testing, and deploying high-performance machine learning systems on unbounded time-series data. Traditional data science workflows assume finite datasets and require substantial reimplementation when moving from batch prototypes to streaming production systems. This gap introduces causality violations, batch boundary artifacts, and poor reproducibility of real-time failures. DataFlow resolves these issues through a unified execution model based on directed acyclic graphs (DAGs) with point-in-time idempotency: outputs at any time t depend only on a fixed-length context window preceding t. This guarantee ensures that models developed in batch mode execute identically in streaming production without code changes. The framework enforces strict causality by automatically tracking knowledge time across all transformations, eliminating future-peeking bugs. DataFlow supports flexible tiling across temporal and feature dimensions, allowing the same model to operate at different frequencies and memory profiles via configuration alone. It integrates natively with the Python data science stack and provides fit/predict semantics for online learning, caching and incremental computation, and automatic parallelization through DAG-based scheduling. We demonstrate its effectiveness across domains including financial trading, IoT, fraud detection, and real-time analytics.
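
A toy illustration of point-in-time idempotency: a node's output at time t is a pure function of the inputs whose knowledge time falls in a fixed-length window ending at t, so batch replay and streaming evaluation agree by construction. The pandas-based node below is a sketch of the idea, not the framework's actual API.

```python
# Toy illustration of point-in-time idempotency: outputs at time t depend
# only on inputs known at or before t, within a fixed context window.
import pandas as pd

def compute_at(df, t, window="30min", col="value"):
    """df: inputs indexed by knowledge time; only rows already known at t
    and inside the fixed-length context window are visible."""
    t = pd.Timestamp(t)
    visible = df.loc[(df.index > t - pd.Timedelta(window)) & (df.index <= t)]
    return visible[col].mean()      # any deterministic transform of the window

# Replaying compute_at over historical timestamps (batch mode) and calling
# it on each new tick (streaming mode) produce identical outputs.
```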

[393] Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems

Tinglong Dai, David Simchi-Levi, Michelle Xiao Wu, Yao Xie

Main category: cs.LG

TL;DR: GenAI is moving from conversational assistants to autonomous agentic systems, creating an autonomy paradox where greater autonomy requires more formal structure and constraints. The paper proposes an operations research framework using flow-based generative models and adversarial robustness for assured autonomy in safety-critical domains.

DetailsMotivation: The shift from conversational GenAI to autonomous agentic systems creates an "autonomy paradox" - as systems gain more operational autonomy, they paradoxically need more formal structure, explicit constraints, and stronger risk discipline. Stochastic generative models can be fragile in operational domains without mechanisms for verifiable feasibility, robustness to distribution shift, and stress testing.

Method: Develops a conceptual framework for assured autonomy grounded in operations research with two complementary approaches: 1) Flow-based generative models that frame generation as deterministic transport via ordinary differential equations, enabling auditability and constraint-aware generation; 2) Operational safety through adversarial robustness lens where decision rules are evaluated against worst-case perturbations within uncertainty sets.

Result: The framework clarifies how increasing autonomy shifts operations research’s role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. This defines a research agenda for assured autonomy in safety-critical operational domains.

Conclusion: A structured operations research approach combining flow-based generative models with adversarial robustness principles is essential for developing assured autonomous GenAI systems that can operate safely and reliably in high-stakes operational domains, addressing the fundamental autonomy paradox.

Abstract: Generative artificial intelligence (GenAI) is shifting from conversational assistants toward agentic systems – autonomous decision-making systems that sense, decide, and act within operational workflows. This shift creates an autonomy paradox: as GenAI systems are granted greater operational autonomy, they should, by design, embody more formal structure, more explicit constraints, and stronger tail-risk discipline. We argue stochastic generative models can be fragile in operational domains unless paired with mechanisms that provide verifiable feasibility, robustness to distribution shift, and stress testing under high-consequence scenarios. To address this challenge, we develop a conceptual framework for assured autonomy grounded in operations research (OR), built on two complementary approaches. First, flow-based generative models frame generation as deterministic transport characterized by an ordinary differential equation, enabling auditability, constraint-aware generation, and connections to optimal transport, robust optimization, and sequential decision control. Second, operational safety is formulated through an adversarial robustness lens: decision rules are evaluated against worst-case perturbations within uncertainty or ambiguity sets, making unmodeled risks part of the design. This framework clarifies how increasing autonomy shifts OR’s role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.

[394] Information-Theoretic Quality Metric of Low-Dimensional Embeddings

Sebastián Gutiérrez-Bernal, Hector Medel Cobaxin, Abiel Galindo González

Main category: cs.LG

TL;DR: The paper introduces ERPM, an information-theoretic metric for evaluating low-dimensional embeddings that measures information preservation through entropy and stable rank analysis of neighborhood matrices, complementing existing distance-based and geometric metrics.

DetailsMotivation: Classical embedding evaluation metrics (stress, rank-based neighborhood criteria, Local Procrustes) only quantify distortions in distances or local geometries, but fail to directly assess how much information is preserved when projecting high-dimensional data to lower dimensions. There's a need for metrics that explicitly measure information preservation from an information-theoretic perspective.

Method: Introduces the Entropy Rank Preservation Measure (ERPM), a local metric based on Shannon entropy of the singular-value spectrum of neighborhood matrices and stable rank. ERPM quantifies changes in uncertainty between original high-dimensional data and its reduced projection, providing both neighborhood-level indicators and a global summary statistic.

Result: Distance-based criteria (like MRRE) show very low correlation with geometric and spectral measures. ERPM and Local Procrustes show strong average correlation but display significant discrepancies in local regimes. ERPM identifies neighborhoods with severe information loss that other metrics miss.

Conclusion: ERPM complements existing embedding evaluation metrics by providing an information-theoretic perspective that identifies neighborhoods with severe information loss. This enables more comprehensive assessment of embeddings, particularly valuable for information-sensitive applications like early-warning indicator construction.

Abstract: In this work we study the quality of low-dimensional embeddings from an explicitly information-theoretic perspective. We begin by noting that classical evaluation metrics such as stress, rank-based neighborhood criteria, or Local Procrustes quantify distortions in distances or in local geometries, but do not directly assess how much information is preserved when projecting high-dimensional data onto a lower-dimensional space. To address this limitation, we introduce the Entropy Rank Preservation Measure (ERPM), a local metric based on the Shannon entropy of the singular-value spectrum of neighborhood matrices and on the stable rank, which quantifies changes in uncertainty between the original representation and its reduced projection, providing neighborhood-level indicators and a global summary statistic. To validate the results of the metric, we compare its outcomes with the Mean Relative Rank Error (MRRE), which is distance-based, and with Local Procrustes, which is based on geometric properties, using a financial time series and a manifold commonly studied in the literature. We observe that distance-based criteria exhibit very low correlation with geometric and spectral measures, while ERPM and Local Procrustes show strong average correlation but display significant discrepancies in local regimes, leading to the conclusion that ERPM complements existing metrics by identifying neighborhoods with severe information loss, thereby enabling a more comprehensive assessment of embeddings, particularly in information-sensitive applications such as the construction of early-warning indicators.
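
The two spectral ingredients are straightforward to compute; a sketch follows, assuming the singular values are turned into a probability distribution via their squares (the paper's exact normalization and the way neighborhood-level scores are aggregated may differ).

```python
# Sketch of the ERPM ingredients: Shannon entropy of the singular-value
# spectrum of a neighborhood matrix, plus its stable rank.
import numpy as np

def spectral_entropy(X):
    """X: (k, d) matrix stacking a point's k nearest neighbors."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)                 # spectrum as a probability vector
    p = p[p > 0]
    return -np.sum(p * np.log(p))           # Shannon entropy of the spectrum

def stable_rank(X):
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s**2) / s[0]**2           # ||X||_F^2 / sigma_max^2

# A neighborhood-level score can then compare these quantities between the
# original neighborhood and its low-dimensional projection.
```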

[395] Tracing the Heart’s Pathways: ECG Representation Learning from a Cardiac Conduction Perspective

Tan Pan, Yixuan Sun, Chen Jiang, Qiong Gao, Rui Sun, Xingmeng Zhang, Zhenqi Yang, Limei Han, Yixiu Liang, Yuan Cheng, Kaiyu Guo

Main category: cs.LG

TL;DR: CLEAR-HUG is a two-stage ECG self-supervised learning framework that captures subtle cardiac conduction variations across leads while following clinical diagnostic workflows, achieving 6.84% performance improvement across six tasks.

DetailsMotivation: Existing ECG self-supervised learning methods focus on consistent patterns across leads and beats but overlook inherent heartbeat differences rooted in cardiac conduction processes. They also fail to align with clinical diagnostic guidelines that progress from individual heartbeats to single leads to lead combinations.

Method: Two-stage framework: 1) CLEAR (Conduction-LEAd Reconstructor) - an eSSL model using sparse attention to reconstruct signals while treating each heartbeat as distinct entity, capturing both specific variations and general commonalities. 2) HUG (Hierarchical lead-Unified Group head) - a diagnostic module mirroring clinical workflow for disease diagnosis.

Result: Experimental results across six tasks show a 6.84% improvement, validating the effectiveness of CLEAR-HUG in enhancing representations of cardiac conduction and aligning patterns with expert diagnostic guidelines.

Conclusion: CLEAR-HUG successfully addresses limitations of previous eSSL methods by capturing subtle cardiac conduction variations and aligning with clinical diagnostic workflows, demonstrating improved performance for ECG analysis tasks.

Abstract: The multi-lead electrocardiogram (ECG) stands as a cornerstone of cardiac diagnosis. Recent strides in electrocardiogram self-supervised learning (eSSL) have brightened prospects for enhancing representation learning without relying on high-quality annotations. Yet earlier eSSL methods suffer a key limitation: they focus on consistent patterns across leads and beats, overlooking the inherent differences in heartbeats rooted in cardiac conduction processes, while subtle but significant variations carry unique physiological signatures. Moreover, representation learning for ECG analysis should align with ECG diagnostic guidelines, which progress from individual heartbeats to single leads and ultimately to lead combinations. This sequential logic, however, is often neglected when applying pre-trained models to downstream tasks. To address these gaps, we propose CLEAR-HUG, a two-stage framework designed to capture subtle variations in cardiac conduction across leads while adhering to ECG diagnostic guidelines. In the first stage, we introduce an eSSL model termed Conduction-LEAd Reconstructor (CLEAR), which captures both specific variations and general commonalities across heartbeats. Treating each heartbeat as a distinct entity, CLEAR employs a simple yet effective sparse attention mechanism to reconstruct signals without interference from other heartbeats. In the second stage, we implement a Hierarchical lead-Unified Group head (HUG) for disease diagnosis, mirroring clinical workflow. Experimental results across six tasks show a 6.84% improvement, validating the effectiveness of CLEAR-HUG. This highlights its ability to enhance representations of cardiac conduction and align patterns with expert diagnostic guidelines.
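
A minimal sketch of the per-heartbeat sparsity: a block mask restricts attention to time steps within the same heartbeat, so each beat is reconstructed without interference from others. This is an illustrative reading of the mechanism, not the paper's exact attention kernel.

```python
# Sketch of per-heartbeat sparse attention via a block mask: each time step
# may only attend to steps belonging to the same heartbeat.
import torch

def heartbeat_block_mask(beat_ids):
    """beat_ids: (T,) integer heartbeat index per time step.
    Returns a (T, T) boolean mask, True where attention is allowed."""
    return beat_ids.unsqueeze(0) == beat_ids.unsqueeze(1)

# Usage: scores.masked_fill(~heartbeat_block_mask(beat_ids), float("-inf"))
# before the softmax confines attention to each heartbeat's block.
```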

[396] Hyperspherical Graph Representation Learning via Adaptive Neighbor-Mean Alignment and Uniformity

Rui Chen, Junjun Guo, Hongbin Wang, Yan Xiang, Yantuan Xian, Zhengtao Yu

Main category: cs.LG

TL;DR: HyperGRL is a hyperspherical graph representation learning framework using adaptive neighbor-mean alignment and sampling-free uniformity for stable, high-quality embeddings without complex negative sampling.

DetailsMotivation: Existing graph representation learning methods rely on complex contrastive objectives with negative sampling, leading to issues like over-smoothing, over-squashing, training instability, and sensitivity to hyperparameters.

Method: Embeds nodes on a unit hypersphere using two adversarially coupled objectives: neighbor-mean alignment (using local neighborhood means as stable targets) and sampling-free uniformity (L2-based hyperspherical regularization). Includes entropy-guided adaptive balancing for dynamic regulation.

Result: Achieves superior representation quality and generalization, with average improvements of 1.49% (node classification), 0.86% (node clustering), and 0.74% (link prediction) over the strongest existing methods.

Conclusion: Geometrically grounded, sampling-free contrastive objectives are effective for graph representation learning, providing stable training and high-quality embeddings without complex negative sampling strategies.

Abstract: Graph representation learning (GRL) aims to encode structural and semantic dependencies of graph-structured data into low-dimensional embeddings. However, existing GRL methods often rely on surrogate contrastive objectives or mutual information maximization, which typically demand complex architectures, negative sampling strategies, and sensitive hyperparameter tuning. These design choices may induce over-smoothing, over-squashing, and training instability. In this work, we propose HyperGRL, a unified framework for hyperspherical graph representation learning via adaptive neighbor-mean alignment and sampling-free uniformity. HyperGRL embeds nodes on a unit hypersphere through two adversarially coupled objectives: neighbor-mean alignment and sampling-free uniformity. The alignment objective uses the mean representation of each node’s local neighborhood to construct semantically grounded, stable targets that capture shared structural and feature patterns. The uniformity objective formulates dispersion via an L2-based hyperspherical regularization, encouraging globally uniform embedding distributions while preserving discriminative information. To further stabilize training, we introduce an entropy-guided adaptive balancing mechanism that dynamically regulates the interplay between alignment and uniformity without requiring manual tuning. Extensive experiments on node classification, node clustering, and link prediction demonstrate that HyperGRL delivers superior representation quality and generalization across diverse graph structures, achieving average improvements of 1.49%, 0.86%, and 0.74% over the strongest existing methods, respectively. These findings highlight the effectiveness of geometrically grounded, sampling-free contrastive objectives for graph representation learning.
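
A hedged sketch of the two coupled objectives on the unit hypersphere: pull each node toward its (detached) neighbor-mean target, and disperse all embeddings with a standard hyperspherical uniformity term standing in for the paper's L2-based regularizer. The entropy-guided balancing is reduced here to a fixed weight `lam`.

```python
# Hedged sketch of neighbor-mean alignment plus sampling-free uniformity
# on the unit hypersphere; no negative sampling is required.
import torch
import torch.nn.functional as F

def hypergrl_loss(z, adj, lam=1.0):
    """z: (N, d) node embeddings; adj: (N, N) row-normalized adjacency."""
    z = F.normalize(z, dim=-1)                       # project onto the sphere
    target = F.normalize(adj @ z, dim=-1).detach()   # stable neighbor-mean targets
    align = ((z - target) ** 2).sum(-1).mean()       # pull toward neighborhood mean
    sq_dist = torch.cdist(z, z).pow(2)               # all pairwise distances
    uniform = torch.exp(-2.0 * sq_dist).mean().log() # dispersion penalty
    return align + lam * uniform
```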

[397] How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns

Haoyue Bai, Yiyou Sun, Wenjie Hu, Shi Qiu, Maggie Ziyu Huan, Peiyang Song, Robert Nowak, Dawn Song

Main category: cs.LG

TL;DR: This paper introduces a novel benchmark to analyze why RL tuning preserves LLM capabilities while SFT narrows them, by decomposing reasoning into atomic core skills and tracking behavioral changes during training.

DetailsMotivation: The motivation is to understand why supervised fine-tuning (SFT) often narrows LLM capabilities while reinforcement-learning (RL) tuning tends to preserve them, moving beyond coarse accuracy metrics to examine the fundamental nature of reasoning in LLMs.

Method: The authors introduce a novel benchmark that decomposes reasoning into atomic core skills (calculation, fact retrieval, simulation, enumeration, diagnostic) and use a meta-probing framework to track model behavior at different training stages, combined with analyses of low-level statistical patterns like distributional divergence and parameter statistics.

Result: The benchmark reveals that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, while SFT models exhibit sharper drift and overfit to surface patterns, providing granular insights into how specific cognitive abilities emerge, transfer, and sometimes collapse during post-training.

Conclusion: This work provides new insights into the nature of reasoning in LLMs and points toward principles for designing training strategies that foster broad, robust generalization, showing that RL tuning better preserves reasoning capabilities compared to SFT.

Abstract: Large Language Models (LLMs) display strikingly different generalization behaviors: supervised fine-tuning (SFT) often narrows capability, whereas reinforcement-learning (RL) tuning tends to preserve it. The reasons behind this divergence remain unclear, as prior studies have largely relied on coarse accuracy metrics. We address this gap by introducing a novel benchmark that decomposes reasoning into atomic core skills such as calculation, fact retrieval, simulation, enumeration, and diagnostic, providing a concrete framework for addressing the fundamental question of what constitutes reasoning in LLMs. By isolating and measuring these core skills, the benchmark offers a more granular view of how specific cognitive abilities emerge, transfer, and sometimes collapse during post-training. Combined with analyses of low-level statistical patterns such as distributional divergence and parameter statistics, it enables a fine-grained study of how generalization evolves under SFT and RL across mathematical, scientific reasoning, and non-reasoning tasks. Our meta-probing framework tracks model behavior at different training stages and reveals that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns. This work provides new insights into the nature of reasoning in LLMs and points toward principles for designing training strategies that foster broad, robust generalization.

[398] Time-varying Mixing Matrix Design for Energy-efficient Decentralized Federated Learning

Xusheng Zhang, Tuan Nguyen, Ting He

Main category: cs.LG

TL;DR: Proposes energy-efficient mixing matrix design for decentralized federated learning in wireless networks to minimize maximum per-node energy consumption while leveraging broadcast communications.

DetailsMotivation: Existing mixing matrix designs for DFL focus on minimizing communication time but neglect per-node energy consumption, which is critical for energy-constrained wireless devices. There's a gap in optimizing energy efficiency while considering wireless broadcast nature.

Method: Develops a novel convergence theorem allowing arbitrarily time-varying mixing matrices, then proposes a multi-phase design framework that activates time-varying communication topologies under optimized budgets to trade off per-iteration energy consumption and convergence rate while balancing energy across nodes.

Result: The proposed solution effectively combines low energy consumption of sparse mixing matrices with fast convergence of dense mixing matrices, validated through evaluations based on real data.

Conclusion: The work addresses the energy efficiency gap in DFL mixing matrix design by providing a theoretically-justified solution that minimizes maximum per-node energy consumption while leveraging wireless broadcast advantages and time-varying topologies.

Abstract: We consider the design of mixing matrices to minimize the operation cost for decentralized federated learning (DFL) in wireless networks, with focus on minimizing the maximum per-node energy consumption. As a critical hyperparameter for DFL, the mixing matrix controls both the convergence rate and the needs of agent-to-agent communications, and has thus been studied extensively. However, existing designs mostly focused on minimizing the communication time, leaving open the minimization of per-node energy consumption that is critical for energy-constrained devices. This work addresses this gap through a theoretically-justified solution for mixing matrix design that aims at minimizing the maximum per-node energy consumption until convergence, while taking into account the broadcast nature of wireless communications. Based on a novel convergence theorem that allows arbitrarily time-varying mixing matrices, we propose a multi-phase design framework that activates time-varying communication topologies under optimized budgets to trade off the per-iteration energy consumption and the convergence rate while balancing the energy consumption across nodes. Our evaluations based on real data have validated the efficacy of the proposed solution in combining the low energy consumption of sparse mixing matrices and the fast convergence of dense mixing matrices.
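
For orientation, one round of decentralized learning with a mixing matrix looks as follows; the energy-aware, time-varying construction of W_t is the paper's contribution and is not reproduced here.

```python
# One round of decentralized federated learning: each node takes a local
# gradient step, then averages with its neighbors via x_{t+1} = W_t x_t.
import numpy as np

def dfl_round(X, W, grads, lr=0.1):
    """X: (n_nodes, d) local models; W: (n_nodes, n_nodes) doubly stochastic
    mixing matrix active this round; grads: (n_nodes, d) local gradients."""
    X = X - lr * grads      # local SGD step at every node
    return W @ X            # communicate and mix along this round's links

# A sparser W costs less energy per round but mixes information more slowly;
# alternating sparse and dense W_t across phases trades the two off.
```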

[399] Multi-Scenario Highway Lane-Change Intention Prediction: A Temporal Physics-Informed Multi-Modal Framework

Jiazhao Shi, Ziyu Wang, Yichen Lin, Shoufeng Lu

Main category: cs.LG

TL;DR: TPI-AI: A hybrid framework combining deep temporal representations with physics-inspired interaction features for robust lane-change intention prediction across diverse highway scenarios.

DetailsMotivation: Lane-change intention prediction is crucial for autonomous driving safety but faces challenges including noisy kinematics, severe class imbalance, and limited generalization across heterogeneous highway scenarios.

Method: Two-layer bidirectional LSTM encoder learns compact embeddings from trajectory histories, concatenated with physics-inspired features (headway, TTC, safe-gap indicators). LightGBM classifier trained with imbalance-aware optimization (resampling/weighting and threshold calibration) for three-class intention recognition.

Result: Outperforms standalone LightGBM and Bi-LSTM baselines on highD and exiD datasets. Achieves macro-F1 scores: highD (0.9562, 0.9124, 0.8345) and exiD (0.9247, 0.8197, 0.7605) at T = 1, 2, 3 seconds respectively.

Conclusion: Combining physics-informed interaction features with learned temporal embeddings yields robust multi-scenario lane-change intention prediction, addressing challenges of noisy data, class imbalance, and generalization across highway environments.

Abstract: Lane-change intention prediction is safety-critical for autonomous driving and ADAS, but remains difficult in naturalistic traffic due to noisy kinematics, severe class imbalance, and limited generalization across heterogeneous highway scenarios. We propose Temporal Physics-Informed AI (TPI-AI), a hybrid framework that fuses deep temporal representations with physics-inspired interaction cues. A two-layer bidirectional LSTM (Bi-LSTM) encoder learns compact embeddings from multi-step trajectory histories; we concatenate these embeddings with kinematics-, safety-, and interaction-aware features (e.g., headway, TTC, and safe-gap indicators) and train a LightGBM classifier for three-class intention recognition (No-LC, Left-LC, Right-LC). To improve minority-class reliability, we apply imbalance-aware optimization including resampling/weighting and fold-wise threshold calibration. Experiments on two large-scale drone-based datasets, highD (straight highways) and exiD (ramp-rich environments), use location-based splits and evaluate prediction horizons T = 1, 2, 3 s. TPI-AI outperforms standalone LightGBM and Bi-LSTM baselines, achieving macro-F1 of 0.9562, 0.9124, 0.8345 on highD and 0.9247, 0.8197, 0.7605 on exiD at T = 1, 2, 3 s, respectively. These results show that combining physics-informed interaction features with learned temporal embeddings yields robust multi-scenario lane-change intention prediction.
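
A sketch of the fusion step under common definitions of the physics features (time-to-collision as gap over closing speed, headway as gap over own speed); the paper's exact feature set and hyperparameters may differ.

```python
# Hedged sketch of physics-feature construction and fusion with learned
# embeddings ahead of an imbalance-aware LightGBM classifier.
import numpy as np
import lightgbm as lgb

def physics_features(gap, rel_speed, speed, min_time_gap=2.0):
    """gap: distance to lead vehicle (m); rel_speed: closing speed (m/s)."""
    ttc = np.where(rel_speed > 1e-6, gap / np.maximum(rel_speed, 1e-6), np.inf)
    headway = np.where(speed > 1e-6, gap / np.maximum(speed, 1e-6), np.inf)
    safe_gap = (gap > min_time_gap * speed).astype(float)   # safe-gap indicator
    return np.column_stack([np.clip(ttc, 0, 60), np.clip(headway, 0, 60), safe_gap])

# X_embed: (N, d) Bi-LSTM embeddings from trajectory histories
# X = np.hstack([X_embed, physics_features(gap, rel_speed, speed)])
# clf = lgb.LGBMClassifier(objective="multiclass", class_weight="balanced")
# clf.fit(X, y)   # y in {No-LC, Left-LC, Right-LC}
```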

[400] Autoregressivity in the Latent Space of a GP-VAE Language Model: An Empirical Ablation Study

Yves Ruffenach

Main category: cs.LG

TL;DR: Ablation study shows that latent autoregression in a GP-VAE yields latent trajectories more compatible with the Gaussian process prior and more stable over long horizons than non-autoregressive variants.

DetailsMotivation: To systematically analyze the role of latent autoregression in GP-VAE models, comparing it against non-autoregressive latent variables and standard token-level autoregressive Transformers.

Method: Conducted ablation study comparing three models: (1) full GP-VAE with autoregressive latent dynamics, (2) non-autoregressive ablation with independent latent variables, and (3) standard token-level autoregressive Transformer.

Result: Latent autoregression induces latent trajectories more compatible with Gaussian-process prior and exhibits greater long-horizon stability. Removing autoregression leads to degraded latent structure and unstable long-range behavior.

Conclusion: Latent autoregression is an effective mechanism for organizing long-range structure, complementary to token-level autoregressive modeling. This is an empirical analysis of representational structure rather than a new architecture proposal.

Abstract: This paper provides an ablation-based analysis of latent autoregression in GP-VAE models, building upon our previous work introducing the architecture. Language models typically rely on an autoregressive factorization over tokens. In contrast, our prior work proposed shifting sequential structure to the latent space through a causal Gaussian process, while using a non-autoregressive decoder. Here, we conduct a systematic ablation study of the role played by latent autoregression. We compare (i) a full GP-VAE model with autoregressive latent dynamics, (ii) a non-autoregressive ablation in which latent variables are independent, and (iii) a standard token-level autoregressive Transformer. Our results show that, within the considered regime (medium-scale corpora and short training contexts), latent autoregression induces latent trajectories that are significantly more compatible with the Gaussian-process prior and exhibit greater long-horizon stability. In contrast, removing autoregression leads to degraded latent structure and unstable long-range behavior. These findings highlight the role of latent autoregression as an effective mechanism for organizing long-range structure, while remaining complementary to token-level autoregressive modeling. They should be interpreted as an empirical analysis of representational structure rather than as a proposal for a new architecture.

[401] Enhancing LLM Planning Capabilities through Intrinsic Self-Critique

Bernd Bohnet, Pierre-Alexandre Kamienny, Hanie Sedghi, Dilan Gorur, Pranjal Awasthi, Aaron Parisi, Kevin Swersky, Rosanne Liu, Azade Nova, Noah Fiedel

Main category: cs.LG

TL;DR: LLMs can significantly improve planning performance through intrinsic self-critique without external verifiers, achieving state-of-the-art results on planning benchmarks.

DetailsMotivation: Despite previous research questioning the effectiveness of LLM self-critique methods, the authors aim to demonstrate that LLMs can indeed improve their own performance through intrinsic self-critique for planning tasks.

Method: Uses few-shot learning extended to many-shot approach as base method, then employs iterative correction and refinement through self-critique without external verifiers. Applied to Blocksworld, Logistics, and Mini-grid planning domains.

Result: Achieves significant performance gains over established planning benchmarks, exceeding strong baseline accuracies and setting new state-of-the-art for October 2024 LLM checkpoints.

Conclusion: Self-critique can significantly boost planning performance, demonstrating intrinsic self-improvement capabilities applicable regardless of specific model version, with potential for even better performance when applied to more complex search techniques and more capable models.

Abstract: We demonstrate an approach for LLMs to critique their own answers with the goal of enhancing their performance, leading to significant improvements over established planning benchmarks. Despite the findings of earlier research that has cast doubt on the effectiveness of LLMs leveraging self-critique methods, we show significant performance gains on planning datasets in the Blocksworld domain through intrinsic self-critique, without an external source such as a verifier. We also demonstrate similar improvements on Logistics and Mini-grid datasets, exceeding strong baseline accuracies. We employ a few-shot learning technique and progressively extend it to a many-shot approach as our base method and demonstrate that it is possible to gain substantial improvement on top of this already competitive approach by employing an iterative process for correction and refinement. We illustrate how self-critique can significantly boost planning performance. Our empirical results present a new state-of-the-art on the class of models considered, namely LLM model checkpoints from October 2024. Our primary focus lies on the method itself, demonstrating intrinsic self-improvement capabilities that are applicable regardless of the specific model version, and we believe that applying our method to more complex search techniques and more capable models will lead to even better performance.
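
The iterative correction loop can be pictured as follows; `llm` is a hypothetical text-completion callable and the prompts are illustrative, not the paper's.

```python
# Hedged pseudocode of the intrinsic self-critique loop: the same model
# drafts a plan, critiques it, and revises, with no external verifier.
def self_critique_plan(llm, task, few_shot_examples, max_rounds=5):
    plan = llm(f"{few_shot_examples}\nTask: {task}\nPlan:")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nPlan: {plan}\n"
                       "List any errors in this plan, or reply CORRECT.")
        if "CORRECT" in critique:
            break                                  # model accepts its own plan
        plan = llm(f"Task: {task}\nFlawed plan: {plan}\n"
                   f"Critique: {critique}\nRevised plan:")
    return plan
```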

[402] OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization

Advait Gadhikar, Riccardo Grazzi, James Hensman

Main category: cs.LG

TL;DR: OptRot and OptRot+ are rotation-based methods that reduce weight outliers in LLMs to improve quantization performance, with OptRot+ incorporating activation covariance for better results.

DetailsMotivation: Outliers in LLM weights and activations make quantization difficult, and existing rotation methods need improvement for better weight quantization performance.

Method: OptRot minimizes the element-wise fourth power of the rotated weights to reduce outliers; OptRot+ adds activation-covariance information. Both use GPTQ as the quantization method.

Result: OptRot outperforms Hadamard rotations, SpinQuant, and OSTQuant for weight quantization and improves W4A8 activation quantization. OptRot+ further improves performance, but both degrade in the W4A4 setting.

Conclusion: Learning fusible rotations with principled proxy objectives effectively reduces quantization errors, but there’s a trade-off between weight and activation quantization performance.

Abstract: The presence of outliers in Large Language Models (LLMs) weights and activations makes them difficult to quantize. Recent work has leveraged rotations to mitigate these outliers. In this work, we propose methods that learn fusible rotations by minimizing principled and cheap proxy objectives to the weight quantization error. We primarily focus on GPTQ as the quantization method. Our main method is OptRot, which reduces weight outliers simply by minimizing the element-wise fourth power of the rotated weights. We show that OptRot outperforms both Hadamard rotations and more expensive, data-dependent methods like SpinQuant and OSTQuant for weight quantization. It also improves activation quantization in the W4A8 setting. We also propose a data-dependent method, OptRot$^{+}$, that further improves performance by incorporating information on the activation covariance. In the W4A4 setting, we see that both OptRot and OptRot$^{+}$ perform worse, highlighting a trade-off between weight and activation quantization.
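
A minimal sketch of the OptRot-style objective follows: it learns an orthogonal rotation R that minimizes the element-wise fourth power of the rotated weights. The matrix-exponential parametrization of the rotation and the stand-in weight matrix are our assumptions, not necessarily the paper's setup.

```python
import torch

torch.manual_seed(0)
W = torch.randn(128, 128)                 # stand-in for a pretrained weight block
A = torch.zeros(128, 128, requires_grad=True)
opt = torch.optim.Adam([A], lr=1e-3)

for step in range(200):
    R = torch.matrix_exp(A - A.T)         # exp of a skew-symmetric matrix is orthogonal
    loss = (W @ R).pow(4).sum()           # element-wise fourth power penalizes outliers
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned R is orthogonal, so it can be fused into adjacent layers
# before applying a GPTQ-style weight quantizer to W @ R.
```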

[403] GARDO: Reinforcing Diffusion Models without Reward Hacking

Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan

Main category: cs.LG

TL;DR: GARDO is a reinforcement learning framework for fine-tuning diffusion models that addresses reward hacking, exploration limitations, and mode collapse through selective regularization, adaptive reference updates, and diversity-aware optimization.

DetailsMotivation: Fine-tuning diffusion models with RL often uses proxy rewards that don't fully capture true visual quality goals, leading to reward hacking (proxy scores increase while real quality deteriorates) and diversity collapse. Existing regularization methods against reference policies compromise sample efficiency and exploration since reference policies are usually sub-optimal.

Method: GARDO (Gated and Adaptive Regularization with Diversity-aware Optimization) has three key components: 1) Selective regularization that penalizes only high-uncertainty samples instead of universal regularization, 2) Adaptive regularization where the reference model is periodically updated to match the online policy’s capabilities, and 3) Diversity-aware optimization that amplifies rewards for high-quality, diverse samples to encourage mode coverage without destabilizing training.

Result: Extensive experiments across diverse proxy rewards and hold-out unseen metrics show GARDO effectively mitigates reward hacking, enhances generation diversity, and maintains sample efficiency and exploration capabilities. The framework demonstrates robustness across different RL algorithms and reward scenarios.

Conclusion: GARDO provides a versatile solution to the competing demands in RL fine-tuning of diffusion models, addressing reward hacking, exploration limitations, and mode collapse through its selective, adaptive, and diversity-aware approach, making it an effective framework for enhancing text-to-image alignment without compromising other important training objectives.

Abstract: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
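
The three ingredients can be illustrated as simple reward/penalty shaping. Everything in this sketch (the uncertainty proxy, thresholds, and coefficients) is a hypothetical instantiation of the ideas in the abstract, not GARDO's actual implementation; the adaptive reference update would correspond to periodically copying the online policy's weights into the reference model.

```python
import torch

def gardo_shaped_objective(reward, kl_to_ref, uncertainty, diversity,
                           u_thresh=0.8, div_coef=0.5, kl_coef=0.1):
    """All inputs: per-sample tensors of shape (batch,)."""
    # 1) Gated regularization: KL penalty applied only to high-uncertainty samples.
    gate = (uncertainty > u_thresh).float()
    penalty = kl_coef * gate * kl_to_ref
    # 3) Diversity-aware optimization: amplify high-quality samples that are diverse.
    bonus = div_coef * diversity * (reward > reward.median()).float()
    # 2) (not shown) the reference policy behind kl_to_ref is refreshed periodically.
    return (reward + bonus - penalty).mean()
```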

[404] Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction

Qianyi Chen, Bo Li

Main category: cs.LG

TL;DR: The paper proposes a method to improve conditional coverage in conformal prediction by minimizing mean squared error of conditional coverage through a refined quantile regression approach using a density-weighted pinball loss.

DetailsMotivation: Standard conformal prediction provides marginal coverage guarantees but struggles with reliable conditional coverage for specific inputs. While exact distribution-free conditional coverage is impossible with finite samples, there's a need to improve conditional coverage of existing conformal methods beyond relaxed notions.

Method: The authors derive a density-weighted pinball loss for quantile regression using Taylor expansion, where weights are given by the conditional density of conformity scores at the true quantile. They propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at 1-α±δ, then fine-tunes the central quantile by optimizing the weighted loss.

Result: Theoretical analysis provides exact non-asymptotic guarantees characterizing the excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance compared to standard approaches.

Conclusion: The proposed method effectively improves conditional coverage in conformal prediction by directly minimizing the mean squared error of conditional coverage through a novel density-weighted quantile regression approach, with both theoretical guarantees and empirical validation.

Abstract: While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distribution-free conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at $(1 - \alpha \pm \delta)$, subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.
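
A minimal sketch of the density-weighted pinball loss, assuming a three-headed network that outputs quantiles at levels 1-α-δ, 1-α, and 1-α+δ; the finite-difference density estimate follows the description above.

```python
import torch

def weighted_pinball(q_lo, q_mid, q_hi, y, alpha=0.1, delta=0.05):
    """q_lo, q_mid, q_hi: quantile heads at 1-a-d, 1-a, 1-a+d; y: targets."""
    tau = 1.0 - alpha
    # finite-difference density of the conformity score at the central quantile:
    # f(q_tau) ~ 2*delta / (q_{tau+delta} - q_{tau-delta})
    density = 2.0 * delta / (q_hi - q_lo).clamp_min(1e-6)
    resid = y - q_mid
    pinball = torch.maximum(tau * resid, (tau - 1.0) * resid)
    return (density.detach() * pinball).mean()  # weights receive no gradients
```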

[405] Paired Seed Evaluation: Statistical Reliability for Learning-Based Simulators

Udit Sharma

Main category: cs.LG

TL;DR: Paired seed evaluation design improves statistical efficiency in ML system comparisons by using identical random seeds across competing systems, reducing variance through seed-level correlation.

DetailsMotivation: Machine learning systems appear stochastic but are deterministically random due to seeded pseudorandom number generators. Current evaluation methods for comparing algorithms, design choices, and interventions suffer from high variance due to random initialization and learning stochasticity, leading to inefficient statistical comparisons.

Method: The paper formalizes a paired seed evaluation design where competing systems are evaluated under identical random seeds, inducing matched realizations of stochastic components. This creates strict variance reduction when outcomes are positively correlated at the seed level, unlike standard independent evaluation designs that fail to exploit shared randomness.

Result: Paired seed evaluation yields tighter confidence intervals, higher statistical power, and effective sample size gains at fixed computational budgets. Empirical analysis shows seed-level correlations are typically large and positive, producing order-of-magnitude efficiency gains.

Conclusion: Paired seed evaluation is weakly dominant in practice: it improves statistical reliability when correlation is present and reduces to independent evaluation without loss of validity when correlation is absent, making it a superior approach for comparative evaluation of ML systems.

Abstract: Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across executions. Learning-based simulators are widely used to compare algorithms, design choices, and interventions under such dynamics, yet evaluation outcomes often exhibit high variance due to random initialisation and learning stochasticity. We analyse the statistical structure of comparative evaluation in these settings and show that standard independent evaluation designs fail to exploit shared sources of randomness across alternatives. We formalise a paired seed evaluation design in which competing systems are evaluated under identical random seeds, inducing matched realisations of stochastic components and strict variance reduction whenever outcomes are positively correlated at the seed level. This yields tighter confidence intervals, higher statistical power, and effective sample size gains at fixed computational budgets. Empirically, seed-level correlations are typically large and positive, producing order-of-magnitude efficiency gains. Paired seed evaluation is weakly dominant in practice, improving statistical reliability when correlation is present and reducing to independent evaluation without loss of validity when it is not.
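
A short numpy example makes the variance argument concrete: when the seed-level outcomes of two systems share randomness, the paired mean difference has variance Var(A) + Var(B) - 2·Cov(A, B), which is strictly smaller than the independent-design variance whenever the covariance is positive. The numbers below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
common = rng.normal(size=n)                   # seed-level shared randomness
a = 1.0 + common + 0.3 * rng.normal(size=n)   # system A under seeds 1..n
b = 1.2 + common + 0.3 * rng.normal(size=n)   # system B under the *same* seeds

paired_se = np.sqrt(np.var(a - b, ddof=1) / n)
indep_se = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / n)
print(f"paired SE: {paired_se:.4f}  vs  independent-design SE: {indep_se:.4f}")
```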

[406] Micro-Macro Tensor Neural Surrogates for Uncertainty Quantification in Collisional Plasma

Wei Chen, Giacomo Dimarco, Lorenzo Pareschi

Main category: cs.LG

TL;DR: A variance-reduced Monte Carlo framework using neural network surrogates for uncertainty quantification in the Vlasov-Poisson-Landau system, achieving substantial computational savings while maintaining accuracy.

DetailsMotivation: Plasma kinetic equations are highly sensitive to microscopic perturbations, requiring reliable uncertainty quantification, but traditional methods face severe challenges from high computational costs, high-dimensional phase space, multiscale stiffness, and complex collision terms.

Method: Couples high-fidelity VPL solver with inexpensive neural network surrogates (VPFP and EP models) using a generalized separable physics-informed neural network (SPINN) with anisotropic micro-macro decomposition to reduce dimensionality. Calibrates VPFP model and designs asymptotic-preserving SPINN to improve correlation with VPL.

Result: Substantial variance reduction over standard Monte Carlo, accurate statistics with far fewer high-fidelity samples, lower wall-clock time, and robustness to stochastic dimension.

Conclusion: The proposed framework successfully addresses computational challenges in plasma kinetic uncertainty quantification by combining variance reduction techniques with neural network surrogates, enabling efficient and accurate UQ for complex VPL systems.

Abstract: Plasma kinetic equations exhibit pronounced sensitivity to microscopic perturbations in model parameters and data, making reliable and efficient uncertainty quantification (UQ) essential for predictive simulations. However, the cost of uncertainty sampling, the high-dimensional phase space, and multiscale stiffness pose severe challenges to both computational efficiency and error control in traditional numerical methods. These aspects are further emphasized in presence of collisions where the high-dimensional nonlocal collision integrations and conservation properties pose severe constraints. To overcome this, we present a variance-reduced Monte Carlo framework for UQ in the Vlasov–Poisson–Landau (VPL) system, in which neural network surrogates replace the multiple costly evaluations of the Landau collision term. The method couples a high-fidelity, asymptotic-preserving VPL solver with inexpensive, strongly correlated surrogates based on the Vlasov–Poisson–Fokker–Planck (VPFP) and Euler–Poisson (EP) equations. For the surrogate models, we introduce a generalization of the separable physics-informed neural network (SPINN), developing a class of tensor neural networks based on an anisotropic micro-macro decomposition, to reduce velocity-moment costs, model complexity, and the curse of dimensionality. To further increase correlation with VPL, we calibrate the VPFP model and design an asymptotic-preserving SPINN whose small- and large-Knudsen limits recover the EP and VP systems, respectively. Numerical experiments show substantial variance reduction over standard Monte Carlo, accurate statistics with far fewer high-fidelity samples, and lower wall-clock time, while maintaining robustness to stochastic dimension.
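
The variance-reduction mechanism is essentially a two-level control-variate estimator: E[f_hi] = E[f_lo] + E[f_hi - f_lo], where the cheap surrogate is sampled heavily and the correction term is estimated from a few high-fidelity runs. A toy numpy sketch, with stand-in functions in place of the VPL and surrogate solvers:

```python
import numpy as np

def two_level_estimate(f_hi, f_lo, n_cheap=10_000, n_few=50, seed=0):
    rng = np.random.default_rng(seed)
    z_cheap = rng.normal(size=n_cheap)     # uncertain inputs, surrogate-only
    z_few = rng.normal(size=n_few)         # shared inputs for the correction term
    return f_lo(z_cheap).mean() + (f_hi(z_few) - f_lo(z_few)).mean()

# toy check with a "high-fidelity" model and a strongly correlated surrogate
f_hi = lambda z: np.sin(z) + 0.01 * z**2
est = two_level_estimate(f_hi, np.sin)
```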

[407] Early Prediction of Sepsis using Heart Rate Signals and Genetic Optimized LSTM Algorithm

Alireza Rafiei, Farshid Hajati, Alireza Rezaee, Amirhossien Panahi, Shahadat Uddin

Main category: cs.LG

TL;DR: Researchers developed four machine learning models optimized for wearable devices to predict sepsis onset using heart rate data, with genetic algorithm optimization for performance and efficiency.

DetailsMotivation: Sepsis causes high mortality and healthcare costs, but existing prediction models focus mainly on ICU patients, leaving a gap for early detection in non-ward settings using wearable technology.

Method: Developed four novel ML algorithms for sepsis prediction on wearables using heart rate data, optimized architecture with genetic algorithm for performance/computational efficiency, and used transfer learning to extend prediction window from 1 to 4 hours.

Result: Models showed promising performance for wearable implementation, with optimized computational complexity and memory requirements suitable for accurate heart rate monitoring devices.

Conclusion: Wearable technology has potential for early sepsis detection outside ICU/ward environments, enabling timely intervention to reduce adverse outcomes.

Abstract: Sepsis, characterized by a dysregulated immune response to infection, results in significant mortality, morbidity, and healthcare costs. The timely prediction of sepsis progression is crucial for reducing adverse outcomes through early intervention. Despite the development of numerous models for Intensive Care Unit (ICU) patients, there remains a notable gap in approaches for the early detection of sepsis in non-ward settings. This research introduces and evaluates four novel machine learning algorithms designed for predicting the onset of sepsis on wearable devices by analyzing heart rate data. The architecture of these models was refined through a genetic algorithm, optimizing for performance, computational complexity, and memory requirements. Performance metrics were subsequently extracted for each model to evaluate their feasibility for implementation on wearable devices capable of accurate heart rate monitoring. The models were initially tailored for a prediction window of one hour, later extended to four hours through transfer learning. The encouraging outcomes of this study suggest the potential for wearable technology to facilitate early sepsis detection outside ICU and ward environments.
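
A skeleton of genetic-algorithm hyperparameter search over an LSTM configuration might look like the following; the genome fields, ranges, and the stubbed fitness function are illustrative assumptions, not the paper's exact search space.

```python
import random

def random_genome():
    return {"hidden": random.choice([16, 32, 64, 128]),
            "layers": random.randint(1, 3),
            "lr": 10 ** random.uniform(-4, -2)}

def mutate(genome):
    child = dict(genome)
    key = random.choice(list(child))
    child[key] = random_genome()[key]      # resample one gene
    return child

def fitness(genome):
    # stand-in for: train the LSTM on heart-rate windows and return a
    # validation score penalized by parameter count / memory footprint
    return -abs(genome["lr"] - 3e-3) - 1e-3 * genome["hidden"] * genome["layers"]

def evolve(pop_size=20, generations=10):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]       # truncation selection
        pop = elite + [mutate(random.choice(elite))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)
```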

[408] Empower Low-Altitude Economy: A Reliability-Aware Dynamic Weighting Allocation for Multi-modal UAV Beam Prediction

Haojin Li, Anbang Zhang, Chen Sun, Chenyuan Feng, Kaiqian Qu, Tony Q. S. Quek, Haijun Zhang

Main category: cs.LG

TL;DR: SaM2B is a semantic-aware multi-modal beam prediction framework for UAV communications that uses reliability-aware dynamic weighting and cross-modal contrastive learning to improve beam prediction accuracy across different UAV motion scenarios.

DetailsMotivation: Current multi-modal beam prediction methods use fixed weights assuming equal modality reliability, but modality importance fluctuates with UAV motion. Static weighting amplifies degraded modality impact, and modal mismatch/weak alignment hurt cross-scenario generalization.

Method: Proposes SaM2B framework with: 1) Reliability-aware dynamic weighting scheme that adaptively allocates contributions across modalities (environmental visual, flight posture, geospatial data) based on time-varying reliability; 2) Cross-modal contrastive learning to align multi-source representation beam semantics to shared semantic space for better discrimination and robustness.

Result: Experiments on real-world low-altitude UAV datasets show SaM2B achieves more satisfactory results than baseline methods.

Conclusion: SaM2B effectively addresses modality reliability fluctuations and alignment issues in UAV beam prediction, demonstrating superior performance through adaptive weighting and semantic alignment techniques.

Abstract: The low-altitude economy (LAE) is rapidly expanding, driven by urban air mobility, logistics drones, and aerial sensing, and fast, accurate beam prediction in uncrewed aerial vehicle (UAV) communications is crucial for achieving reliable connectivity. Current research is shifting from single-signal to multi-modal collaborative approaches. However, existing multi-modal methods mostly employ fixed or empirical weights, assuming equal reliability across modalities at any given moment. In practice, the importance of different modalities fluctuates dramatically with UAV motion scenarios, and static weighting amplifies the negative impact of degraded modalities. Furthermore, modal mismatch and weak alignment further undermine cross-scenario generalization. To this end, we propose a reliability-aware dynamic weighting scheme applied to a semantic-aware multi-modal beam prediction framework, named SaM2B. Specifically, SaM2B leverages lightweight cues such as environmental visual, flight posture, and geospatial data to adaptively allocate contributions across modalities at different time points through reliability-aware dynamic weight updates. Moreover, by utilizing cross-modal contrastive learning, we align the “multi-source representation beam semantics” associated with specific beam information to a shared semantic space, thereby enhancing discriminative power and robustness under modal noise and distribution shifts. Experiments on real-world low-altitude UAV datasets show that SaM2B achieves more satisfactory results than baseline methods.
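
The weighting mechanics can be sketched in a few lines: per-modality features are fused with softmax weights derived from time-varying reliability scores. How SaM2B actually estimates reliability is not shown here; this is only a hypothetical illustration of the fusion step.

```python
import torch

def fuse(features, reliability):
    """features: list of (batch, dim) tensors, one per modality;
    reliability: (batch, n_modal) time-varying reliability scores."""
    w = torch.softmax(reliability, dim=-1)           # adaptive weights per sample
    stacked = torch.stack(features, dim=1)           # (batch, n_modal, dim)
    return (w.unsqueeze(-1) * stacked).sum(dim=1)    # weighted sum over modalities

vision, pose, geo = (torch.randn(8, 64) for _ in range(3))
fused = fuse([vision, pose, geo], reliability=torch.randn(8, 3))
```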

[409] Tubular Riemannian Laplace Approximations for Bayesian Neural Networks

Rodrigo Pereira David

Main category: cs.LG

TL;DR: TRL introduces a tubular Riemannian Laplace approximation that models neural network posteriors as probabilistic tubes following low-loss valleys, achieving ensemble-grade calibration with 1/5 the training cost of Deep Ensembles.

DetailsMotivation: Euclidean Laplace approximations struggle with the anisotropic, curved loss surfaces and large symmetry groups in modern deep neural networks, failing to adapt to the complex geometric structure of these models.

Method: TRL models the posterior as a probabilistic tube following low-loss valleys induced by functional symmetries. It uses Fisher/Gauss-Newton metrics to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty, operating as a scalable reparametrized Gaussian approximation with implicit curvature estimates.

Result: On ResNet-18 (CIFAR-10 and CIFAR-100), TRL achieves excellent calibration, matching or exceeding Deep Ensembles in terms of Expected Calibration Error (ECE) while requiring only 1/5 of the training cost.

Conclusion: TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability, providing a practical method for approximate Bayesian inference that adapts to the geometric structure of modern deep neural networks.

Abstract: Laplace approximations are among the simplest and most practical methods for approximate Bayesian inference in neural networks, yet their Euclidean formulation struggles with the highly anisotropic, curved loss surfaces and large symmetry groups that characterize modern deep models. Recent work has proposed Riemannian and geometric Gaussian approximations to adapt to this structure. Building on these ideas, we introduce the Tubular Riemannian Laplace (TRL) approximation. TRL explicitly models the posterior as a probabilistic tube that follows a low-loss valley induced by functional symmetries, using a Fisher/Gauss-Newton metric to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty. We interpret TRL as a scalable reparametrised Gaussian approximation that utilizes implicit curvature estimates to operate in high-dimensional parameter spaces. Our empirical evaluation on ResNet-18 (CIFAR-10 and CIFAR-100) demonstrates that TRL achieves excellent calibration, matching or exceeding the reliability of Deep Ensembles (in terms of ECE) while requiring only a fraction (1/5) of the training cost. TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability.

[410] Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning

Soham Pahari, M. Srinivas

Main category: cs.LG

TL;DR: ViReLoc is a visual reasoning framework for navigation and localization that uses only visual representations without GPS, improving spatial reasoning through geometric understanding and cross-view alignment.

DetailsMotivation: Current multimodal reasoning systems rely too heavily on textual information, limiting their effectiveness in spatial tasks like navigation and geo-localization where visual and geometric understanding is crucial.

Method: Proposes Geo-Consistent Visual Planning framework (ViReLoc) that learns spatial dependencies and geometric relations through visual representations, uses step-by-step visual inference, reinforcement learning objectives, contrastive learning, and adaptive feature interaction to align cross-view perspectives.

Result: Experiments show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance across diverse navigation and localization scenarios.

Conclusion: Visual reasoning serves as a strong complementary approach for navigation and localization, enabling GPS-free solutions that are more secure while maintaining effectiveness in spatial tasks.

Abstract: Multimodal intelligence has recently shown strong progress in visual understanding and high-level reasoning. However, most reasoning systems still rely on textual information as the main medium for inference, which limits their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discusses the potential scope of this field and proposes a visual reasoning paradigm, Geo-Consistent Visual Planning, realized in a framework called Visual Reasoning for Localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often fails to capture. By encoding step-by-step inference in the visual domain and optimizing with reinforcement-based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real-time global positioning system data, leading to more secure navigation solutions.

[411] Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models

Lars van der Laan, Aurelien Bibaut, Nathan Kallus

Main category: cs.LG

TL;DR: A semiparametric framework for debiased inverse reinforcement learning that enables statistically efficient inference for reward-dependent functionals in IRL and DDC models, allowing flexible nonparametric estimation while achieving √n-consistency and semiparametric efficiency.

DetailsMotivation: Existing IRL methods rely on machine learning but lack guarantees for valid inference, while classical DDC approaches impose restrictive parametric specifications and require repeated dynamic programming. There's a need for a framework that combines the flexibility of machine learning with statistical guarantees.

Method: Develops a semiparametric framework showing that the log-behavior policy acts as a pseudo-reward that identifies policy value differences and the reward itself. Formalizes targets as smooth functionals of behavior policy and transition kernel, establishes pathwise differentiability, derives efficient influence functions, and constructs automatic debiased machine-learning estimators.

Result: Achieves √n-consistency, asymptotic normality, and semiparametric efficiency while allowing flexible nonparametric estimation of nuisance components. Extends classical inference for DDC models to nonparametric rewards and modern machine-learning tools.

Conclusion: Provides a unified and computationally tractable approach to statistical inference in IRL that bridges the gap between flexible machine learning methods and statistically valid inference, enabling reliable inference for reward functions in sequential decision-making problems.

Abstract: Inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models explain sequential decision-making by recovering reward functions that rationalize observed behavior. Flexible IRL methods typically rely on machine learning but provide no guarantees for valid inference, while classical DDC approaches impose restrictive parametric specifications and often require repeated dynamic programming. We develop a semiparametric framework for debiased inverse reinforcement learning that yields statistically efficient inference for a broad class of reward-dependent functionals in maximum entropy IRL and Gumbel-shock DDC models. We show that the log-behavior policy acts as a pseudo-reward that point-identifies policy value differences and, under a simple normalization, the reward itself. We then formalize these targets, including policy values under known and counterfactual softmax policies and functionals of the normalized reward, as smooth functionals of the behavior policy and transition kernel, establish pathwise differentiability, and derive their efficient influence functions. Building on this characterization, we construct automatic debiased machine-learning estimators that allow flexible nonparametric estimation of nuisance components while achieving $\sqrt{n}$-consistency, asymptotic normality, and semiparametric efficiency. Our framework extends classical inference for DDC models to nonparametric rewards and modern machine-learning tools, providing a unified and computationally tractable approach to statistical inference in IRL.

[412] Sparse classification with positive-confidence data in high dimensions

The Tien Mai, Mai Anh Nguyen, Trung Nghia Nguyen

Main category: cs.LG

TL;DR: Proposes sparse regularization methods for high-dimensional Positive-Confidence (Pconf) classification, using Lasso, SCAD, and MCP penalties to achieve near-minimax optimal sparse recovery rates comparable to fully supervised approaches.

DetailsMotivation: High-dimensional learning problems require sparse regularization, but existing techniques are underexplored in weak-supervision settings like Pconf classification. Pconf learning uses only positive samples with confidence scores, avoiding negative data, but current Pconf methods are ill-suited for high-dimensional regimes.

Method: Proposes a novel sparse-penalization framework for high-dimensional Pconf classification using convex (Lasso) and non-convex (SCAD, MCP) penalties to address shrinkage bias and improve feature recovery. Develops an efficient proximal gradient algorithm to solve the composite objective.

Result: Establishes theoretical estimation and prediction error bounds for L1-regularized Pconf estimator, proving it achieves near minimax-optimal sparse recovery rates under Restricted Strong Convexity condition. Extensive simulations show predictive performance and variable selection accuracy comparable to fully supervised approaches.

Conclusion: The proposed methods effectively bridge the gap between weak supervision and high-dimensional statistics, enabling effective sparse regularization in Pconf classification settings where only positive samples with confidence scores are available.

Abstract: High-dimensional learning problems, where the number of features exceeds the sample size, often require sparse regularization for effective prediction and variable selection. While established for fully supervised data, these techniques remain underexplored in weak-supervision settings such as Positive-Confidence (Pconf) classification. Pconf learning utilizes only positive samples equipped with confidence scores, thereby avoiding the need for negative data. However, existing Pconf methods are ill-suited for high-dimensional regimes. This paper proposes a novel sparse-penalization framework for high-dimensional Pconf classification. We introduce estimators using convex (Lasso) and non-convex (SCAD, MCP) penalties to address shrinkage bias and improve feature recovery. Theoretically, we establish estimation and prediction error bounds for the L1-regularized Pconf estimator, proving it achieves near minimax-optimal sparse recovery rates under a Restricted Strong Convexity condition. To solve the resulting composite objective, we develop an efficient proximal gradient algorithm. Extensive simulations demonstrate that our proposed methods achieve predictive performance and variable selection accuracy comparable to fully supervised approaches, effectively bridging the gap between weak supervision and high-dimensional statistics.
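
A minimal sketch of the pieces described above: a positive-confidence risk (following the usual Pconf rewriting, with confidences r_i = P(y=+1 | x_i), which should be read as an assumption rather than this paper's final estimator) minimized by proximal gradient descent with the Lasso soft-thresholding operator.

```python
import torch

def pconf_risk(w, X, r):
    """Positive-confidence risk with logistic loss; r holds P(y=+1 | x)."""
    margin = X @ w
    pos = torch.nn.functional.softplus(-margin)   # loss on the positive label
    neg = torch.nn.functional.softplus(margin)    # loss on the flipped label
    return (pos + (1 - r) / r * neg).mean()

def soft_threshold(w, t):
    return torch.sign(w) * torch.clamp(w.abs() - t, min=0.0)  # Lasso prox

def proximal_gradient(X, r, lam=0.05, lr=0.1, steps=500):
    w = torch.zeros(X.shape[1], requires_grad=True)
    for _ in range(steps):
        grad, = torch.autograd.grad(pconf_risk(w, X, r), w)
        with torch.no_grad():
            w = soft_threshold(w - lr * grad, lr * lam)   # gradient step + prox
        w.requires_grad_(True)
    return w.detach()

X, r = torch.randn(200, 50), torch.rand(200) * 0.5 + 0.5  # toy positive data
w_hat = proximal_gradient(X, r)
```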

[413] Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics

Akash Samanta, Sheldon Williamson

Main category: cs.LG

TL;DR: A diagnostic-driven adaptive learning framework that decomposes error evolution into bias, noise, and alignment components to provide stable, interpretable adaptation in dynamic environments across supervised optimization, RL, and learned optimizers.

DetailsMotivation: Current learning methods in nonstationary, safety-critical environments suffer from instability, slow convergence, or brittle adaptation because they ignore the temporal structure of error signals while focusing only on gradient statistics.

Method: Proposes a diagnostic-driven framework that decomposes error evolution into three components: bias (persistent drift), noise (stochastic variability), and alignment (repeated directional excitation causing overshoot). These diagnostics are computed online from lightweight statistics of loss or temporal-difference error trajectories, independent of model architecture or task domain.

Result: The bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic RL, and learned optimizers. Derived instantiations include a stabilized supervised optimizer, diagnostic-regulated actor-critic scheme, and diagnostic-conditioned learned optimizer, all with bounded effective updates and stability properties under standard smoothness assumptions.

Conclusion: This work elevates error evolution to a first-class object in adaptive learning, providing an interpretable, lightweight foundation for reliable learning in dynamic environments by explicitly modeling temporal error structure.

Abstract: Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Building on this framework, we derive diagnostic-driven instantiations including a stabilized supervised optimizer, a diagnostic-regulated actor-critic scheme, and a diagnostic-conditioned learned optimizer. Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to temporal-difference error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
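
One plausible online instantiation of the three diagnostics from lightweight loss-trajectory statistics is sketched below; the EMA constants and the sign-agreement alignment proxy are our assumptions, since the paper defines its own estimators.

```python
class Diagnostics:
    """Online bias / noise / alignment estimates from per-step loss deltas."""

    def __init__(self, beta=0.99):
        self.beta = beta
        self.mean = 0.0        # EMA of loss deltas       -> bias (persistent drift)
        self.var = 0.0         # EMA of squared deviation -> noise (variability)
        self.align = 0.0       # EMA of sign agreement    -> alignment (overshoot risk)
        self.prev_delta = 0.0

    def update(self, delta):
        b = self.beta
        self.mean = b * self.mean + (1 - b) * delta
        self.var = b * self.var + (1 - b) * (delta - self.mean) ** 2
        same_dir = 1.0 if delta * self.prev_delta > 0 else 0.0
        self.align = b * self.align + (1 - b) * same_dir
        self.prev_delta = delta
        return self.mean, self.var, self.align
```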

[414] Generative forecasting with joint probability models

Patrick Wyrod, Ashesh Chattopadhyay, Daniele Venturi

Main category: cs.LG

TL;DR: The paper proposes reframing chaotic system forecasting as a joint generative modeling problem rather than conditional next-step prediction, enabling better uncertainty quantification and long-range statistical accuracy.

DetailsMotivation: Chaotic systems have fundamental forecasting limitations due to sensitivity to initial conditions and unresolved multiscale processes. Existing generative models focus too narrowly on next-step conditional prediction rather than capturing underlying dynamic structure.

Method: Learn joint probability distribution of lagged system states over short temporal windows, then obtain forecasts through marginalization. Introduce model-agnostic training/inference framework with three uncertainty metrics: ensemble variance, short-horizon autocorrelation, and cumulative Wasserstein drift.

Result: Joint generative models outperform conventional conditional next-step models on Lorenz-63 and Kuramoto-Sivashinsky systems, showing improved short-term predictive skill, preserved attractor geometry, and substantially more accurate long-range statistical behavior.

Conclusion: Reframing forecasting as joint generative modeling captures nonlinear temporal dependencies better, enables robust uncertainty quantification without ground truth, and improves both short-term and long-range forecasting of chaotic systems.

Abstract: Chaotic dynamical systems exhibit strong sensitivity to initial conditions and often contain unresolved multiscale processes, making deterministic forecasting fundamentally limited. Generative models offer an appealing alternative by learning distributions over plausible system evolutions; yet, most existing approaches focus on next-step conditional prediction rather than the structure of the underlying dynamics. In this work, we reframe forecasting as a fully generative problem by learning the joint probability distribution of lagged system states over short temporal windows and obtaining forecasts through marginalization. This new perspective allows the model to capture nonlinear temporal dependencies, represent multistep trajectory segments, and produce next-step predictions consistent with the learned joint distribution. We also introduce a general, model-agnostic training and inference framework for joint generative forecasting and show how it enables assessment of forecast robustness and reliability using three complementary uncertainty quantification metrics (ensemble variance, short-horizon autocorrelation, and cumulative Wasserstein drift), without access to ground truth. We evaluate the performance of the proposed method on two canonical chaotic dynamical systems, the Lorenz-63 system and the Kuramoto-Sivashinsky equation, and show that joint generative models yield improved short-term predictive skill, preserve attractor geometry, and achieve substantially more accurate long-range statistical behaviour than conventional conditional next-step models.
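
The joint-then-marginalize recipe can be made concrete with the simplest joint model available, a Gaussian over lagged windows: fit the joint distribution of (x_t, ..., x_{t+3}) and forecast by conditioning on the observed prefix. The paper uses neural generative models; the Gaussian only makes the marginalization step explicit.

```python
import numpy as np

def fit_joint(series, window=4):
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    return X.mean(axis=0), np.cov(X, rowvar=False)

def forecast(mu, cov, history):
    """Conditional mean of the last window slot given the observed prefix."""
    k = len(history)
    S11, S12 = cov[:k, :k], cov[:k, k:]
    w = np.linalg.solve(S11, S12)                  # Sigma_11^{-1} Sigma_12
    return mu[k:] + w.T @ (history - mu[:k])

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 40, 2000)) + 0.05 * rng.normal(size=2000)
mu, cov = fit_joint(series, window=4)
next_step = forecast(mu, cov, series[-3:])         # predict x_{t+1} from 3 lags
```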

[415] HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors

Hyunjun Kim

Main category: cs.LG

TL;DR: HOLOGRAPH is a framework that formalizes LLM-guided causal discovery using sheaf theory, representing local causal beliefs as sections of a presheaf over variable subsets, with coherent global structure corresponding to global sections and topological obstructions manifesting as non-vanishing sheaf cohomology.

DetailsMotivation: Causal discovery from observational data is fundamentally limited by identifiability constraints. Existing approaches using LLMs as sources of prior causal knowledge rely on heuristic integration that lacks theoretical grounding.

Method: Uses sheaf theory to represent local causal beliefs as sections of a presheaf over variable subsets. Introduces Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Formalizes LLM-guided causal discovery through mathematical foundations.

Result: Experiments on synthetic and real-world benchmarks show competitive performance on causal discovery tasks with 50-100 variables. Sheaf-theoretic analysis reveals that Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), but the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections.

Conclusion: HOLOGRAPH provides rigorous mathematical foundations for LLM-guided causal discovery through sheaf theory, achieving competitive performance while revealing fundamental limitations in locality assumptions for larger graphs with hidden confounders.

Abstract: Causal discovery from observational data remains fundamentally limited by identifiability constraints. Recent work has explored leveraging Large Language Models (LLMs) as sources of prior causal knowledge, but existing approaches rely on heuristic integration that lacks theoretical grounding. We introduce HOLOGRAPH, a framework that formalizes LLM-guided causal discovery through sheaf theory, representing local causal beliefs as sections of a presheaf over variable subsets. Our key insight is that coherent global causal structure corresponds to the existence of a global section, while topological obstructions manifest as non-vanishing sheaf cohomology. We propose the Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations while achieving competitive performance on causal discovery tasks with 50-100 variables. Our sheaf-theoretic analysis reveals that while Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections. Code is available at https://github.com/hyunjun1121/holograph.

[416] Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal

Main category: cs.LG

TL;DR: Standard small-scale data recipe experiments using fixed training configurations produce unreliable conclusions that don’t transfer to full-scale training; using reduced learning rates for proxy models yields results that correlate with fully-tuned large-scale LLM pretraining.

DetailsMotivation: Data teams at AI companies use small proxy models to make critical decisions about pretraining data recipes, but there's limited understanding of whether conclusions from small-scale experiments reliably transfer to full-scale model training. The standard "fair comparison" protocol using identical training configurations across all data recipes is problematic.

Method: The paper introduces a simple patch to the evaluation protocol: using reduced learning rates for proxy model training instead of fixed configurations. This approach aims to identify data recipes that yield the best performance under data-specific tuning. Theoretically, the authors prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss.

Result: The reduced learning rate approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Empirical validation across 23 data recipes covering four critical dimensions of data curation demonstrates dramatic improvements in the reliability of small-scale experiments.

Conclusion: The objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning, not under fixed configurations. The proposed reduced learning rate protocol provides a practical, low-cost solution that improves the reliability of small-scale experiments for guiding full-scale model training decisions.

Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of “fair” comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a standard step. Consequently, we posit that the objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning. To mitigate the high cost of hyperparameter tuning, we introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training. We show that this approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Theoretically, we prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation, demonstrating dramatic improvements in the reliability of small-scale experiments.
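
Operationally, the patched protocol is a one-line change to the recipe-ranking loop: train every proxy with a reduced learning rate rather than one fixed "fair" configuration. The helper below is a sketch with a placeholder train_proxy function and shrink factor.

```python
def rank_recipes(recipes, train_proxy, base_lr=3e-3, shrink=0.25):
    """recipes: {name: dataset}; train_proxy(dataset, lr) -> eval score.
    Both the shrink factor and train_proxy are placeholder assumptions."""
    scores = {name: train_proxy(data, lr=base_lr * shrink)
              for name, data in recipes.items()}
    # ranking under the reduced LR is what the paper finds correlates with
    # fully tuned large-scale runs
    return sorted(scores, key=scores.get, reverse=True)
```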

[417] Generalising E-prop to Deep Networks

Beren Millidge

Main category: cs.LG

TL;DR: Extends E-prop to deep recurrent networks, enabling online credit assignment across both time and depth without backpropagation through time.

DetailsMotivation: Existing biologically plausible learning algorithms (RTRL, E-prop) only handle single-layer recurrent networks, but real brain learning spans multiple layers with hierarchical dynamics in both depth and time. BPTT is biologically implausible due to its requirement to store and replay all states backwards.

Method: Extends the E-prop framework to arbitrarily deep networks by deriving a novel recursion relationship across depth that extends eligibility traces to deeper layers, maintaining online forward updates without backpropagation through time.

Result: Develops an online learning algorithm that can perform accurate credit assignment across both time and depth simultaneously, enabling training of deep recurrent networks without BPTT.

Conclusion: Demonstrates that biologically plausible online learning can handle deep recurrent architectures, bridging the gap between single-layer recurrent learning models and the multi-layer hierarchical learning observed in biological brains.

Abstract: Recurrent networks are typically trained with backpropagation through time (BPTT). However, BPTT requires storing the history of all states in the network and then replaying them sequentially backwards in time, a computation that appears extremely implausible for the brain to implement. Real Time Recurrent Learning (RTRL) proposes a mathematically equivalent alternative in which gradient information is propagated forwards in time locally, alongside the regular forward pass; however, it has significantly greater computational complexity than BPTT, which renders it impractical for large networks. E-prop proposes an approximation of RTRL that reduces its complexity to the level of BPTT while maintaining a purely online forward update, implementable as an eligibility trace at each synapse. Works on RTRL and E-prop, however, ubiquitously investigate learning in a single layer with recurrent dynamics, whereas learning in the brain spans multiple layers and involves hierarchical dynamics in both depth and time. In this mathematical note, we extend the E-prop framework to handle arbitrarily deep networks, deriving a novel recursion relationship across depth that extends the eligibility traces of E-prop to deeper layers. Our results thus demonstrate that an online learning algorithm can perform accurate credit assignment across both time and depth simultaneously, allowing the training of deep recurrent networks without backpropagation through time.
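
For intuition, here is a minimal e-prop-style sketch for a single leaky linear unit h_t = α·h_{t-1} + W·x_t, where the eligibility trace accumulates forward and no backward pass through time is needed; the recursion across depth that is this note's contribution is omitted.

```python
import numpy as np

def eprop_leaky(xs, ys, dim_in, dim_h, alpha=0.9, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(dim_h, dim_in))
    h, eps, grad = np.zeros(dim_h), np.zeros(dim_in), np.zeros_like(W)
    for x, y in zip(xs, ys):
        h = alpha * h + W @ x     # forward dynamics of the leaky unit
        eps = alpha * eps + x     # eligibility trace, updated online
        L = h - y                 # local learning signal (squared-error case)
        grad += np.outer(L, eps)  # accumulates dE/dW without any backward pass
    return W - lr * grad
```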

[418] More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization

Yuma Ichikawa, Yoshihiko Fujisawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa

Main category: cs.LG

TL;DR: MDBF improves extreme low-bit LLM quantization by replacing DBF’s restrictive single envelope with rank-l envelopes while keeping binary sign bases, boosting accuracy without changing inference efficiency.

DetailsMotivation: Double Binary Factorization (DBF) is attractive for efficient LLM inference but has restrictive scaling parameters where all rank components share the same magnitude profile, causing performance saturation.

Method: Propose Multi-envelope DBF (MDBF) that retains shared 1-bit sign bases but replaces single envelope with rank-l envelope. Uses closed-form initialization and alternating refinement optimization while maintaining binary carrier and deployment-friendly inference.

Result: MDBF enhances perplexity and zero-shot accuracy over previous binary formats across LLaMA and Qwen families at matched bits per weight while preserving same inference primitive.

Conclusion: MDBF successfully addresses DBF’s limitations by improving magnitude expressiveness within memory budget while maintaining efficient binary inference, achieving better accuracy for extreme low-bit LLM quantization.

Abstract: For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
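
The reconstruction implied by the abstract can be sketched as a shared binary carrier modulated elementwise by a rank-l envelope; the exact DBF/MDBF factorization may differ, so treat this as an illustration of the "shared signs, richer magnitudes" idea only.

```python
import torch

m, n, k, l = 64, 64, 16, 4
S1 = torch.randint(0, 2, (m, k)).float() * 2 - 1    # shared {-1, +1} sign basis
S2 = torch.randint(0, 2, (k, n)).float() * 2 - 1
U, V = torch.rand(m, l), torch.rand(l, n)           # nonnegative envelope factors

carrier = S1 @ S2                                   # shared binary carrier
envelope = U @ V                                    # rank-l magnitude profile (vs. rank 1 in DBF)
W_hat = envelope * carrier                          # elementwise modulation
```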

[419] From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme

Xueyan Li, Yingyi Xue, Mengjie Jiang, Qingzi Zhu, Yazhe Niu

Main category: cs.LG

TL;DR: HUMOR is a framework that enhances VLMs for humorous meme generation through hierarchical reasoning and group-wise preference alignment.

DetailsMotivation: Generating humorous memes requires nuanced multimodal reasoning beyond simple image-to-caption tasks, involving visual content understanding, contextual cues, and subjective humor perception.

Method: 1) Hierarchical multi-path Chain-of-Thought reasoning: template-level intent identification → diverse reasoning path exploration → anchoring to high-quality context-specific paths. 2) Group-wise pairwise reward model training for subjective humor capture. 3) Group-wise reinforcement learning optimization with theoretical guarantees.

Result: HUMOR empowers various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality in extensive experiments.

Conclusion: The work presents a general training paradigm for open-ended, human-aligned multimodal generation where success is guided by comparative judgment within coherent output groups, applicable beyond memes.

Abstract: Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision: it requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, which traces back from ground-truth captions, enhances reasoning diversity. Our analysis shows that this multi-path exploration with anchoring maintains a high expected humor quality under the practical condition that high-quality paths retain significant probability mass. Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template. Following established theory, this approach ensures a consistent and robust proxy for human preference, even with subjective and noisy labels. The reward model then enables group-wise reinforcement learning optimization, providing a theoretical guarantee of monotonic improvement within the trust region. Extensive experiments show that HUMOR empowers various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality. Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output groups.
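
The group-wise pairwise reward objective is essentially a Bradley-Terry loss restricted to preference pairs drawn from memes that share a template; a minimal sketch follows, with tensor names and the pairing scheme as illustrative assumptions.

```python
import torch

def pairwise_reward_loss(reward_model, winners, losers):
    """winners/losers: features of preferred vs. dispreferred memes drawn
    from the same template group; reward_model maps features -> scalar scores."""
    margin = reward_model(winners) - reward_model(losers)
    return -torch.nn.functional.logsigmoid(margin).mean()  # Bradley-Terry NLL
```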

[420] CPR: Causal Physiological Representation Learning for Robust ECG Analysis under Distribution Shifts

Shunbo Jia, Caizhi Liao

Main category: cs.LG

TL;DR: CPR (Causal Physiological Representation Learning) is a novel defense method for ECG diagnosis models that uses causal disentanglement to separate invariant pathological features from artifacts, achieving robust performance against adversarial attacks with single-pass inference efficiency.

DetailsMotivation: Deep learning ECG models are vulnerable to adversarial perturbations, especially Smooth Adversarial Perturbations (SAP) that mimic biological morphology. Existing defenses face a dilemma: Adversarial Training is computationally prohibitive, while certified methods like Randomized Smoothing introduce significant inference latency, making them impractical for real-time clinical monitoring. The vulnerability stems from models relying on non-robust spurious correlations rather than invariant pathological features.

Method: CPR incorporates a Physiological Structural Prior within a causal disentanglement framework. It models ECG generation via a Structural Causal Model (SCM) and enforces a structural intervention that strictly separates invariant pathological morphology (P-QRS-T complex) from non-causal artifacts. Unlike standard denoising approaches, CPR operates with semantic constraints to learn robust physiological representations.

Result: On PTB-XL dataset, CPR significantly outperforms standard clinical preprocessing methods. Under SAP attacks, CPR achieves an F1 score of 0.632, surpassing Median Smoothing (0.541 F1) by 9.1%. Crucially, CPR matches the certified robustness of Randomized Smoothing while maintaining single-pass inference efficiency.

Conclusion: CPR offers a superior trade-off between robustness, efficiency, and clinical interpretability for ECG diagnosis models. It addresses the critical dilemma in adversarial defense by providing certified robustness without the computational burden of Adversarial Training or the inference latency of Randomized Smoothing, making it practical for real-time clinical monitoring applications.

Abstract: Deep learning models for Electrocardiogram (ECG) diagnosis have achieved remarkable accuracy but exhibit fragility against adversarial perturbations, particularly Smooth Adversarial Perturbations (SAP) that mimic biological morphology. Existing defenses face a critical dilemma: Adversarial Training (AT) provides robustness but incurs a prohibitive computational burden, while certified methods like Randomized Smoothing (RS) introduce significant inference latency, rendering them impractical for real-time clinical monitoring. We posit that this vulnerability stems from the models’ reliance on non-robust spurious correlations rather than invariant pathological features. To address this, we propose Causal Physiological Representation Learning (CPR). Unlike standard denoising approaches that operate without semantic constraints, CPR incorporates a Physiological Structural Prior within a causal disentanglement framework. By modeling ECG generation via a Structural Causal Model (SCM), CPR enforces a structural intervention that strictly separates invariant pathological morphology (P-QRS-T complex) from non-causal artifacts. Empirical results on PTB-XL demonstrate that CPR significantly outperforms standard clinical preprocessing methods. Specifically, under SAP attacks, CPR achieves an F1 score of 0.632, surpassing Median Smoothing (0.541 F1) by 9.1%. Crucially, CPR matches the certified robustness of Randomized Smoothing while maintaining single-pass inference efficiency, offering a superior trade-off between robustness, efficiency, and clinical interpretability.

[421] Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space

Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang

Main category: cs.LG

TL;DR: DLCM is a hierarchical language modeling framework that compresses tokens into variable-length concepts for more efficient computation, achieving better performance with the same FLOPs.

DetailsMotivation: Current LLMs apply uniform computation to all tokens, which is inefficient since language has non-uniform information density: predictable spans waste capacity while semantically critical transitions receive insufficient computation.

Method: DLCM learns semantic boundaries from latent representations to compress tokens into variable-length concepts, shifting computation to a more efficient concept space. Uses hierarchical compression, compression-aware scaling laws, and decoupled μP parametrization for stable training.
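To make the compression step concrete, here is a minimal sketch of pooling token states into variable-length concepts. It is not the paper's code: the boundary indices, which DLCM learns end-to-end, are hard-coded here purely for illustration.

```python
import torch

def pool_concepts(hidden, boundaries):
    """Mean-pool token hidden states into variable-length concept vectors.

    hidden:     (seq_len, d_model) token representations
    boundaries: sorted start index of each concept, beginning with 0
    """
    ends = boundaries[1:] + [hidden.shape[0]]
    return torch.stack([hidden[s:e].mean(dim=0) for s, e in zip(boundaries, ends)])

hidden = torch.randn(16, 64)        # 16 tokens, d_model = 64
boundaries = [0, 3, 4, 9, 12]       # hypothetical; learned in DLCM
concepts = pool_concepts(hidden, boundaries)
print(concepts.shape)               # torch.Size([5, 64]); compression ratio R = 16/5
```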

Result: With R=4 compression (an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute to a higher-capacity reasoning backbone, achieving a +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs.

Conclusion: Hierarchical compression with dynamic concept formation enables more efficient compute allocation in language models, fundamentally changing scaling behavior and improving performance without increasing FLOPs.

Abstract: Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $\mu$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69\% average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.

[422] AutoFed: Manual-Free Federated Traffic Prediction via Personalized Prompt

Zijian Zhao, Yitong Shang, Sen Li

Main category: cs.LG

TL;DR: AutoFed is a novel Personalized Federated Learning framework for traffic prediction that eliminates manual hyper-parameter tuning through prompt learning and client-aligned adapters.

DetailsMotivation: Traffic prediction is crucial for Intelligent Transportation Systems but faces privacy concerns leading to data silos. Federated Learning helps with privacy but struggles with non-IID data across clients. Personalized FL addresses this but requires specialized adaptation for traffic tasks and often relies on impractical hyper-parameter optimization across datasets.

Method: AutoFed introduces a federated representor with client-aligned adapters that distill local data into a compact, globally shared prompt matrix inspired by prompt learning. This prompt conditions a personalized predictor, enabling cross-client knowledge sharing while maintaining local specificity without manual hyper-parameter tuning.
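A rough sketch of the conditioning idea follows, under liberal assumptions: all module names, shapes, and the GRU backbone are hypothetical stand-ins, and the paper's federated representor and adapter are more elaborate than this single-client toy.

```python
import torch
import torch.nn as nn

class PromptConditionedPredictor(nn.Module):
    """Toy sketch: a globally shared prompt matrix conditions a local predictor."""
    def __init__(self, n_prompts=8, d=32, horizon=12):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, d))  # shared across clients
        self.adapter = nn.Linear(d, n_prompts)                  # client-aligned adapter
        self.encoder = nn.GRU(input_size=1, hidden_size=d, batch_first=True)
        self.head = nn.Linear(2 * d, horizon)

    def forward(self, x):                                  # x: (batch, window, 1)
        _, h = self.encoder(x)                             # h: (1, batch, d)
        h = h.squeeze(0)
        weights = torch.softmax(self.adapter(h), dim=-1)   # (batch, n_prompts)
        prompt = weights @ self.prompts                    # client-specific prompt mixture
        return self.head(torch.cat([h, prompt], dim=-1))   # forecast horizon steps

model = PromptConditionedPredictor()
y = model(torch.randn(4, 24, 1))
print(y.shape)   # torch.Size([4, 12])
```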

Result: Extensive experiments on real-world datasets show AutoFed consistently achieves superior performance across diverse scenarios compared to existing methods.

Conclusion: AutoFed provides an effective PFL framework for traffic prediction that overcomes practical deployment limitations by eliminating the need for manual hyper-parameter tuning while enabling privacy-preserving knowledge sharing across clients.

Abstract: Accurate traffic prediction is essential for Intelligent Transportation Systems, including ride-hailing, urban road planning, and vehicle fleet management. However, due to significant privacy concerns surrounding traffic data, most existing methods rely on local training, resulting in data silos and limited knowledge sharing. Federated Learning (FL) offers an efficient solution through privacy-preserving collaborative training; however, standard FL struggles with the non-independent and identically distributed (non-IID) problem among clients. This challenge has led to the emergence of Personalized Federated Learning (PFL) as a promising paradigm. Nevertheless, current PFL frameworks require further adaptation for traffic prediction tasks, such as specialized graph feature engineering, data processing, and network architecture design. A notable limitation of many prior studies is their reliance on hyper-parameter optimization across datasets-information that is often unavailable in real-world scenarios-thus impeding practical deployment. To address this challenge, we propose AutoFed, a novel PFL framework for traffic prediction that eliminates the need for manual hyper-parameter tuning. Inspired by prompt learning, AutoFed introduces a federated representor that employs a client-aligned adapter to distill local data into a compact, globally shared prompt matrix. This prompt then conditions a personalized predictor, allowing each client to benefit from cross-client knowledge while maintaining local specificity. Extensive experiments on real-world datasets demonstrate that AutoFed consistently achieves superior performance across diverse scenarios. The code of this paper is provided at https://github.com/RS2002/AutoFed .

[423] A Scalable Framework for logP Prediction: From Terabyte-Scale Data Integration to Interpretable Ensemble Modeling

Malikussaid, Septian Caesar Floresko, Ade Romadhony, Isman Kurniawan, Warih Maharani, Hilal Hudan Nuha

Main category: cs.LG

TL;DR: Large-scale logP prediction framework using 426,850 compounds from integrated databases, achieving 740x speedup in data processing and optimal performance with stratified ensemble models.

DetailsMotivation: To develop a robust, scalable framework for lipophilicity (logP) prediction using large-scale, well-curated chemical data, addressing data integration challenges and providing actionable guidance for molecular design.

Method: Created computational infrastructure with byte-offset indexing for 740x faster data processing; evaluated multiple modeling approaches (linear models, Random Forest, XGBoost); implemented stratified modeling strategy with specialized models for drug-like molecules (91%) and extreme cases (9%).
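The byte-offset indexing idea can be sketched in a few lines of Python; the paper's on-disk format and implementation details are not specified here, so this shows only the general technique: scan the file once, remember where each record starts, then seek directly instead of re-scanning.

```python
def build_offset_index(path):
    """One sequential pass: record the byte offset where each record starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in iter(f.readline, b""):
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_record(path, offsets, i):
    """O(1) random access: seek straight to record i instead of re-reading the file."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("utf-8").rstrip("\n")
```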

Result: Tree-based ensembles (Random Forest, XGBoost) achieved R²=0.765 and RMSE=0.731; stratified modeling achieved optimal performance with RMSE=0.838 for drug-like subset and R²=0.767 for extreme molecules; molecular weight identified as most important predictor via SHAP analysis.

Conclusion: Well-curated descriptor-based ensemble models remain competitive with state-of-the-art graph neural networks; stratified modeling provides optimal performance; molecular weight is the most important global predictor despite weak bivariate correlation; framework establishes robust baselines for lipophilicity prediction.

Abstract: This study presents a large-scale predictive modeling framework for logP prediction using 426,850 bioactive compounds rigorously curated from the intersection of three authoritative chemical databases: PubChem, ChEMBL, and eMolecules. We developed a novel computational infrastructure to address the data integration challenge, reducing processing time from a projected over 100 days to 3.2 hours through byte-offset indexing architecture, a 740-fold improvement. Our comprehensive analysis revealed critical insights into the multivariate nature of lipophilicity: while molecular weight exhibited weak bivariate correlation with logP, SHAP analysis on ensemble models identified it as the single most important predictor globally. We systematically evaluated multiple modeling approaches, discovering that linear models suffered from inherent heteroskedasticity that classical remediation strategies, including weighted least squares and Box-Cox transformation, failed to address. Tree-based ensemble methods, including Random Forest and XGBoost, proved inherently robust to this violation, achieving an R-squared of 0.765 and RMSE of 0.731 logP units on the test set. Furthermore, a stratified modeling strategy, employing specialized models for drug-like molecules (91 percent of dataset) and extreme cases (nine percent), achieved optimal performance: an RMSE of 0.838 for the drug-like subset and an R-squared of 0.767 for extreme molecules, the highest of all evaluated approaches. These findings provide actionable guidance for molecular design, establish robust baselines for lipophilicity prediction using only 2D descriptors, and demonstrate that well-curated, descriptor-based ensemble models remain competitive with state-of-the-art graph neural network architectures.

[424] HeteroHBA: A Generative Structure-Manipulating Backdoor Attack on Heterogeneous Graphs

Honglin Gao, Lan Zhao, Junhao Ren, Xiang Li, Gaoxi Xiao

Main category: cs.LG

TL;DR: HeteroHBA is a stealthy backdoor attack framework for heterogeneous graph neural networks that injects trigger nodes with diverse features/connections aligned to benign statistics, achieving high attack success while maintaining clean accuracy.

DetailsMotivation: Backdoor attacks on heterogeneous graphs are understudied despite their practical risks. Current attacks may not effectively handle the complexity of heterogeneous graphs with diverse node types and relations.

Method: Uses saliency-based screening to select influential auxiliary neighbors for trigger attachment, synthesizes diverse trigger features/connections to match local context, employs AdaIN+MMD loss for stealthy feature alignment, and optimizes with bilevel objective for attack success and clean accuracy preservation.
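The MMD alignment term can be illustrated with a short, self-contained sketch: a simple biased RBF-kernel estimator on synthetic features. The paper's actual feature extractor and the AdaIN coupling are omitted.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased Maximum Mean Discrepancy estimate with an RBF kernel.

    x: (n, d) trigger features, y: (m, d) benign features (hypothetical inputs).
    A small MMD means the trigger feature distribution is close to benign statistics.
    """
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

trigger = torch.randn(32, 16) + 0.5
benign = torch.randn(128, 16)
print(mmd_rbf(trigger, benign))   # differentiable, so it can be minimized in training
```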

Result: HeteroHBA consistently outperforms prior backdoor baselines on multiple real-world heterogeneous graphs with various HGNN architectures, achieving higher attack success with comparable/smaller impact on clean accuracy, and remains effective against heterogeneity-aware structural defense (CSD).

Conclusion: The work demonstrates practical backdoor risks in heterogeneous graph learning, highlighting the need for stronger defenses against such stealthy attacks that can bypass current heterogeneity-aware detection methods.

Abstract: Heterogeneous graph neural networks (HGNNs) have achieved strong performance in many real-world applications, yet targeted backdoor poisoning on heterogeneous graphs remains less studied. We consider backdoor attacks for heterogeneous node classification, where an adversary injects a small set of trigger nodes and connections during training to force specific victim nodes to be misclassified into an attacker-chosen label at test time while preserving clean performance. We propose HeteroHBA, a generative backdoor framework that selects influential auxiliary neighbors for trigger attachment via saliency-based screening and synthesizes diverse trigger features and connection patterns to better match the local heterogeneous context. To improve stealthiness, we combine Adaptive Instance Normalization (AdaIN) with a Maximum Mean Discrepancy (MMD) loss to align the trigger feature distribution with benign statistics, thereby reducing detectability, and we optimize the attack with a bilevel objective that jointly promotes attack success and maintains clean accuracy. Experiments on multiple real-world heterogeneous graphs with representative HGNN architectures show that HeteroHBA consistently achieves higher attack success than prior backdoor baselines with comparable or smaller impact on clean accuracy; moreover, the attack remains effective under our heterogeneity-aware structural defense, CSD. These results highlight practical backdoor risks in heterogeneous graph learning and motivate the development of stronger defenses.

[425] Mobility-Assisted Decentralized Federated Learning: Convergence Analysis and A Data-Driven Approach

Reza Jahani, Md Farhamdur Reza, Richeng Jin, Huaiyu Dai

Main category: cs.LG

TL;DR: DFL performance degrades due to limited connectivity and data heterogeneity. User mobility can enhance information flow in sparse networks, improving DFL convergence. The paper proposes a DFL framework with induced mobility patterns to optimize information propagation.

DetailsMotivation: DFL suffers from performance degradation due to limited connectivity and data heterogeneity in sparse networks. User mobility (natural or induced) can act as relays/bridges to enhance information flow, but its impact on DFL has been largely overlooked despite its potential in next-generation wireless networks.

Method: 1) Establish convergence of DFL in sparse networks under user mobility, showing even random movement of a fraction of users can boost performance. 2) Propose a DFL framework that utilizes mobile users with induced mobility patterns, allowing them to exploit knowledge of data distribution to determine trajectories for enhanced information propagation.
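As a toy illustration of why mobility helps, the following sketch (not the paper's algorithm) runs gossip averaging on two sparse clusters that are bridged only by one randomly moving node; without that node, the graph is disconnected and consensus is impossible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rounds = 8, 50
models = rng.normal(size=n)               # one scalar "model" per client

def mixing_matrix(edges, n):
    """Metropolis gossip weights for an undirected edge list."""
    W = np.eye(n)
    deg = np.zeros(n)
    for i, j in edges:
        deg[i] += 1; deg[j] += 1
    for i, j in edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, j] = W[j, i] = w
        W[i, i] -= w; W[j, j] -= w
    return W

# two chains 0-1-2-3 and 4-5-6-7; node 0 is "mobile" and attaches to a
# random node of the far cluster each round, acting as a bridge
base = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
for t in range(rounds):
    edges = base + [(0, rng.integers(4, 8))]
    models = mixing_matrix(edges, n) @ models
print(np.std(models))   # spread shrinks toward consensus despite sparse topology
```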

Result: Theoretical demonstration that mobility improves DFL convergence. Empirical confirmation through extensive experiments showing superiority over baselines. Comprehensive analysis of how various network parameters influence DFL performance in mobile networks.

Conclusion: Mobility significantly enhances DFL performance in sparse networks. The proposed framework with induced mobility patterns effectively improves information propagation and DFL convergence, offering practical solutions for next-generation wireless networks with mobile users.

Abstract: Decentralized Federated Learning (DFL) has emerged as a privacy-preserving machine learning paradigm that enables collaborative training among users without relying on a central server. However, its performance often degrades significantly due to limited connectivity and data heterogeneity. As we move toward the next generation of wireless networks, mobility is increasingly embedded in many real-world applications. The user mobility, either natural or induced, enables clients to act as relays or bridges, thus enhancing information flow in sparse networks; however, its impact on DFL has been largely overlooked despite its potential. In this work, we systematically investigate the role of mobility in improving DFL performance. We first establish the convergence of DFL in sparse networks under user mobility and theoretically demonstrate that even random movement of a fraction of users can significantly boost performance. Building upon this insight, we propose a DFL framework that utilizes mobile users with induced mobility patterns, allowing them to exploit the knowledge of data distribution to determine their trajectories to enhance information propagation through the network. Through extensive experiments, we empirically confirm our theoretical findings, validate the superiority of our approach over baselines, and provide a comprehensive analysis of how various network parameters influence DFL performance in mobile networks.

[426] Nested Learning: The Illusion of Deep Learning Architectures

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

Main category: cs.LG

TL;DR: Nested Learning (NL) is a new paradigm representing ML models as nested optimization problems, enabling higher-order in-context learning and continual learning capabilities.

DetailsMotivation: Despite advances in Language Models, fundamental challenges remain in continual learning, self-improvement, and finding effective solutions. Current models lack coherent frameworks for multi-level learning and memory systems.

Method: Proposes Nested Learning (NL) paradigm with three core contributions: 1) Expressive Optimizers as associative memory modules, 2) Self-Modifying Learning Module that learns its own update algorithm, and 3) Continuum Memory System generalizing traditional memory views. Combines these into Hope continual learning module.

Result: Hope module shows promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.

Conclusion: NL provides a philosophical framework for designing more expressive learning algorithms with multiple levels, enabling higher-order in-context learning and potentially unlocking effective continual learning capabilities beyond current approaches.

Abstract: Despite the recent progresses, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each of which with its own context flow. Through the lenses of NL, existing deep learning methods learns from data through compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients’ information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL’s insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory system that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, and few-shot generalization tasks, continual learning, and long-context reasoning tasks.

[427] Many Minds from One Model: Bayesian Transformers for Population Intelligence

Diji Yang, Yi Zhang

Main category: cs.LG

TL;DR: Population Bayesian Transformers (B-Trans) convert standard LLMs into Bayesian models that can sample diverse model instances from a single pre-trained weight set, enabling population-based decision-making for better exploration and performance.

DetailsMotivation: Modern transformers are trained as single-minded deterministic systems, but intelligence emerges from many minds. The authors want to create models that can represent multiple hypotheses about data rather than just one deterministic function.

Method: B-Trans introduces a Bayesian posterior proxy by treating bias-like offsets in normalization layers as stochastic variables with Gaussian variational approximation. This creates a distribution over model behavior without full Bayesian neural network training. Sampling yields diverse model instances while maintaining competence. To ensure coherence, sampled noise is frozen at sequence level for temporal consistency.
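A minimal sketch of the core mechanism, assuming a Gaussian variational posterior over a LayerNorm-style bias: the class and parameter names are hypothetical, and the frozen noise mimics the sequence-level sampling described above.

```python
import torch
import torch.nn as nn

class StochasticBiasLayerNorm(nn.Module):
    """Treat the normalization bias as Gaussian with learned mean/scale, and
    freeze one noise sample per sequence so all tokens share one 'individual'."""
    def __init__(self, d):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.bias_mu = nn.Parameter(torch.zeros(d))
        self.bias_logstd = nn.Parameter(torch.full((d,), -3.0))
        self._frozen_eps = None

    def resample(self):                        # call once per generated sequence
        self._frozen_eps = torch.randn_like(self.bias_mu)

    def forward(self, x):
        eps = self._frozen_eps if self._frozen_eps is not None else 0.0
        bias = self.bias_mu + torch.exp(self.bias_logstd) * eps
        x = nn.functional.layer_norm(x, x.shape[-1:], weight=self.weight)
        return x + bias

ln = StochasticBiasLayerNorm(16)
ln.resample()                                  # sample one "mind" for the sequence
print(ln(torch.randn(2, 5, 16)).shape)         # torch.Size([2, 5, 16])
```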

Result: Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels show that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance than deterministic baselines.

Conclusion: B-Trans successfully transforms standard LLMs into Bayesian transformers that support sampling diverse model instances, enabling population-level decision-making that enhances exploration and performance through collective intelligence.

Abstract: Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerge from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model to supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverage the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.

[428] Causal Discovery with Mixed Latent Confounding via Precision Decomposition

Amir Asiaee, Samhita Pal, James O’quinn, James P. Long

Main category: cs.LG

TL;DR: DCL-DECOR: A precision-based pipeline for causal discovery in linear Gaussian systems with mixed latent confounding, separating pervasive global confounders from local dependencies.

DetailsMotivation: Existing methods struggle with mixed latent confounding where some unobserved factors affect many variables (global) while others affect only small subsets (local). Differentiable DAG learners misinterpret global latent effects as causal edges, while latent-variable models only recover undirected structure.

Method: A modular pipeline that first isolates pervasive latent effects by decomposing the observed precision matrix into structured and low-rank components. The structured component captures local dependencies after removing pervasive confounders. Then applies correlated-noise DAG learning to this deconfounded representation, followed by bow-freeness enforcement.

Result: Provides identifiability results characterizing the recoverable causal targets under mixed confounding. Synthetic experiments that vary the strength and dimensionality of pervasive confounding show consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to confounded data.

Conclusion: DCL-DECOR effectively addresses mixed latent confounding by separating global and local confounding effects, enabling more accurate causal discovery in practical settings where both types of confounding coexist.

Abstract: We study causal discovery from observational data in linear Gaussian systems affected by \emph{mixed latent confounding}, where some unobserved factors act broadly across many variables while others influence only small subsets. This setting is common in practice and poses a challenge for existing methods: differentiable and score-based DAG learners can misinterpret global latent effects as causal edges, while latent-variable graphical models recover only undirected structure. We propose \textsc{DCL-DECOR}, a modular, precision-led pipeline that separates these roles. The method first isolates pervasive latent effects by decomposing the observed precision matrix into a structured component and a low-rank component. The structured component corresponds to the conditional distribution after accounting for pervasive confounders and retains only local dependence induced by the causal graph and localized confounding. A correlated-noise DAG learner is then applied to this deconfounded representation to recover directed edges while modeling remaining structured error correlations, followed by a simple reconciliation step to enforce bow-freeness. We provide identifiability results that characterize the recoverable causal target under mixed confounding and show how the overall problem reduces to well-studied subproblems with modular guarantees. Synthetic experiments that vary the strength and dimensionality of pervasive confounding demonstrate consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to the confounded data.

[429] Scaling Open-Ended Reasoning to Predict the Future

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

Main category: cs.LG

TL;DR: The paper presents OpenForecaster 8B, a specialized language model trained on synthesized forecasting questions from news events that matches larger proprietary models in accuracy, calibration, and consistency on open-ended forecasting tasks.

DetailsMotivation: High-stakes decision making requires reasoning under uncertainty about the future, but existing language models lack specialized training for open-ended forecasting questions. The authors aim to create accessible forecasting capabilities by developing a scalable training approach using news data.

Method: 1) Automated synthesis of forecasting questions from global news events using careful curation; 2) Training Qwen3 thinking models on OpenForesight dataset; 3) Using offline news corpus to prevent future information leakage; 4) Incorporating retrieval and improved reward function for RL; 5) Held-out testing from May to August 2025.

Result: OpenForecaster 8B matches much larger proprietary models in forecasting accuracy, calibration, and consistency. The training improves calibration across popular benchmarks, and all models, code, and data are open-sourced.

Conclusion: Specialized forecasting training on synthesized news-based questions enables language models to achieve competitive forecasting performance while maintaining calibration improvements that generalize across benchmarks, making forecasting research broadly accessible through open-source release.

Abstract: High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing between May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

[430] BandiK: Efficient Multi-Task Decomposition Using a Multi-Bandit Framework

András Millinghoffer, András Formanek, András Antos, Péter Antal

Main category: cs.LG

TL;DR: BandiK is a three-stage multi-task auxiliary task selection method using multi-bandits to efficiently identify beneficial auxiliary task sets for each target task, addressing computational cost and negative transfer issues.

DetailsMotivation: The paper addresses challenges in multi-task learning: high computational cost of evaluating auxiliary task sets, exponential number of candidate sets, varying selection complexity across tasks, and the problem of negative transfer when selecting inappropriate auxiliary tasks.

Method: BandiK uses a three-stage approach: 1) Estimates pairwise transfers between tasks to identify beneficial joint learning pairs, 2) Constructs linear number of candidate auxiliary sets per target task (reducing exponential search space), 3) Employs Multi-Armed Bandit framework where arms represent candidate auxiliary sets realized as multiple output neural networks, integrated into a multi-bandit structure that exploits overlapping arms across tasks.
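The third stage can be illustrated with a plain UCB1 loop over candidate auxiliary sets; synthetic noisy rewards stand in for training and testing a multi-output network on a random split, and the multi-bandit arm sharing across target tasks is not shown.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

true_value = np.array([0.70, 0.74, 0.72, 0.65])   # unknown mean score per candidate set
def eval_candidate(a):
    """Stand-in for one arm pull: train/test on a random split, return the score."""
    return true_value[a] + 0.05 * rng.normal()

K, T = len(true_value), 200
counts, sums = np.zeros(K), np.zeros(K)
for a in range(K):                                # pull each arm once to initialize
    counts[a], sums[a] = 1, eval_candidate(a)
for t in range(K, T):
    ucb = sums / counts + np.sqrt(2 * math.log(t + 1) / counts)
    a = int(np.argmax(ucb))                       # optimism in the face of uncertainty
    counts[a] += 1
    sums[a] += eval_candidate(a)
print("most-pulled candidate set:", int(np.argmax(counts)))   # likely index 1
```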

Result: The method significantly reduces computational cost by shrinking the search space from an exponential to a linear number of candidate sets and by using efficient bandit-based evaluation. The multi-bandit structure exploits the fact that the same neural network realizes multiple arms across different bandits, further improving efficiency.

Conclusion: BandiK provides an efficient solution for auxiliary task selection in multi-task learning by addressing computational bottlenecks through pairwise transfer estimation, linear candidate set construction, and a novel multi-bandit framework that leverages overlapping neural network realizations across tasks.

Abstract: The challenge of effectively transferring knowledge across multiple tasks is of critical importance and is also present in downstream tasks with foundation models. However, the nature of transfer, its transitive-intransitive nature, is still an open problem, and negative transfer remains a significant obstacle. Selection of beneficial auxiliary task sets in multi-task learning is frequently hindered by the high computational cost of their evaluation, the high number of plausible candidate auxiliary sets, and the varying complexity of selection across target tasks. To address these constraints, we introduce BandiK, a novel three-stage multi-task auxiliary task subset selection method using multi-bandits, where each arm pull evaluates candidate auxiliary sets by training and testing a multiple output neural network on a single random train-test dataset split. Firstly, BandiK estimates the pairwise transfers between tasks, which helps in identifying which tasks are likely to benefit from joint learning. In the second stage, it constructs a linear number of candidate sets of auxiliary tasks (in the number of all tasks) for each target task based on the initial estimations, significantly reducing the exponential number of potential auxiliary task sets. Thirdly, it employs a Multi-Armed Bandit (MAB) framework for each task, where the arms correspond to the performance of candidate auxiliary sets realized as multiple output neural networks over train-test data set splits. To enhance efficiency, BandiK integrates these individual task-specific MABs into a multi-bandit structure. The proposed multi-bandit solution exploits that the same neural network realizes multiple arms of different individual bandits corresponding to a given candidate set. This semi-overlapping arm property defines a novel multi-bandit cost/reward structure utilized in BandiK.

[431] FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference

Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu

Main category: cs.LG

TL;DR: LLM optimization framework combining N:M structured pruning and 4-bit quantization with FPGA acceleration for efficient deployment in resource-constrained environments.

DetailsMotivation: LLMs have high computation and memory requirements that hinder deployment in resource-constrained environments, requiring optimization techniques to reduce costs while maintaining performance.

Method: Unified pipeline applying N:M structured pruning and 4-bit integer quantization to reduce memory footprint, followed by optimized dequantization and matrix multiplication. Includes hardware-software co-design with custom FPGA accelerator supporting systolic arrays.
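The N:M pruning step itself is easy to sketch; this NumPy illustration shows the generic technique (not the paper's FPGA pipeline): keep the n largest-magnitude weights in each group of m.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m (e.g. 2:4 sparsity).

    w: 2-D weight matrix whose row length is divisible by m.
    """
    rows, cols = w.shape
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(4, 8)
print(nm_prune(w))   # exactly 2 nonzeros in every group of 4
```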

Result: Achieves 4× weight storage reduction, 1.71× matrix multiplication speedup, and 1.29× end-to-end latency reduction compared to dense GPU baselines. On LLaMA-7B, structured sparsity enhances throughput per token by 1.36×.

Conclusion: Fine-grained N:M sparsity combined with quantization enables efficient LLM inference, while the FPGA accelerator provides flexibility for supporting diverse sparsity patterns beyond fixed hardware constraints.

Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes their deployment in resource-constrained environments. To address this challenge, this work introduces an automation framework that leverages weight pruning and low-bit quantization, and presents a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform. In particular, we implement a unified pipeline that applies N:M structured pruning and 4-bit integer quantization to reduce the memory footprint, followed by optimized dequantization and matrix multiplication to enhance LLM inference on several hardware platforms, including CPUs, NVIDIA GPUs with Dense and 2:4 Sparse Tensor Cores, and a custom systolic-array-based FPGA accelerator. Utilizing 2:4 sparsity combined with quantization on $4096 \times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines. Scaling analysis on the LLaMA-7B model further shows that structured sparsity enhances the throughput per token by $1.36\times$. These results demonstrate the synergy of fine-grained N:M sparsity and quantization for enabling efficient and deployable LLM inference, while the proposed FPGA accelerator offers a flexible architectural path for supporting a broader class of sparsity patterns beyond the fixed 2:4 hardware constraints.

[432] From Trial to Deployment: A SEM Analysis of Traveler Adoptions to Fully Operational Autonomous Taxis

Yutong Cai, Hua Wang

Main category: cs.LG

TL;DR: Study analyzes actual user behavior of autonomous taxis using survey data from Baidu’s Apollo Robotaxi service in Wuhan, identifying key psychological factors influencing adoption.

DetailsMotivation: Most existing research on autonomous taxi acceptance uses hypothetical scenarios rather than actual operational data. This study addresses the gap by examining real user behavior from an operational autonomous taxi service.

Method: Used survey data from 336 actual users of Baidu’s Apollo Robotaxi in Wuhan, incorporating actual service attributes. Applied Structural Equation Modeling to identify six latent psychological constructs and their influence on adoption behavior measured by selection frequency in ten scenarios.

Result: Cost Sensitivity and Behavioral Intention were the strongest positive predictors of autonomous taxi adoption. Other constructs (Trust & Policy Support, Performance, Lifestyle, Education) played more nuanced roles. The model showed strong goodness-of-fit across multiple indices.

Conclusion: Provides empirical evidence to support policymaking, fare design, and public outreach strategies for scaling autonomous taxi deployments in real-world urban settings.

Abstract: Autonomous taxi services represent a transformative advancement in urban mobility, offering safety, efficiency, and round-the-clock operations. While existing literature has explored user acceptance of autonomous taxis through stated preference experiments and hypothetical scenarios, few studies have investigated actual user behavior based on operational AV services. This study addresses that gap by leveraging survey data from Wuhan, China, where Baidu’s Apollo Robotaxi service operates at scale. We design a realistic survey incorporating actual service attributes and collect 336 valid responses from actual users. Using Structural Equation Modeling, we identify six latent psychological constructs, namely Trust & Policy Support, Cost Sensitivity, Performance, Behavioral Intention, Lifestyle, and Education. Their influences on adoption behavior, measured by the selection frequency of autonomous taxis in ten scenarios, are examined and interpreted. Results show that Cost Sensitivity and Behavioral Intention are the strongest positive predictors of adoption, while other latent constructs play more nuanced roles. The model demonstrates strong goodness-of-fit across multiple indices. Our findings offer empirical evidence to support policymaking, fare design, and public outreach strategies for scaling autonomous taxis deployments in real-world urban settings.

[433] Gradient Descent as Implicit EM in Distance-Based Neural Models

Alan Oursland

Main category: cs.LG

TL;DR: Gradient descent on log-sum-exp objectives implicitly performs expectation-maximization, making optimization and inference the same process across neural architectures.

DetailsMotivation: To provide a direct mathematical explanation for why neural networks trained with standard objectives exhibit probabilistic inference behaviors (soft clustering, prototype specialization, Bayesian uncertainty), rather than relying on loose analogies or post-hoc interpretations.

Method: Mathematical derivation showing that for any objective with log-sum-exp structure over distances/energies, the gradient with respect to each distance equals the negative posterior responsibility: ∂L/∂d_j = -r_j. This algebraic identity reveals gradient descent implicitly performs EM.
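The identity is easy to verify numerically. Under one natural sign convention, take L(d) = log Σ_j exp(-d_j); then ∂L/∂d_j = -r_j with r_j the softmax responsibility of component j. A finite-difference check (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.uniform(0.5, 3.0, size=5)                 # distances/energies

def L(d):
    return np.log(np.sum(np.exp(-d)))             # log-sum-exp objective

r = np.exp(-d) / np.sum(np.exp(-d))               # posterior responsibilities

eps = 1e-6
grad = np.array([(L(d + eps * np.eye(5)[j]) - L(d - eps * np.eye(5)[j])) / (2 * eps)
                 for j in range(5)])
print(np.allclose(grad, -r, atol=1e-8))           # True: gradient = -responsibility
```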

Result: The paper demonstrates that gradient descent on log-sum-exp objectives inherently performs expectation-maximization without requiring explicit inference algorithms. This unifies unsupervised mixture modeling, attention mechanisms, and cross-entropy classification under a single mechanism.

Conclusion: Probabilistic inference behaviors in neural networks are not emergent properties but necessary consequences of objective geometry. Optimization and inference are fundamentally the same process when using log-sum-exp objectives.

Abstract: Neural networks trained with standard objectives exhibit behaviors characteristic of probabilistic inference: soft clustering, prototype specialization, and Bayesian uncertainty tracking. These phenomena appear across architectures – in attention mechanisms, classification heads, and energy-based models – yet existing explanations rely on loose analogies to mixture models or post-hoc architectural interpretation. We provide a direct derivation. For any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance is exactly the negative posterior responsibility of the corresponding component: $\partial L / \partial d_j = -r_j$. This is an algebraic identity, not an approximation. The immediate consequence is that gradient descent on such objectives performs expectation-maximization implicitly – responsibilities are not auxiliary variables to be computed but gradients to be applied. No explicit inference algorithm is required because inference is embedded in optimization. This result unifies three regimes of learning under a single mechanism: unsupervised mixture modeling, where responsibilities are fully latent; attention, where responsibilities are conditioned on queries; and cross-entropy classification, where supervision clamps responsibilities to targets. The Bayesian structure recently observed in trained transformers is not an emergent property but a necessary consequence of the objective geometry. Optimization and inference are the same process.

[434] Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks

Shota Suzuki, Satoshi Ono

Main category: cs.LG

TL;DR: Self-supervised learning method for neural architecture search of multimodal DNNs that works with unlabeled data.

DetailsMotivation: Multimodal DNNs benefit from NAS but require large labeled datasets for architecture search, which is expensive and time-consuming to obtain.

Method: Proposes a self-supervised learning approach that applies SSL to both architecture search and model pretraining processes, enabling architecture design from unlabeled data.

Result: Experimental results show the method successfully designs architectures for DNNs using only unlabeled training data.

Conclusion: SSL enables effective neural architecture search for multimodal DNNs without requiring labeled data, addressing the data bottleneck in multimodal NAS.

Abstract: Neural architecture search (NAS), which automates the architectural design process of deep neural networks (DNN), has attracted increasing attention. Multimodal DNNs that necessitate feature fusion from multiple modalities benefit from NAS due to their structural complexity; however, constructing an architecture for multimodal DNNs through NAS requires a substantial amount of labeled training data. Thus, this paper proposes a self-supervised learning (SSL) method for architecture search of multimodal DNNs. The proposed method applies SSL comprehensively for both the architecture search and model pretraining processes. Experimental results demonstrated that the proposed method successfully designed architectures for DNNs from unlabeled training data.

[435] DTI-GP: Bayesian operations for drug-target interactions using deep kernel Gaussian processes

Bence Bolgár, András Millinghoffer, Péter Antal

Main category: cs.LG

TL;DR: DTI-GP: A deep kernel learning Gaussian process model for drug-target interaction prediction with Bayesian uncertainty quantification, enabling rejection schemes, top-K selection, and ranking operations.

DetailsMotivation: Need for precise probabilistic information in DTI predictions to understand limitations and boost performance. Current methods lack proper uncertainty quantification needed for reliable decision-making in drug discovery.

Method: Deep kernel learning-based GP architecture with neural embedding module for compounds and proteins, plus GP module. Uses predictive distribution sampling to estimate Bayesian precedence matrix for selection and ranking operations.
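The precedence-matrix step can be sketched directly; synthetic posterior draws stand in for the GP predictive distribution, and P[i, j] estimates the posterior probability that candidate i outranks candidate j.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 2000, 6                                    # posterior samples x candidates
mu = rng.normal(size=N)
scores = mu + 0.5 * rng.normal(size=(S, N))       # scores[s, i]: draw s, candidate i

# Bayesian precedence matrix: P[i, j] = Pr(candidate i outranks candidate j)
P = (scores[:, :, None] > scores[:, None, :]).mean(axis=0)

ranking = np.argsort(-P.sum(axis=1))              # rank by expected precedence
topk = ranking[:3]                                # Bayesian top-K selection
print(P.round(2), topk)
```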

Result: DTI-GP outperforms state-of-the-art solutions. Enables Bayesian accuracy-confidence enrichment score, rejection schemes for improved enrichment, and top-K selection/ranking with high expected utility.

Conclusion: Gaussian processes provide scalable framework for DTI predictions with Bayesian inference, enabling novel operations like Bayesian classification with rejection, top-K selection, and ranking that are valuable for drug discovery applications.

Abstract: Precise probabilistic information about drug-target interaction (DTI) predictions is vital for understanding limitations and boosting predictive performance. Gaussian processes (GP) offer a scalable framework to integrate state-of-the-art DTI representations and Bayesian inference, enabling novel operations, such as Bayesian classification with rejection, top-$K$ selection, and ranking. We propose a deep kernel learning-based GP architecture (DTI-GP), which incorporates a combined neural embedding module for chemical compounds and protein targets, and a GP module. The workflow continues with sampling from the predictive distribution to estimate a Bayesian precedence matrix, which is used in fast and accurate selection and ranking operations. DTI-GP outperforms state-of-the-art solutions, and it allows (1) the construction of a Bayesian accuracy-confidence enrichment score, (2) rejection schemes for improved enrichment, and (3) estimation and search for top-$K$ selections and ranking with high expected utility.

[436] Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback

Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Main category: cs.LG

TL;DR: OMWU achieves last-iterate linear convergence to Nash equilibrium in NLHF without requiring NE uniqueness, with exponentially better dependence on instance-dependent constants.

DetailsMotivation: Standard preference modeling assumes transitivity, overlooking complex human population preferences. NLHF addresses non-transitive preferences as a game, but existing algorithms use regularization causing bias. Need better convergence guarantees without NE uniqueness assumption.

Method: Analyzes Optimistic Multiplicative Weights Update (OMWU) in the Nash learning from human feedback (NLHF) framework. Proves convergence properties and identifies a novel marginal convergence behavior in which the probability of rarely played actions grows exponentially from exponentially small values.
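For intuition, here is a minimal OMWU loop in the tabular matrix-game setting (rock-paper-scissors; not the NLHF preference-game setup, and the step size is arbitrary). Plain MWU cycles around the equilibrium on this game, while the optimistic correction 2g_t - g_{t-1} yields last-iterate convergence.

```python
import numpy as np

A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])          # rock-paper-scissors payoffs for player x
eta = 0.1
x = np.array([0.6, 0.3, 0.1])               # start away from the equilibrium
y = np.array([0.2, 0.5, 0.3])
gx_prev, gy_prev = A @ y, A.T @ x
for t in range(5000):
    gx, gy = A @ y, A.T @ x                 # current utility gradients
    x = x * np.exp(eta * (2 * gx - gx_prev))     # x maximizes x^T A y
    y = y * np.exp(-eta * (2 * gy - gy_prev))    # y minimizes x^T A y
    x, y = x / x.sum(), y / y.sum()
    gx_prev, gy_prev = gx, gy
print(x.round(3), y.round(3))               # both approach the uniform Nash equilibrium
```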

Result: First convergence guarantee for OMWU in NLHF: achieves last-iterate linear convergence after burn-in when NE with full support exists, with instance-dependent linear rate to original NE. No NE uniqueness assumption needed. Exponentially better dependence on instance constants than prior work.

Conclusion: OMWU shows strong theoretical convergence properties for NLHF, addressing non-transitive preferences without regularization bias. Experimental results in tabular and neural policies demonstrate practical potential for LLM alignment applications.

Abstract: Aligning large language models (LLMs) with human preferences has proven effective for enhancing model capabilities, yet standard preference modeling using the Bradley-Terry model assumes transitivity, overlooking the inherent complexity of human population preferences. Nash learning from human feedback (NLHF) addresses this by framing non-transitive preferences as a two-player zero-sum game, where alignment reduces to finding the Nash equilibrium (NE). However, existing algorithms typically rely on regularization, incurring unavoidable bias when computing the duality gap in the original game. In this work, we provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF, showing that it achieves last-iterate linear convergence after a burn-in phase whenever an NE with full support exists, with an instance-dependent linear convergence rate to the original NE, measured by duality gaps. Compared to prior results in Wei et al. (2020), we do not require the assumption of NE uniqueness. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values, enabling exponentially better dependence on instance-dependent constants than prior results. Experiments corroborate the theoretical strengths of $\mathtt{OMWU}$ in both tabular and neural policy classes, demonstrating its potential for LLM applications.

[437] Discovering Coordinated Joint Options via Inter-Agent Relative Dynamics

Raul D. Steleac, Mohan Sridharan, David Abel

Main category: cs.LG

TL;DR: Novel multi-agent option discovery method using joint-state abstraction and neural graph Laplacian to discover strongly coordinated behaviors through state synchronization patterns.

DetailsMotivation: Multi-agent settings suffer from exponential state space growth, making coordinated behaviors valuable but challenging to discover. Existing methods sacrifice coordination by producing loosely coupled or independent behaviors.

Method: Proposes joint-state abstraction that compresses state space while preserving coordination information. Uses Fermat state (maximal team alignment) to measure spreadness (team-level misalignment), then employs neural graph Laplacian estimator to derive options capturing state synchronization patterns between agents.
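The Fermat state is a fictitious point of maximal alignment with the team. Assuming it is formalized as a geometric median, one standard way to approximate such a point is Weiszfeld's algorithm; the sketch below works under that assumption and is not the paper's estimator.

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-9):
    """Weiszfeld's algorithm: the point minimizing summed distances to all agent states."""
    x = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        d = np.maximum(d, tol)               # avoid division by zero at a data point
        w = 1.0 / d
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

agents = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])   # per-agent states
fermat = geometric_median(agents)
spread = np.abs(agents - fermat)      # per-dimension misalignment ("spreadness")
print(fermat, spread.sum(axis=0))
```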

Result: Evaluated across multiple scenarios in two multi-agent domains, showing that discovered options yield stronger downstream coordination capabilities compared to alternative option discovery methods.

Conclusion: The approach successfully discovers strongly coordinated multi-agent options by leveraging state synchronization as a natural foundation for coordination, addressing limitations of existing methods.

Abstract: Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the \textit{Fermat} state, and use it to define a measure of \textit{spreadness}, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.

[438] AODDiff: Probabilistic Reconstruction of Aerosol Optical Depth via Diffusion-based Bayesian Inference

Linhao Fan, Hongqiang Fang, Jingyang Dai, Yong Jiang, Qixing Zhang

Main category: cs.LG

TL;DR: AODDiff: A diffusion-based Bayesian framework for probabilistic reconstruction of Aerosol Optical Depth fields with uncertainty quantification, handling incomplete training data and various reconstruction tasks without retraining.

DetailsMotivation: Current AOD reconstruction models are limited by scarce complete training data and lack uncertainty quantification, which is critical for atmospheric monitoring applications.

Method: Proposes AODDiff with two key strategies: 1) corruption-aware training to learn spatiotemporal AOD prior from naturally incomplete data, and 2) decoupled annealing posterior sampling to integrate heterogeneous observations as constraints for guided generation.

Result: Validated on Reanalysis data, AODDiff shows efficacy and robustness across downscaling and inpainting tasks, maintaining high spatial spectral fidelity and enabling uncertainty quantification through multiple sampling.

Conclusion: AODDiff provides a flexible probabilistic framework for AOD reconstruction that addresses data scarcity issues, supports various tasks without retraining, and offers critical uncertainty quantification for downstream applications.

Abstract: High-quality reconstruction of Aerosol Optical Depth (AOD) fields is critical for atmospheric monitoring, yet current models remain constrained by the scarcity of complete training data and a lack of uncertainty quantification. To address these limitations, we propose AODDiff, a probabilistic reconstruction framework based on diffusion-based Bayesian inference. By leveraging the learned spatiotemporal probability distribution of the AOD field as a generative prior, this framework can be flexibly adapted to various reconstruction tasks without requiring task-specific retraining. We first introduce a corruption-aware training strategy that learns a spatiotemporal AOD prior solely from naturally incomplete data. Subsequently, we employ a decoupled annealing posterior sampling strategy that enables more effective integration of heterogeneous observations as constraints to guide the generation process. We validate the proposed framework through extensive experiments on reanalysis data. Results across downscaling and inpainting tasks confirm the efficacy and robustness of AODDiff, specifically demonstrating its advantage in maintaining high spatial spectral fidelity. Furthermore, as a generative model, AODDiff inherently enables uncertainty quantification via multiple sampling, offering critical confidence metrics for downstream applications.

[439] Characterization of Transfer Using Multi-task Learning Curves

András Millinghoffer, Bence Bolgár, Péter Antal

Main category: cs.LG

TL;DR: The paper proposes using multi-task learning curves over varying sample sizes to model transfer effects, as an alternative to gradient-based approaches, showing better computational efficiency and broader applicability.

DetailsMotivation: Current approaches to studying transfer effects focus on perturbing models through gradient updates during training. The authors hypothesize that perturbing the dataset by including more samples provides a more fundamental characterization of transfer effects in inductive inference.

Method: Developed quantitative modeling of transfer effects using multi-task learning curves that approximate inductive performance over varying sample sizes. Created an efficient method analogous to Task Affinity Grouping to approximate these curves, comparing statistical and computational approaches to transfer.
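As a sketch of what approximating inductive performance over varying sample sizes can look like, one common parametric choice is a power law err(n) = a·n^(-b) + c fitted to held-out error at a few sample sizes. The numbers below are synthetic, and the paper's actual curve model is not specified here.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
err_single = 2.0 * n ** -0.35 + 0.20 + 0.005 * np.random.default_rng(0).normal(size=n.size)
err_multi  = 2.0 * n ** -0.45 + 0.18 + 0.005 * np.random.default_rng(1).normal(size=n.size)

p_single, _ = curve_fit(power_law, n, err_single, p0=(1.0, 0.5, 0.1), maxfev=10000)
p_multi, _  = curve_fit(power_law, n, err_multi,  p0=(1.0, 0.5, 0.1), maxfev=10000)
print("single-task (a, b, c):", np.round(p_single, 3))
print("multi-task  (a, b, c):", np.round(p_multi, 3))
# a faster decay rate b or lower asymptote c for the multi-task curve
# indicates positive transfer from the auxiliary tasks
```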

Result: Learning curves better capture multi-task learning effects than previous approaches. Multi-task extensions of these curves can delineate pairwise and contextual transfer effects in foundation models. The statistical approach shows considerably higher compute costs but better power and broader applicability than computational methods.

Conclusion: Multi-task learning curves provide a complementary and more fundamental characterization of transfer effects, offering better insights into inductive inference with accumulating data and broader applicability for analyzing foundation models.

Abstract: Transfer effects manifest themselves both during training using a fixed data set and in inductive inference using accumulating data. We hypothesize that perturbing the data set by including more samples, instead of perturbing the model by gradient updates, provides a complementary and more fundamental characterization of transfer effects. To capture this phenomenon, we quantitatively model transfer effects using multi-task learning curves approximating the inductive performance over varying sample sizes. We describe an efficient method to approximate multi-task learning curves analogous to the Task Affinity Grouping method applied during training. We compare the statistical and computational approaches to transfer, which indicates considerably higher compute costs for the former but better power and broader applicability. Evaluations are performed using a benchmark drug-target interaction data set. Our results show that learning curves can better capture the effects of multi-task learning and their multi-task extensions can delineate pairwise and contextual transfer effects in foundation models.

[440] PRISM: A hierarchical multiscale approach for time series forecasting

Zihao Chen, Alexandre Andre, Wenrui Ma, Ian Knight, Sergey Shuvaev, Eva Dyer

Main category: cs.LG

TL;DR: PRISM introduces a hierarchical tree-based partitioning method for time series forecasting that captures multi-scale features through time-frequency analysis, outperforming state-of-the-art methods.

DetailsMotivation: Real-world time series contain complex multi-scale features including global trends, local fine-grained structure, and intermediate scales, making accurate forecasting challenging. Existing methods struggle to capture this hierarchical complexity effectively.

Method: PRISM uses a learnable tree-based partitioning where the root captures global trends and recursive splits reveal increasingly localized views. At each level, data is projected onto time-frequency bases (wavelets or exponential moving averages) to extract scale-specific features, which are aggregated across the hierarchy.
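A crude illustration of the hierarchy follows, with fixed halving instead of PRISM's learnable splits and a last-value exponential-moving-average summary instead of its full time-frequency features.

```python
import numpy as np

def ema(x, alpha):
    """Exponential moving average of a 1-D signal."""
    out = np.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

def tree_features(x, depth=2, alpha=0.3):
    """Recursively halve the signal; each node contributes one scale-specific
    summary (here the final EMA value), aggregated across the hierarchy."""
    feats = [ema(x, alpha)[-1]]
    if depth > 0 and len(x) >= 4:
        mid = len(x) // 2
        feats += tree_features(x[:mid], depth - 1, alpha)   # local left view
        feats += tree_features(x[mid:], depth - 1, alpha)   # local right view
    return feats

x = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * np.random.default_rng(0).normal(size=64)
print(np.round(tree_features(x), 3))   # 1 + 2 + 4 = 7 hierarchical features
```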

Result: Experiments across benchmark datasets show that PRISM outperforms state-of-the-art forecasting methods, demonstrating superior performance in capturing both global structure and local dynamics.

Conclusion: The hierarchical tree-based approach provides a lightweight and flexible framework for multivariate time series forecasting that effectively handles multi-scale temporal patterns.

Abstract: Forecasting is critical in areas such as finance, biology, and healthcare. Despite the progress in the field, making accurate forecasts remains challenging because real-world time series contain both global trends, local fine-grained structure, and features on multiple scales in between. Here, we present a new forecasting method, PRISM (Partitioned Representation for Iterative Sequence Modeling), that addresses this challenge through a learnable tree-based partitioning of the signal. At the root of the tree, a global representation captures coarse trends in the signal, while recursive splits reveal increasingly localized views of the signal. At each level of the tree, data are projected onto a time-frequency basis (e.g., wavelets or exponential moving averages) to extract scale-specific features, which are then aggregated across the hierarchy. This design allows the model to jointly capture global structure and local dynamics of the signal, enabling accurate forecasting. Experiments across benchmark datasets show that our method outperforms state-of-the-art methods for forecasting. Overall, these results demonstrate that our hierarchical approach provides a lightweight and flexible framework for forecasting multivariate time series. The code is available at https://github.com/nerdslab/prism.

[441] Spectral Graph Neural Networks for Cognitive Task Classification in fMRI Connectomes

Debasis Maji, Arghya Banerjee, Debaditya Barman

Main category: cs.LG

TL;DR: SpectralBrainGNN uses graph Fourier transforms on brain connectomes to classify cognitive tasks with 96.25% accuracy on HCP-Task dataset.

DetailsMotivation: Current brain state decoding methods often miss complex topological dependencies and multi-scale interactions in functional connectivity patterns. There's a need for better integration of machine learning with brain network analysis to extract more interpretable representations from fMRI data.

Method: Proposed SpectralBrainGNN model uses spectral convolution framework based on graph Fourier transforms computed via normalized Laplacian eigendecomposition. Models brain regions as nodes and functional connections as edges in a graph neural network architecture.

Result: Achieved 96.25% classification accuracy on the Human Connectome Project-Task (HCP-Task) dataset, demonstrating superior performance in cognitive task classification from fMRI connectomes.

Conclusion: SpectralBrainGNN effectively captures topological dependencies in brain networks for cognitive task classification, outperforming conventional approaches. The publicly available implementation supports reproducibility and future research in neuroimaging and machine learning.

Abstract: Cognitive task classification using machine learning plays a central role in decoding brain states from neuroimaging data. By integrating machine learning with brain network analysis, complex connectivity patterns can be extracted from functional magnetic resonance imaging connectomes. This process transforms raw blood-oxygen-level-dependent (BOLD) signals into interpretable representations of cognitive processes. Graph neural networks (GNNs) further advance this paradigm by modeling brain regions as nodes and functional connections as edges, capturing topological dependencies and multi-scale interactions that are often missed by conventional approaches. We propose SpectralBrainGNN, a spectral convolution framework based on graph Fourier transforms (GFT) computed via normalized Laplacian eigendecomposition. Experiments on the Human Connectome Project-Task (HCP-Task) dataset demonstrate the effectiveness of the proposed approach, achieving a classification accuracy of 96.25%. The implementation is publicly available at https://github.com/gnnplayground/SpectralBrainGNN to support reproducibility and future research.
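
For readers unfamiliar with spectral graph convolutions, a minimal self-contained example of the underlying operation: graph Fourier transform via normalized-Laplacian eigendecomposition, filtering in the spectral domain, and transforming back. The ring graph and low-pass filter are stand-ins, not the paper's architecture.

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):                      # ring graph standing in for a connectome
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized graph Laplacian
lam, U = np.linalg.eigh(L)                     # eigendecomposition: GFT basis

X = np.random.default_rng(0).standard_normal((n, 3))  # node features (e.g. BOLD)
X_hat = U.T @ X                                # graph Fourier transform
Y = U @ (np.exp(-2.0 * lam)[:, None] * X_hat)  # low-pass spectral filter + inverse
print(Y.shape)                                 # (8, 3) filtered node features
```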

[442] Frequent subgraph-based persistent homology for graph classification

Xinyang Chen, Amaël Broustet, Guoting Chen

Main category: cs.LG

TL;DR: Proposes Frequent Subgraph Filtration (FSF) for persistent homology on graphs, generating frequency-based persistent homology features that integrate with ML models and GNNs for improved graph classification.

DetailsMotivation: Current persistent homology methods on graphs use limited filtrations (degree/weight-based) that miss richer features like recurring patterns across datasets, restricting expressive power.

Method: Introduces Frequent Subgraph Filtration (FSF) derived from frequent subgraphs, producing stable frequency-based persistent homology features. Also proposes FPH-ML (machine learning model) and FPH-GNNs (hybrid framework integrating FPH with graph neural networks).

Result: FPH-ML achieves competitive/superior accuracy vs kernel/degree-based methods. FPH-GNNs yield 0.4-21% relative performance gains, with up to 8.2 percentage point improvements over GCN/GIN backbones across benchmarks.

Conclusion: FSF bridges frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction that enhances graph representation learning.

Abstract: Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.
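
A simplified illustration, not the paper's FSF construction: filtration values here come from hypothetical pattern frequencies attached to single edges (rarer patterns enter later), and only 0-dimensional persistence (component births and deaths) is computed with a union-find.

```python
import numpy as np

def h0_persistence(n_vertices, edges, filtration):
    """0-dimensional persistence of a graph filtration via union-find:
    all vertices are born at 0; a component dies when an edge merges it."""
    parent = list(range(n_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    bars = []
    for (u, v), f in sorted(zip(edges, filtration), key=lambda p: p[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            bars.append((0.0, float(f)))   # birth 0, death at filtration value
    return bars

# Toy frequency-based filtration: more frequent "patterns" (here, single edges
# with hypothetical mined frequencies) enter the filtration earlier.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
freq = np.array([5, 4, 4, 2, 1])
print(h0_persistence(5, edges, 1.0 / freq))
```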

[443] MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control

Yongwei Zhang, Yuanzhe Xing, Quan Quan, Zhikun She

Main category: cs.LG

TL;DR: MSACL integrates exponential stability theory with maximum entropy RL using multi-step Lyapunov certificate learning to achieve provably stable model-free RL with simple rewards.

DetailsMotivation: Achieving provable stability in model-free RL is challenging, especially in balancing exploration with rigorous safety. Existing methods often require complex reward engineering and lack theoretical stability guarantees.

Method: MSACL integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. It uses off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions, introduces Exponential Stability Labels (ESL) and λ-weighted aggregation for bias-variance trade-off, and guides policy optimization with stability-aware advantage functions.

Result: MSACL demonstrates superiority over state-of-the-art Lyapunov-based RL algorithms across six benchmarks (stabilization and nonlinear tracking tasks). It achieves exponential stability and rapid convergence under simple rewards, exhibits significant robustness to uncertainties, and generalizes to unseen trajectories. Sensitivity analysis establishes n=20 as robust default multi-step horizon.

Conclusion: MSACL provides a foundation for verifiably safe learning-based control by linking Lyapunov theory with off-policy actor-critic frameworks, enabling provable stability without complex reward engineering.

Abstract: Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $λ$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
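
The λ-weighted aggregation can be pictured as blending n-step targets, TD(λ)-style. The sketch below is one plausible reading under an assumed exponential-decay condition V(x_{t+n}) ≤ (1-β)^n V(x_t); the paper's Exponential Stability Labels and exact conditions differ.

```python
import numpy as np

def lambda_aggregate(V, beta=0.1, lam=0.9):
    """Blend n-step targets for a Lyapunov certificate along one trajectory:
    decay-normalized n-step values are mixed with geometric weights
    lam^(n-1), trading bias (short horizons) against variance (long ones)."""
    T = len(V) - 1
    targets = []
    for t in range(T):
        num = den = 0.0
        for n in range(1, T - t + 1):
            w = lam ** (n - 1)
            num += w * V[t + n] / (1.0 - beta) ** n
            den += w
        targets.append(num / den)          # regression target for V(x_t)
    return np.array(targets)

V = np.array([1.0, 0.85, 0.74, 0.66, 0.60])  # certificate values along a rollout
print(lambda_aggregate(V).round(3))
```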

[444] Semi-overlapping Multi-bandit Best Arm Identification for Sequential Support Network Learning

András Antos, András Millinghoffer, Péter Antal

Main category: cs.LG

TL;DR: A new framework called Sequential Support Network Learning (SSNL) uses a semi-overlapping multi-bandit model (SOMMAB) to efficiently learn optimal support networks from sparse candidate lists through shared evaluations.

DetailsMotivation: Many AI/ML problems require evaluating partners' contributions through asymmetric, computationally intensive processes while selecting the most beneficial candidates. Sequential approaches to these problems need unification and efficiency improvements.

Method: Proposes Sequential Support Network Learning (SSNL) framework using a semi-overlapping multi-bandit (SOMMAB) model where single evaluations provide distinct feedback to multiple bandits due to structural overlap. Develops a generalized GapE algorithm for SOMMABs with new exponential error bounds.

Result: Derives new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification. Bounds scale linearly with overlap degree, showing significant sample-complexity gains from shared evaluations.

Conclusion: Provides theoretical foundation and improved performance guarantees for sequential learning tools to identify support networks from sparse candidates in multi-task learning, auxiliary task learning, federated learning, and multi-agent systems.

Abstract: Many modern AI and ML problems require evaluating partners’ contributions through shared yet asymmetric, computationally intensive processes and the simultaneous selection of the most beneficial candidates. Sequential approaches to these problems can be unified under a new framework, Sequential Support Network Learning (SSNL), in which the goal is to select the most beneficial candidate set of partners for all participants using trials; that is, to learn a directed graph that represents the highest-performing contributions. We demonstrate that a new pure-exploration model, the semi-overlapping multi-(multi-armed) bandit (SOMMAB), in which a single evaluation provides distinct feedback to multiple bandits due to structural overlap among their arms, can be used to learn a support network from sparse candidate lists efficiently. We develop a generalized GapE algorithm for SOMMABs and derive new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification. The bounds scale linearly with the degree of overlap, revealing significant sample-complexity gains arising from shared evaluations. From an application point of view, this work provides a theoretical foundation and improved performance guarantees for sequential learning tools for identifying support networks from sparse candidates in multiple learning problems, such as multi-task learning (MTL), auxiliary task learning (ATL), federated learning (FL), and multi-agent systems (MAS).
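
A rough sketch of the setting, not the paper's generalized algorithm: a GapE-style index (small empirical gap plus an exploration bonus) drives pulls, and a pull of a shared arm is scored across every bandit containing it. The arm sets, index constant a, and tie-breaking are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two bandits sharing arm 0: a single pull of a shared arm yields feedback
# usable by every bandit that contains it (the semi-overlapping structure).
bandits = [[0, 1, 2], [0, 3]]
true_mu = {0: 0.5, 1: 0.6, 2: 0.3, 3: 0.4}
sums = dict.fromkeys(true_mu, 0.0)
cnt = dict.fromkeys(true_mu, 0)

def gape_indices(arms, a=4.0):
    """GapE-style score: a small empirical gap or few pulls => explore more."""
    mu = {k: sums[k] / max(cnt[k], 1) for k in arms}
    scores = {}
    for k in arms:
        rival = max(mu[j] for j in arms if j != k)
        scores[k] = -abs(mu[k] - rival) + np.sqrt(a / max(cnt[k], 1))
    return scores

for _ in range(3000):
    urgency = {}
    for arms in bandits:                         # an arm's urgency is its best
        for k, s in gape_indices(arms).items():  # score across its bandits
            urgency[k] = max(s, urgency.get(k, -np.inf))
    k = max(urgency, key=urgency.get)            # one pull, shared feedback
    sums[k] += rng.normal(true_mu[k], 0.5)
    cnt[k] += 1

best = [max(arms, key=lambda j: sums[j] / max(cnt[j], 1)) for arms in bandits]
print("recommended best arm per bandit:", best)  # likely [1, 0]
```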

[445] Attribution-Guided Distillation of Matryoshka Sparse Autoencoders

Cristina P. Martin-Linares, Jonathan P. Ling

Main category: cs.LG

TL;DR: DMSAEs distill a compact core of consistently useful features from SAEs and reuse it to train new SAEs, improving feature consistency and transferability across sparsity levels.

DetailsMotivation: Sparse autoencoders often produce redundant features that vary across training runs and sparsity levels, making interpretations difficult to transfer and reuse.

Method: Iterative distillation cycle: train Matryoshka SAE with shared core, use gradient × activation to measure feature contributions, keep smallest subset explaining a fixed fraction of attribution, transfer only core encoder weights across cycles.

Result: On Gemma-2-2B, seven distillation cycles yielded a distilled core of 197 consistently selected features, improving SAEBench metrics and enabling feature transfer across sparsity levels.

Conclusion: DMSAEs demonstrate that consistent sets of latent features can be distilled and transferred across sparsity levels, addressing SAE feature inconsistency problems.

Abstract: Sparse autoencoders (SAEs) aim to disentangle model activations into monosemantic, human-interpretable features. In practice, learned features are often redundant and vary across training runs and sparsity levels, which makes interpretations difficult to transfer and reuse. We introduce Distilled Matryoshka Sparse Autoencoders (DMSAEs), a training pipeline that distills a compact core of consistently useful features and reuses it to train new SAEs. DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient × activation to measure each feature’s contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution. Only the core encoder weight vectors are transferred across cycles; the core decoder and all non-core latents are reinitialized each time. On Gemma-2-2B layer 12 residual stream activations, seven cycles of distillation (500M tokens, 65k width) yielded a distilled core of 197 features that were repeatedly selected. Training using this distilled core improves several SAEBench metrics and demonstrates that consistent sets of latent features can be transferred across sparsity levels.
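
A minimal sketch of the attribution-guided selection step, assuming per-feature activations and gradients are already in hand; `distill_core` is a hypothetical helper, and real DMSAEs compute attribution against next-token loss in the most nested reconstruction.

```python
import numpy as np

def distill_core(acts, grads, frac=0.9):
    """Score each latent by mean |gradient x activation| and keep the smallest
    subset of features explaining `frac` of the total attribution."""
    attr = np.abs(grads * acts).mean(axis=0)
    order = np.argsort(attr)[::-1]
    cum = np.cumsum(attr[order])
    k = int(np.searchsorted(cum, frac * cum[-1])) + 1
    return order[:k]

rng = np.random.default_rng(0)
acts = rng.standard_normal((4096, 512)) * (rng.random(512) < 0.1)  # sparse latents
grads = rng.standard_normal((4096, 512))     # d(loss)/d(activation), per token
core = distill_core(acts, grads)
print(f"{core.size} of 512 latents form the distilled core")
```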

[446] Efficiently Estimating Data Efficiency for Language Model Fine-tuning

Gyung Hyun Je, Colin Raffel

Main category: cs.LG

TL;DR: The paper proposes a method to predict data efficiency (how many fine-tuning examples are needed) for LLM tasks using gradient cosine similarity of low-confidence examples, eliminating costly incremental annotation cycles.

DetailsMotivation: LLMs often need fine-tuning for specialized tasks, but it's unknown how many examples are needed for desired performance, leading to expensive cycles of incremental annotation and retraining. The authors show that performant LLMs may struggle zero-shot but improve with fine-tuning, creating a need to predict data efficiency without incremental annotation.

Method: Proposes using gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. Introduces a concrete metric to quantify a task’s data efficiency.

Result: Validated on 30 specialized tasks with varying data efficiencies, achieving 8.6% error in overall data efficiency prediction. Typically eliminates hundreds of unnecessary annotations per task.

Conclusion: The proposed method effectively predicts task data efficiency using minimal labeled samples, reducing annotation costs and eliminating wasteful incremental annotation cycles in LLM fine-tuning.

Abstract: While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task’s data efficiency–i.e., the number of fine-tuning examples needed to achieve a desired level of performance–is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task’s data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task’s data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. We validate our approach on a diverse set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and typically eliminating hundreds of unnecessary annotations on each task. Our experiment results and implementation code are available on GitHub.
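
A sketch of the proposed predictor's main ingredient under simplifying assumptions: per-example gradient vectors and model confidences are given as arrays, and the mapping from alignment to a data-efficiency estimate (the paper's contribution) is not reproduced here.

```python
import numpy as np

def gradient_alignment(grads, confidences, q=0.25):
    """Mean pairwise cosine similarity among gradients of the low-confidence
    examples; highly aligned gradients suggest few labels suffice, while
    conflicting gradients suggest many more annotations are needed."""
    low = grads[confidences <= np.quantile(confidences, q)]
    G = low / np.linalg.norm(low, axis=1, keepdims=True)
    S = G @ G.T
    n = len(G)
    return (S.sum() - n) / (n * (n - 1))     # mean of off-diagonal entries

rng = np.random.default_rng(0)
grads = rng.standard_normal((200, 64)) + 0.5  # shared component -> alignment
confidences = rng.random(200)
print(round(gradient_alignment(grads, confidences), 3))
```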

[447] Diffusion Language Models are Provably Optimal Parallel Samplers

Haozhe Jiang, Nika Haghtalab, Lijie Chen

Main category: cs.LG

TL;DR: DLMs with chain-of-thought can simulate parallel sampling algorithms with optimal sequential steps, but need remasking/revision for optimal space complexity.

DetailsMotivation: To provide rigorous theoretical foundation for diffusion language models as efficient parallel samplers compared to autoregressive models, and to understand their computational advantages and limitations.

Method: Formalize parallel sampling model, analyze DLMs augmented with polynomial-length chain-of-thought, prove simulation capabilities with optimal sequential steps, and investigate space complexity improvements with remasking/revision operations.

Result: DLMs with CoT can simulate any parallel sampling algorithm using optimal number of sequential steps. With remasking/revision + CoT, they achieve optimal space complexity. Revision/remasking DLMs are strictly more expressive than those without.

Conclusion: DLMs are theoretically justified as most efficient parallel samplers, and enabling revision in DLMs is advocated for optimal space efficiency and expressivity.

Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more expressive than those without. Our results not only provide a theoretical justification for the promise of DLMs as the most efficient parallel sampler, but also advocate for enabling revision in DLMs.

[448] ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier

Main category: cs.LG

TL;DR: ResponseRank is a method that learns preference strength from noisy proxy signals (like response times or annotator agreement) by ranking responses locally within strata, improving sample efficiency and robustness in preference learning tasks.

DetailsMotivation: Binary choices in RLHF only convey direction, not strength of preference. Strength is crucial for decision-making and generalization, but hard to measure reliably. Existing proxies like response times and inter-annotator agreement are noisy and confounded.

Method: ResponseRank uses relative differences in proxy signals to rank responses in pairwise comparisons by inferred preference strength. It controls for systemic variation by comparing signals only locally within carefully constructed strata, enabling robust learning of utility differences with minimal assumptions about the strength signal.

Result: Empirical evidence shows improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns).

Conclusion: ResponseRank provides a novel approach to robustly learn preference strength from noisy signals, with the Pearson Distance Correlation (PDC) metric isolating cardinal utility learning from ordinal accuracy, advancing preference modeling beyond simple binary choices.

Abstract: Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
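
A toy illustration of the stratification idea, assuming response time as the proxy signal and annotator identity as the stratum; the paper's strata are constructed more carefully, and the resulting ranking would supervise a reward model rather than be printed.

```python
from collections import defaultdict

# Comparisons: (annotator, chosen, rejected, response time in seconds).
comparisons = [
    ("ann_a", "r1", "r2", 3.1), ("ann_a", "r3", "r4", 9.0),
    ("ann_b", "r5", "r6", 1.2), ("ann_b", "r7", "r8", 2.5),
]

# Stratify by annotator so only locally valid signal differences are compared;
# within a stratum, a faster choice is read as a stronger preference.
strata = defaultdict(list)
for annotator, chosen, rejected, rt in comparisons:
    strata[annotator].append((rt, chosen, rejected))

for annotator, items in strata.items():
    ranked = [(c, r) for _, c, r in sorted(items)]
    print(annotator, "strongest preference first:", ranked)
```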

[449] Generative Classifiers Avoid Shortcut Solutions

Alexander C. Li, Ananya Kumar, Deepak Pathak

Main category: cs.LG

TL;DR: Generative classifiers using class-conditional generative models outperform discriminative ones on distribution shift by modeling all features instead of relying on spurious correlations.

DetailsMotivation: Discriminative classifiers often learn shortcuts and spurious correlations that fail under distribution shift, while generative classifiers can avoid this by modeling all features comprehensively.

Method: Use class-conditional generative models (diffusion-based and autoregressive) as classifiers, which are simple to train without specialized augmentations, strong regularization, or prior knowledge of spurious correlations.

Result: Achieve state-of-the-art performance on five standard image and text distribution shift benchmarks, reduce spurious correlation impact in medical/satellite datasets, and provide theoretical analysis in Gaussian toy setting.

Conclusion: Generative classifiers offer robust performance under distribution shift by avoiding overreliance on spurious features, with analysis revealing conditions where they outperform discriminative approaches.

Abstract: Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.
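
A minimal generative classifier in the Gaussian toy setting the paper analyzes: fit class-conditional densities p(x|y) over all features and classify by Bayes rule; in the paper's applications, diffusion or autoregressive models take the place of the Gaussian density.

```python
import numpy as np

def fit_generative(X, y):
    """Class-conditional Gaussians: model all features via p(x|y) per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(0), Xc.var(0) + 1e-6, len(Xc) / len(X))
    return params

def predict(params, X):
    """Classify via Bayes rule: argmax_y log p(x|y) + log p(y)."""
    scores = []
    for mu, var, prior in params.values():
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(1)
        scores.append(ll + np.log(prior))
    keys = list(params)
    return np.array([keys[i] for i in np.argmax(scores, axis=0)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1.5, 1, (100, 5))])
y = np.repeat([0, 1], 100)
print((predict(fit_generative(X, y), X) == y).mean())  # training accuracy
```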

[450] On the geometry and topology of representations: the manifolds of modular addition

Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, Vincent Létourneau, Colin Daniels, Doina Precup, Jonathan Love

Main category: cs.LG

TL;DR: Both uniform and learnable attention architectures implement the same modular addition algorithm with topologically equivalent representations, contrary to previous Clock vs Pizza interpretations.

DetailsMotivation: To challenge the Clock and Pizza interpretations which argued that different architectural designs yield distinct circuits for modular addition, and to show that both architectures actually implement the same algorithm.

Method: Instead of interpreting individual neurons, identify all neurons corresponding to each learned representation and study the collective group as one entity using topological tools to analyze learned representations as manifolds.

Result: Statistical analysis across hundreds of circuits reveals that both uniform attention and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations.

Conclusion: The Clock and Pizza interpretations are incorrect: different architectural designs do not yield distinct circuits for modular addition; both architectures implement the same underlying algorithm with equivalent representations.

Abstract: The Clock and Pizza interpretations, associated with architectures differing in either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition. In this work, we show that this is not the case, and that both uniform attention and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond the interpretation of individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity. This method reveals that each learned representation is a manifold that we can study utilizing tools from topology. Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.

[451] Maxwell’s Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Simon Dufort-Labbé, Pierluca D’Oro, Evgenii Nikishin, Razvan Pascanu, Pierre-Luc Bacon, Aristide Baratin

Main category: cs.LG

TL;DR: DemP uses controlled dying neurons via noise injection and one-cycle scheduling to achieve better structured pruning with faster training.

DetailsMotivation: Traditional view sees dying neurons (inactive/saturated units) as harmful, but this paper explores them as a resource for efficient model compression and optimization.

Method: Demon Pruning (DemP) controls dead neuron proliferation through noise injection on active units and one-cycle schedule regularization, dynamically creating network sparsity.

Result: Outperforms existing dense-to-sparse structured pruning methods on CIFAR-10 and ImageNet, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56×.

Conclusion: Dying neurons can be leveraged as a resource for efficient model compression, providing a novel perspective on this traditionally problematic phenomenon.

Abstract: When training neural networks, dying neurons – units becoming inactive or saturated – are traditionally seen as harmful. This paper sheds new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56$\times$. These findings provide a novel perspective on dying neurons as a resource for efficient model compression and optimization.
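
A toy sketch of the mechanics only, with weights scaled small and biases shifted so that a fraction of units saturate: Gaussian noise is injected on active pre-activations, then units that (almost) never fire are removed as whole rows. DemP additionally couples this with a one-cycle schedule, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.2 * rng.standard_normal((64, 16))       # small weights: units near zero
b1 = rng.standard_normal(64) - 1.0             # shifted biases: some saturate
X = rng.standard_normal((512, 16))

def forward(X, W1, b1, sigma=0.5):
    pre = X @ W1.T + b1
    pre = pre + sigma * (pre > 0) * rng.standard_normal(pre.shape)  # noise on
    return np.maximum(pre, 0.0)                                     # active units

H = forward(X, W1, b1)
dead = (H > 0).mean(axis=0) < 0.01             # units that (almost) never fire
W1_pruned, b1_pruned = W1[~dead], b1[~dead]    # structured prune: drop rows
print(f"pruned {int(dead.sum())} of {len(dead)} units -> {W1_pruned.shape}")
```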

[452] Transfer learning of state-based potential games for process optimization in decentralized manufacturing systems

Steve Yuwono, Dorothea Schwung, Andreas Schwung

Main category: cs.LG

TL;DR: Online transfer learning approach for state-based potential games in manufacturing systems, enabling knowledge sharing among similar players to improve learning efficiency and outcomes.

DetailsMotivation: Address practical industrial scenarios where knowledge sharing among similar players can enhance learning in large-scale, decentralized manufacturing systems, improving production efficiency and reducing power consumption.

Method: Develop TL-SbPGs (Transfer Learning in State-based Potential Games) with transfer learning concepts and similarity criteria for players, offering two settings: predefined similarities and dynamically inferred similarities during training. Includes method to optimize timing and weighting of knowledge transfer.

Result: Experimental results from laboratory-scale testbed show TL-SbPGs improve production efficiency and reduce power consumption compared to vanilla SbPGs.

Conclusion: TL-SbPGs provide an effective framework for distributed self-optimization in manufacturing systems by enabling knowledge transfer among similar players, accelerating convergence and improving overall system performance.

Abstract: This paper presents a novel online transfer learning approach in state-based potential games (TL-SbPGs) for distributed self-optimization in manufacturing systems. The approach targets practical industrial scenarios where knowledge sharing among similar players enhances learning in large-scale and decentralized environments. TL-SbPGs enable players to reuse learned policies from others, which improves learning outcomes and accelerates convergence. To accomplish this goal, we develop transfer learning concepts and similarity criteria for players, which offer two distinct settings: (a) predefined similarities between players and (b) dynamically inferred similarities between players during training. The applicability of the SbPG framework to transfer learning is formally established. Furthermore, we present a method to optimize the timing and weighting of knowledge transfer. Experimental results from a laboratory-scale testbed show that TL-SbPGs improve production efficiency and reduce power consumption compared to vanilla SbPGs.
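
A schematic of similarity-gated policy transfer under assumed interfaces (vector-valued policies, a precomputed similarity matrix, scalar performance scores); the blending weight and threshold are hypothetical stand-ins for the paper's optimized timing and weighting.

```python
import numpy as np

def transfer_step(policies, perf, sim, tau=0.8, w=0.3):
    """Each player blends in the policy of its most similar better-performing
    peer; below the similarity threshold tau, no transfer occurs (vanilla SbPG)."""
    out = []
    for i, pi in enumerate(policies):
        donors = [j for j in range(len(policies))
                  if j != i and sim[i, j] >= tau and perf[j] > perf[i]]
        if donors:
            j = max(donors, key=lambda d: perf[d])
            pi = (1.0 - w) * pi + w * policies[j]   # reuse learned knowledge
        out.append(pi)
    return out

rng = np.random.default_rng(0)
policies = [rng.random(10) for _ in range(3)]   # e.g. discretized action maps
perf = np.array([0.4, 0.9, 0.5])                # current performance per player
sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.3],
                [0.2, 0.3, 1.0]])               # predefined player similarities
print(transfer_step(policies, perf, sim)[0][:3])
```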

[453] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao

Main category: cs.LG

TL;DR: This paper provides a comprehensive survey of model merging techniques, proposing a new taxonomy, discussing applications across various domains, and identifying future research directions.

DetailsMotivation: Model merging is an efficient technique that doesn't require raw training data or expensive computation, but there's a significant gap in systematic literature review of these techniques despite their increasing prevalence.

Method: The survey proposes a new taxonomic approach to exhaustively discuss existing model merging methods, then examines applications across large language models, multimodal LLMs, and over ten machine learning subfields including continual learning, multi-task learning, and few-shot learning.

Result: The paper provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and maintains a curated list of papers on model merging available at a GitHub repository.

Conclusion: The survey highlights remaining challenges in model merging and discusses future research directions, serving as a comprehensive resource for researchers and practitioners interested in model merging techniques.

Abstract: Model merging is an efficient technique in the machine learning community that requires neither the collection of raw training data nor expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and more than ten machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications.
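
As a concrete anchor for readers new to the topic, the simplest merging operator covered by such surveys, uniform weight averaging of models fine-tuned from a shared base, needs only a few lines (a sketch with hypothetical state dicts, not any specific method from the survey):

```python
import numpy as np

def merge_uniform(state_dicts):
    """Element-wise average of parameters from models fine-tuned from the
    same base: no raw training data, no retraining."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

rng = np.random.default_rng(0)
base = {"w": rng.standard_normal((4, 4)), "b": np.zeros(4)}
expert_a = {k: v + 0.1 * rng.standard_normal(v.shape) for k, v in base.items()}
expert_b = {k: v + 0.1 * rng.standard_normal(v.shape) for k, v in base.items()}
merged = merge_uniform([expert_a, expert_b])
print(merged["w"].shape, merged["b"].shape)
```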

[454] A Systematic Survey on Large Language Models for Algorithm Design

Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Kun Mao, Zhichao Lu, Zhenkun Wang, Mingxuan Yuan, Qingfu Zhang

Main category: cs.LG

TL;DR: Systematic review of LLMs in algorithm design, categorizing their roles as optimizers, predictors, extractors, and designers across the algorithm design pipeline and diverse applications.

DetailsMotivation: Despite rapid progress in using LLMs for algorithm design across domains like combinatorial optimization and scientific discovery, there's no comprehensive systematic review. Existing surveys are either too narrow or have different objectives, hindering holistic understanding of the field.

Method: The paper provides a systematic review with a taxonomy categorizing LLM roles (optimizers, predictors, extractors, designers). It synthesizes literature across three phases of the algorithm design pipeline and diverse algorithmic applications.

Result: The review analyzes progress, advantages, and limitations of LLMs in each role category, mapping the current landscape of algorithm design with LLMs across different application domains.

Conclusion: The paper outlines key open challenges and opportunities to guide future research in algorithm design with LLMs, providing a comprehensive framework for understanding this rapidly evolving field.

Abstract: Algorithm design is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and promising solutions. In just a few years, this integration has yielded remarkable progress in areas ranging from combinatorial optimization to scientific discovery. Despite this rapid expansion, a holistic understanding of the field is hindered by the lack of a systematic review, as existing surveys either remain limited to narrow sub-fields or with different objectives. This paper seeks to provide a systematic review of algorithm design with LLMs. We introduce a taxonomy that categorises the roles of LLMs as optimizers, predictors, extractors and designers, analyzing the progress, advantages, and limitations within each category. We further synthesize literature across the three phases of the algorithm design pipeline and across diverse algorithmic applications that define the current landscape. Finally, we outline key open challenges and opportunities to guide future research.

[455] SoundnessBench: A Soundness Benchmark for Neural Network Verifiers

Xingjian Zhou, Keyi Shen, Andy Xu, Hongji Xu, Cho-Jui Hsieh, Huan Zhang, Zhouxing Shi

Main category: cs.LG

TL;DR: SoundnessBench: A new benchmark for testing soundness of neural network verifiers with deliberately inserted hidden counterexamples to identify false verification claims.

DetailsMotivation: Existing NN verification benchmarks lack ground-truth for hard instances where no current verifier can verify properties and no counterexample can be found, making it difficult to validate verifier soundness on challenging cases.

Method: Developed SoundnessBench with instances containing deliberately inserted counterexamples hidden from adversarial attacks. Designed training method to produce NNs with hidden counterexamples across various model architectures, activation functions, and input data.

Result: Training effectively produces hidden counterexamples, and SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers.

Conclusion: SoundnessBench provides a valuable benchmark for testing the soundness of NN verifiers, addressing a critical gap in existing verification benchmarks and helping ensure reliability in safety-critical applications.

Abstract: Neural network (NN) verification aims to formally verify properties of NNs, which is crucial for ensuring the behavior of NN-based models in safety-critical applications. In recent years, the community has developed many NN verifiers and benchmarks to evaluate them. However, existing benchmarks typically lack ground-truth for hard instances where no current verifier can verify the property and no counterexample can be found. This makes it difficult to validate the soundness of a verifier, when it claims verification on such challenging instances that no other verifier can handle. In this work, we develop a new benchmark for NN verification, named SoundnessBench, specifically for testing the soundness of NN verifiers. SoundnessBench consists of instances with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples. Thereby, it can identify false verification claims when hidden counterexamples are known to exist. We design a training method to produce NNs with hidden counterexamples and systematically construct our SoundnessBench with instances across various model architectures, activation functions, and input data. We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers. Our code is available at https://github.com/mvp-harry/SoundnessBench and our dataset is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench.
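
The benchmark's core check can be phrased in a few lines, assuming a verifier callable and instances packaged with their hidden counterexamples (all names hypothetical):

```python
def soundness_check(verify, instances):
    """Flag a soundness bug whenever a verifier claims 'verified' on an
    instance known to contain a (hidden) counterexample."""
    return [(net, spec) for net, spec, hidden_cex in instances
            if hidden_cex is not None and verify(net, spec) == "verified"]

def buggy_verifier(net, spec):       # deliberately unsound stand-in verifier
    return "verified"

instances = [("net_0", "robust within eps around x0", [0.2, -0.1])]
print(soundness_check(buggy_verifier, instances))   # -> the flagged instance
```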

[456] ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

Main category: cs.LG

TL;DR: ParetoHqD improves multiobjective LLM alignment by representing preferences as directions in objective space and using Pareto front data for two-stage fine-tuning.

DetailsMotivation: Current offline multiobjective alignment algorithms like Rewards-in-Context suffer from inappropriate preference representations and imbalanced reward scores, limiting their ability to adequately serve diverse user needs.

Method: ParetoHqD represents human preferences as preference directions in objective space and treats data near the Pareto front as high-quality data. It uses a two-stage supervised fine-tuning process where each stage employs an individual Pareto high-quality training set that best matches its preference direction.

Result: Experimental results show ParetoHqD outperforms five baselines on two multiobjective alignment tasks, demonstrating its superiority.

Conclusion: ParetoHqD effectively addresses limitations of existing multiobjective alignment methods by using preference direction representations and Pareto front data, leading to better alignment with multiple human expectations and values.

Abstract: Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD, which addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as “high-quality” data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. Experimental results demonstrate the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
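
A sketch of the data-selection step under simplifying assumptions: exact non-dominated filtering plus cosine alignment with a preference direction. The paper's broader notion of "near the Pareto front" and its two-stage fine-tuning are not reproduced.

```python
import numpy as np

def pareto_mask(R):
    """True for reward vectors not dominated by any other point."""
    dominated = np.array([np.any(np.all(R >= r, 1) & np.any(R > r, 1)) for r in R])
    return ~dominated

def select_high_quality(R, pref, k=3):
    """Keep Pareto points, then pick those best aligned with the preference
    direction (cosine similarity in the objective space)."""
    front = np.where(pareto_mask(R))[0]
    Rn = R[front] / np.linalg.norm(R[front], axis=1, keepdims=True)
    p = np.asarray(pref, dtype=float)
    p /= np.linalg.norm(p)
    return front[np.argsort(Rn @ p)[::-1][:k]]

rng = np.random.default_rng(0)
R = rng.random((50, 2))                          # two reward scores per response
print(select_high_quality(R, pref=[0.8, 0.2]))   # indices favoring objective 1
```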

[457] Tazza: Shuffling Neural Network Parameters for Secure and Private Federated Learning

Kichang Lee, Jaeho Jin, JaeYeon Park, Songkuk Kim, JeongGil Ko

Main category: cs.LG

TL;DR: Tazza is a secure federated learning framework that simultaneously defends against gradient inversion and model poisoning attacks using weight shuffling and shuffled model validation, achieving robust defense with 6.7x better computational efficiency.

DetailsMotivation: Federated learning preserves data privacy but remains vulnerable to security threats like gradient inversion and model poisoning by malicious clients. Existing solutions address these issues separately, sacrificing either system robustness or model accuracy.

Method: Tazza leverages permutation equivariance and invariance properties of neural networks through weight shuffling and shuffled model validation to enhance resilience against diverse poisoning attacks while ensuring data confidentiality and high model accuracy.

Result: Comprehensive evaluations on various datasets and embedded platforms show Tazza achieves robust defense with up to 6.7x improved computational efficiency compared to alternative schemes, without compromising performance.

Conclusion: Tazza provides a secure and efficient federated learning framework that simultaneously addresses both gradient inversion and model poisoning challenges, offering practical deployment advantages on embedded platforms.

Abstract: Federated learning enables decentralized model training without sharing raw data, preserving data privacy. However, its vulnerability to critical security threats, such as gradient inversion and model poisoning by malicious clients, remains unresolved. Existing solutions often address these issues separately, sacrificing either system robustness or model accuracy. This work introduces Tazza, a secure and efficient federated learning framework that simultaneously addresses both challenges. By leveraging the permutation equivariance and invariance properties of neural networks via weight shuffling and shuffled model validation, Tazza enhances resilience against diverse poisoning attacks, while ensuring data confidentiality and high model accuracy. Comprehensive evaluations on various datasets and embedded platforms show that Tazza achieves robust defense with up to 6.7x improved computational efficiency compared to alternative schemes, without compromising performance.
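
The permutation property that weight shuffling relies on is easy to verify directly: permuting the hidden units of an MLP (rows of the first layer, columns of the second) leaves its function unchanged. A minimal demonstration, independent of Tazza's protocol details:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((32, 8)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((4, 32)), rng.standard_normal(4)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

P = rng.permutation(32)          # secret shuffle of the hidden units
x = rng.standard_normal(8)
y_plain = mlp(x, W1, b1, W2, b2)
y_shuffled = mlp(x, W1[P], b1[P], W2[:, P], b2)
print(bool(np.allclose(y_plain, y_shuffled)))   # True: the same function
```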

[458] Lagrangian Index Policy for Restless Bandits with Average Reward

Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah

Main category: cs.LG

TL;DR: The paper studies Lagrangian Index Policy (LIP) for restless multi-armed bandits, showing it outperforms Whittle Index Policy (WIP) in problematic cases, proposes RL algorithms for LIP with lower memory requirements, and provides analytical results for specific applications.

DetailsMotivation: To develop a more robust alternative to Whittle Index Policy for restless multi-armed bandits that maintains good performance even when WIP fails, while also being more memory-efficient for reinforcement learning implementations.

Method: The paper analyzes Lagrangian Index Policy (LIP) theoretically, proposes both tabular and neural network-based reinforcement learning algorithms for LIP in model-free settings, provides analytical calculations for the restart model, and gives a new proof of asymptotic optimality using exchangeability and de Finetti’s theorem.

Result: LIP performs similarly to WIP in most cases but significantly outperforms it when WIP shows poor performance. The proposed RL schemes for LIP require much less memory than analogous WIP schemes. Analytical results are provided for the restart model applicable to web crawling and age of information minimization.

Conclusion: LIP is a superior alternative to WIP for restless multi-armed bandits, offering better robustness in problematic cases, lower memory requirements for RL implementations, and theoretical guarantees including asymptotic optimality for homogeneous arms.

Abstract: We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We analytically calculate the Lagrangian index for the restart model, which applies to optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti’s theorem.
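
Whatever index is used (Whittle or Lagrangian), the resulting policy has the same simple form: activate the M arms with the largest current index. A sketch with hypothetical index values:

```python
import numpy as np

def index_policy(indices, M):
    """Activate the M arms with the largest current index."""
    action = np.zeros(len(indices), dtype=int)
    action[np.argsort(indices)[::-1][:M]] = 1
    return action

# Hypothetical per-arm index values in the arms' current states; LIP and WIP
# differ in how these numbers are computed, not in how they are used.
lagrangian_index = np.array([0.12, 0.48, 0.05, 0.33, 0.27])
print(index_policy(lagrangian_index, M=2))   # activates arms 1 and 3
```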

[459] Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison

Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev

Main category: cs.LG

TL;DR: Open-sci-ref introduces a family of dense transformer models (0.13B to 1.7B parameters) trained on 8 open reference datasets as research baselines, providing standardized reference points for comparing training approaches across scales.

DetailsMotivation: To establish reference baselines that enable researchers to assess the sanity and quality of alternative training approaches across different model scales and datasets, and to facilitate standardized comparison of training procedures.

Method: Trained dense transformer models across multiple parameter scales (0.13B to 1.7B) and token scales (up to 1T) on 8 recent open reference datasets, with intermediate checkpoints, logs, code, and downstream evaluations.

Result: Training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. The established baselines allow training procedures to be compared through their scaling trends on a common compute axis.

Conclusion: Open-sci-ref provides comprehensive reference baselines with intermediate checkpoints and evaluation tools to simplify reproduction, standardize comparison, and facilitate future research in model training approaches.

Abstract: We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluated on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and study of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.

[460] Deep sequence models tend to memorize geometrically; it is unclear why

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

Main category: cs.LG

TL;DR: The paper identifies “geometric memory” as a novel form of fact storage in deep sequence models, where embeddings encode global relationships between all entities (including non-co-occurring ones), transforming complex reasoning into simple navigation tasks.

DetailsMotivation: To understand how deep sequence models store atomic facts beyond simple associative memory/lookup tables, and to explain the emergence of sophisticated geometric embeddings that encode global relationships between entities.

Method: The authors analyze neural embedding geometries, connect findings to Node2Vec, and demonstrate how geometry stems from spectral bias that arises naturally despite lack of typical pressures. They show how this transforms hard reasoning tasks into easy navigation.

Result: Identified geometric memory as a fundamentally different storage mechanism from associative memory, showing it emerges even when more complex than brute-force lookup. Found that geometry stems from natural spectral bias, not typical supervisory/architectural pressures.

Conclusion: Geometric memory represents a powerful alternative to associative lookup, enabling efficient reasoning through embedding navigation. This geometric view should encourage revisiting intuitions about knowledge acquisition, capacity, discovery, and unlearning in neural networks.

Abstract: Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that – in contrast to prevailing theories – indeed arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

[461] Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Main category: cs.LG

TL;DR: GRAPE is a unified framework for positional encoding using group actions, with two main variants: Multiplicative GRAPE (rotations in SO(d)) and Additive GRAPE (additive logits from unipotent actions in GL). It generalizes existing methods like RoPE and ALiBi.

DetailsMotivation: To create a principled, unified framework for positional encoding that brings together different families of mechanisms under a single group-theoretic foundation, extending beyond existing methods like RoPE and ALiBi.

Method: GRAPE uses group actions for positional encoding: (1) Multiplicative GRAPE uses rotations in SO(d) with position acting as G(n)=exp(nωL) with rank-2 skew generator L, (2) Additive GRAPE uses additive logits from unipotent actions in GL. The framework allows for learned commuting subspaces and non-commuting mixtures for cross-subspace feature coupling.

Result: GRAPE recovers RoPE exactly when using canonical coordinate pairs with log-uniform spectrum, and recovers ALiBi and Forgetting Transformer as exact special cases. It provides a principled design space for positional geometry with O(d) and O(rd) computational costs.

Conclusion: GRAPE offers a unified, group-theoretic framework for positional encoding that subsumes existing methods like RoPE and ALiBi, providing a principled design space for long-context models with efficient computational properties.

Abstract: We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,ω\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
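
The Multiplicative GRAPE action is easy to verify numerically: with a rank-2 skew generator L, G(n) = exp(nωL) is a rotation, and attention logits depend only on relative position. A minimal check with a single plane (RoPE stacks d/2 such planes); the dimension and frequency are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

d, omega = 4, 0.5
L = np.zeros((d, d))
L[0, 1], L[1, 0] = -1.0, 1.0     # rank-2 skew generator: one rotation plane

def G(n):
    """Position n acts as exp(n * omega * L), a norm-preserving rotation."""
    return expm(n * omega * L)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Relative law: <G(m) q, G(n) k> depends only on n - m.
s1 = (G(3) @ q) @ (G(5) @ k)
s2 = (G(10) @ q) @ (G(12) @ k)
print(bool(np.isclose(s1, s2)))   # True
```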

[462] Lattice: Learning to Efficiently Compress the Memory

Mahdi Karami, Razvan Pascanu, Vahab Mirrokni

Main category: cs.LG

TL;DR: Lattice is a novel RNN mechanism that compresses KV cache into fixed memory slots using low-rank structure, achieving sub-quadratic complexity with orthogonal updates to minimize interference.

DetailsMotivation: Attention mechanisms suffer from quadratic computational complexity, which limits scalability for long sequences. There's a need for more efficient sequence learning methods that maintain performance while reducing computational overhead.

Method: Formulates cache compression as online optimization problem, uses dynamic memory update rule based on gradient descent, implements orthogonal update where each memory slot only incorporates novel information orthogonal to current state, and employs chunk-wise parallelization for training scalability.

Result: Outperforms strong baselines on language modeling and associative recall tasks across diverse context lengths and model sizes, achieving superior memory efficiency with significantly reduced memory sizes.

Conclusion: Lattice provides an efficient, interpretable RNN mechanism with sub-quadratic complexity that effectively addresses the computational limitations of attention mechanisms while maintaining strong performance on sequence learning tasks.

Abstract: Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state, hence incorporating only novel, non-redundant data to minimize interference with previously stored information. We derive an efficient computation for this orthogonal update rule and further approximate it with chunk-wise parallelization to ensure training scalability. Empirically, Lattice outperforms strong baselines on language modeling and associative recall tasks across diverse context lengths and model sizes, achieving superior memory efficiency with significantly reduced memory sizes.
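
The orthogonal update admits a compact sketch: write only the component of the value that is orthogonal to the selected slot's current content. Slot addressing by key match is an assumption here, not necessarily Lattice's gating:

```python
import numpy as np

def orthogonal_update(M, k, v, lr=1.0):
    """Write value v into the slot whose content best matches key k, keeping
    only the component of v orthogonal to that slot's current state: novel,
    non-redundant information only, minimizing interference."""
    i = int(np.argmax(M @ k))                  # hypothetical slot addressing
    m = M[i]
    v_orth = v - (v @ m) / (m @ m + 1e-9) * m  # strip the redundant part
    M = M.copy()
    M[i] = m + lr * v_orth
    return M, i

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 16))               # fixed number of memory slots
k, v = rng.standard_normal(16), rng.standard_normal(16)
M_new, i = orthogonal_update(M, k, v)
print(bool(np.isclose((M_new[i] - M[i]) @ M[i], 0.0)))  # update ⟂ old content
```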

[463] Probabilistically Tightened Linear Relaxation-based Perturbation Analysis for Neural Network Verification

Luca Marzari, Ferdinando Cicalese, Alessandro Farinelli

Main category: cs.LG

TL;DR: PT-LiRPA combines LiRPA over-approximation with sampling to compute tight neural network reachable sets, improving verification efficiency with probabilistic guarantees.

Motivation: Existing formal verification methods for neural networks using LiRPA-based approaches produce loose bounds that increase computational costs and sometimes fail on challenging verification problems.

Method: PT-LiRPA integrates LiRPA over-approximation techniques with a sampling-based method to estimate tight intermediate reachable sets, which are then used to tighten the linear bounds of neural network outputs with minimal computational overhead.

Result: The PT-LiRPA-based verifier improves certified robustness bounds by up to 3.31X and 2.26X compared to related work, and successfully solves challenging verification problems where state-of-the-art methods fail, with high confidence (≥99%).

Conclusion: PT-LiRPA provides an effective probabilistic approach that significantly tightens verification bounds, reduces computational costs, and offers a valuable solution for previously unsolvable verification problems while maintaining soundness guarantees.

Abstract: We present $\textbf{P}$robabilistically $\textbf{T}$ightened $\textbf{Li}$near $\textbf{R}$elaxation-based $\textbf{P}$erturbation $\textbf{A}$nalysis ($\texttt{PT-LiRPA}$), a novel framework that combines over-approximation techniques from LiRPA-based approaches with a sampling-based method to compute tight intermediate reachable sets. In detail, we show that, with negligible computational overhead, $\texttt{PT-LiRPA}$, by exploiting the estimated reachable sets, significantly tightens the lower and upper linear bounds of a neural network's output, reducing the computational cost of formal verification tools while providing probabilistic guarantees on verification soundness. Extensive experiments on standard formal verification benchmarks, including the International Verification of Neural Networks Competition, show that our $\texttt{PT-LiRPA}$-based verifier improves robustness certificates, i.e., the certified lower bound of $\varepsilon$ perturbation tolerated by the models, by up to 3.31X and 2.26X compared to related work. Importantly, our probabilistic approach results in a valuable solution for challenging competition entries where state-of-the-art formal verification methods fail, allowing us to provide answers with high confidence (i.e., at least 99%).
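
The sampling ingredient can be sketched in a few lines: draw points from the perturbation region, propagate them, and record empirical per-neuron bounds on the intermediate pre-activations. The sketch below assumes a single linear layer; the paper's statistical guarantee on these estimates, and their integration into the linear relaxation, are omitted.

```python
# Empirical estimation of an intermediate reachable set by sampling the
# L_inf ball around an input (toy one-layer example).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)

x0, eps = rng.standard_normal(4), 0.1
samples = x0 + rng.uniform(-eps, eps, size=(10000, 4))
pre = samples @ W1.T + b1                        # pre-activations per sample
lower, upper = pre.min(axis=0), pre.max(axis=0)
# These tighter per-neuron intervals can then replace looser LiRPA
# intermediate bounds when relaxing the next layer's nonlinearities.
print(np.stack([lower, upper], axis=1))
```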

[464] Fast weight programming and linear transformers: from machine learning to neurobiology

Kazuki Irie, Samuel J. Gershman

Main category: cs.LG

TL;DR: This primer reviews Fast Weight Programmers (FWPs) - 2D-state RNNs with dynamic synaptic weights that serve as short-term memory, exploring their computational characteristics and connections to transformers, state space models, and biological synaptic plasticity.

Motivation: The paper aims to review and establish the foundations of Fast Weight Programmers (FWPs), which represent a novel class of recurrent neural networks with 2D matrix-form hidden states. These models offer a biologically-inspired approach where synaptic weights dynamically change over time as short-term memory, bridging artificial and natural intelligence.

Method: The paper presents a primer/review approach that examines the technical foundations of FWPs, analyzing their computational characteristics and establishing connections to other architectures like transformers and state space models. It also explores biological connections to models of synaptic plasticity in the brain.

Result: The review establishes FWPs as a distinct class of neural networks with 2D matrix-form hidden states that can be interpreted as networks with dynamically changing synaptic weights. These fast weights serve as short-term memory storage, with weight modifications controlled by a programmer network whose parameters are trained via gradient descent.

Conclusion: FWPs represent an important architectural innovation in neural networks that bridges artificial and natural intelligence. Their dynamic weight programming mechanism offers connections to both modern transformer architectures and biological models of synaptic plasticity, suggesting convergence between artificial and natural intelligence approaches.

Abstract: Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
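
The core recurrence is compact enough to sketch directly: the 2D state is a fast weight matrix written by an outer product and read with a query, which is exactly the (unnormalized) linear-attention view mentioned above.

```python
# One fast-weight-programming step: write value v at key k, read at q.
import numpy as np

def fwp_step(W, k, v, q):
    W = W + np.outer(v, k)        # program the fast weights (outer product)
    return W, W @ q               # read-out with the query

d = 8
W = np.zeros((d, d))
k, v = np.random.randn(d), np.random.randn(d)
W, y = fwp_step(W, k, v, q=k)
# Querying at the written key recovers v, scaled by ||k||^2:
assert np.allclose(y, v * (k @ k))
```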

[465] Generalising Traffic Forecasting to Regions without Traffic Observations

Xinyu Su, Majid Sarvi, Feng Liu, Egemen Tanin, Jianzhong Qi

Main category: cs.LG

TL;DR: GenCast: A traffic forecasting model for regions without sensors using external knowledge and physics-informed neural networks to compensate for missing observations.

Motivation: Traffic forecasting relies on sensor data, but many regions lack sensors due to high costs. Existing models struggle with generalization to sensor-less regions due to missing historical observations.

Method: GenCast integrates physics-informed neural networks to regularize learning with physical principles, adds external signal learning module to explore correlations with weather and other signals, and includes spatial grouping module to filter localized features that hinder generalization.

Result: Extensive experiments show GenCast consistently reduces forecasting errors on multiple real-world datasets compared to existing approaches.

Conclusion: GenCast effectively addresses traffic forecasting for sensor-less regions by leveraging external knowledge and physical principles, demonstrating improved generalization capabilities.

Abstract: Traffic forecasting is essential for intelligent transportation systems. Accurate forecasting relies on continuous observations collected by traffic sensors. However, due to high deployment and maintenance costs, not all regions are equipped with such sensors. This paper aims to forecast for regions without traffic sensors, where the lack of historical traffic observations challenges the generalisability of existing models. We propose a model named GenCast, the core idea of which is to exploit external knowledge to compensate for the missing observations and to enhance generalisation. We integrate physics-informed neural networks into GenCast, enabling physical principles to regularise the learning process. We introduce an external signal learning module to explore correlations between traffic states and external signals such as weather conditions, further improving model generalisability. Additionally, we design a spatial grouping module to filter localised features that hinder model generalisability. Extensive experiments show that GenCast consistently reduces forecasting errors on multiple real-world datasets.
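
To make the physics-informed ingredient concrete, the sketch below adds a discretized conservation-of-flow residual to a data-fitting loss. The specific law and discretization here are assumptions for illustration only; the summary does not state which physical principles GenCast actually enforces.

```python
# Physics-informed loss sketch: data fit plus a finite-difference
# residual of d(rho)/dt + d(flow)/dx = 0 (an assumed traffic law).
import numpy as np

def physics_informed_loss(pred, target, flow, dt=1.0, dx=1.0, lam=0.1):
    data_loss = np.mean((pred - target) ** 2)
    d_rho_dt = (pred[1:, :-1] - pred[:-1, :-1]) / dt
    d_flow_dx = (flow[:-1, 1:] - flow[:-1, :-1]) / dx
    residual = np.mean((d_rho_dt + d_flow_dx) ** 2)
    return data_loss + lam * residual

T, X = 12, 20                                    # (time, space) grid
pred, target, flow = (np.random.rand(T, X) for _ in range(3))
print(physics_informed_loss(pred, target, flow))
```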

[466] STRelay: A Universal Spatio-Temporal Relaying Framework for Location Prediction over Human Trajectory Data

Bangchao Deng, Lianhua Ji, Chunhua Chen, Xin Jing, Ling Ding, Bingqing QU, Pengyang Wang, Dingqi Yang

Main category: cs.LG

TL;DR: STRelay is a spatiotemporal relaying framework that improves next location prediction by explicitly modeling future spatiotemporal contexts (time and distance) alongside historical trajectory data.

Motivation: Existing location prediction methods focus only on historical trajectory data and overlook future spatiotemporal contexts, which contain valuable information about how much time and distance users will travel - critical clues for predicting future locations.

Method: STRelay models future spatiotemporal contexts in a relaying manner, integrates them with encoded historical representations from base location prediction models, and uses multi-task learning to simultaneously predict next time interval, next moving distance interval, and next location.

Result: When integrated with five state-of-the-art base models on four real-world datasets, STRelay consistently improves prediction performance by 2.49%-11.30%. It’s particularly effective for entertainment-related locations and users who travel longer distances.

Conclusion: STRelay effectively leverages future spatiotemporal contexts to boost location prediction, especially for non-daily-routine activities with higher uncertainty, complementing base models that excel at modeling regular daily patterns.

Abstract: Next location prediction is a critical task in human mobility modeling, enabling applications like travel planning and urban mobility management. Existing methods mainly rely on historical spatiotemporal trajectory data to train sequence models that directly forecast future locations. However, they often overlook the importance of the future spatiotemporal contexts, which are highly informative for the future locations. For example, knowing how much time and distance a user will travel could serve as a critical clue for predicting the user’s next location. Against this background, we propose \textbf{STRelay}, a universal \textbf{\underline{S}}patio\textbf{\underline{T}}emporal \textbf{\underline{Relay}}ing framework explicitly modeling the future spatiotemporal context given a human trajectory, to boost the performance of different location prediction models. Specifically, STRelay models future spatiotemporal contexts in a relaying manner, which is subsequently integrated with the encoded historical representation from a base location prediction model, enabling multi-task learning by simultaneously predicting the next time interval, next moving distance interval, and finally the next location. We evaluate STRelay integrated with five state-of-the-art location prediction base models on four real-world trajectory datasets. Results demonstrate that STRelay consistently improves prediction performance across all cases by 2.49%-11.30%. Additionally, we find that the future spatiotemporal contexts are particularly helpful for entertainment-related locations and also for user groups who prefer traveling longer distances. The performance gain on such non-daily-routine activities, which often suffer from higher uncertainty, is indeed complementary to the base location prediction models that often excel at modeling regular daily routine patterns.
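
The relaying structure can be sketched as three chained heads with hypothetical dimensions: predicted time-interval logits feed the distance head, and both feed the final location head, yielding the three multi-task outputs described above.

```python
# Relayed multi-task heads: time -> distance -> location (toy sizes).
import torch
import torch.nn as nn

class RelayHeads(nn.Module):
    def __init__(self, h_dim=64, n_time=24, n_dist=10, n_loc=500):
        super().__init__()
        self.time_head = nn.Linear(h_dim, n_time)
        self.dist_head = nn.Linear(h_dim + n_time, n_dist)
        self.loc_head = nn.Linear(h_dim + n_time + n_dist, n_loc)

    def forward(self, h):                      # h: encoded trajectory history
        t = self.time_head(h)                  # next time-interval logits
        d = self.dist_head(torch.cat([h, t.softmax(-1)], -1))
        l = self.loc_head(torch.cat([h, t.softmax(-1), d.softmax(-1)], -1))
        return t, d, l                         # three multi-task outputs

t, d, l = RelayHeads()(torch.randn(32, 64))    # batch of 32 encodings
print(t.shape, d.shape, l.shape)
```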

[467] RAST: A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction

Weilin Ruan, Xilin Dang, Ziyu Zhou, Sisuo Lyu, Yuxuan Liang

Main category: cs.LG

TL;DR: RAST is a retrieval-augmented framework for traffic prediction that addresses limited contextual capacity and heterogeneous patterns by integrating retrieval mechanisms with spatio-temporal modeling.

Motivation: Despite progress in STGNNs and pre-trained models, traffic prediction still faces two key challenges: (1) limited contextual capacity when modeling complex spatio-temporal dependencies, and (2) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns.

Method: RAST integrates retrieval-augmented mechanisms with spatio-temporal modeling through three key designs: 1) Decoupled Encoder and Query Generator for capturing spatial/temporal features and constructing fusion queries, 2) Spatio-temporal Retrieval Store and Retrievers for maintaining and retrieving fine-grained patterns, and 3) Universal Backbone Predictor that accommodates various STGNNs or MLP predictors.

Result: Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.

Conclusion: RAST provides a universal framework that effectively addresses the limitations of existing traffic prediction methods by leveraging retrieval-augmented mechanisms to enhance contextual capacity and handle heterogeneous patterns at fine-grained spatio-temporal points.

Abstract: Traffic prediction is a cornerstone of modern intelligent transportation systems and a critical task in spatio-temporal forecasting. Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have achieved significant progress in traffic prediction, two key challenges remain: (i) limited contextual capacity when modeling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. Our framework consists of three key designs: 1) Decoupled Encoder and Query Generator to capture decoupled spatial and temporal features and construct a fusion query via residual fusion; 2) Spatio-temporal Retrieval Store and Retrievers to maintain and retrieve vectorized fine-grained patterns; and 3) Universal Backbone Predictor that flexibly accommodates pre-trained STGNNs or simple MLP predictors. Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
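
The retrieval step can be sketched with a made-up store layout: a fusion query fetches the top-k most similar stored fine-grained patterns by cosine similarity, and a backbone predictor would consume them downstream.

```python
# Top-k cosine retrieval from a vectorized spatio-temporal pattern store.
import numpy as np

def retrieve(store_keys, store_vals, query, k=4):
    sk = store_keys / np.linalg.norm(store_keys, axis=1, keepdims=True)
    scores = sk @ (query / np.linalg.norm(query))
    return store_vals[np.argsort(scores)[-k:]]   # k most similar patterns

keys = np.random.randn(1000, 32)           # stored pattern keys
vals = np.random.randn(1000, 32)           # stored fine-grained patterns
query = np.random.randn(32)                # fusion query from the encoder
print(retrieve(keys, vals, query).shape)   # -> (4, 32)
```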

[468] Adversarial Reinforcement Learning Framework for ESP Cheater Simulation

Inkyu Park, Jeong-Gwan Lee, Taehwan Kwon, Juheon Choi, Seungku Kim, Junsu Kim, Kimin Lee

Main category: cs.LG

TL;DR: A simulation framework for modeling ESP cheaters in games using reinforcement learning agents and adversarial game theory to study adaptive cheating behaviors and develop detectors.

Motivation: ESP cheats are hard to detect because their effects aren't directly observable in player behavior, making labeled data collection difficult. Cheaters also adapt their behavior to evade detection, further complicating anti-cheat system development.

Method: Proposed a simulation framework with RL agents representing cheaters (with hidden info access) and non-cheaters. Formulated cheater-detector interaction as adversarial game. Introduced structured cheater model that dynamically switches between cheating and non-cheating based on detection risk.

Result: Framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion. Provides controllable platform for studying cheating behaviors and developing detectors.

Conclusion: The work provides an extensible platform for studying adaptive cheating behaviors and developing effective cheat detectors in gaming environments where ESP cheats are particularly challenging to detect.

Abstract: Extra-Sensory Perception (ESP) cheats, which reveal hidden in-game information such as enemy locations, are difficult to detect because their effects are not directly observable in player behavior. The lack of observable evidence makes it difficult to collect reliably labeled data, which is essential for training effective anti-cheat systems. Furthermore, cheaters often adapt their behavior by limiting or disguising their cheat usage, which further complicates detection and detector development. To address these challenges, we propose a simulation framework for controlled modeling of ESP cheaters, non-cheaters, and trajectory-based detectors. We model cheaters and non-cheaters as reinforcement learning agents with different levels of observability, while detectors classify their behavioral trajectories. Next, we formulate the interaction between the cheater and the detector as an adversarial game, allowing both players to co-adapt over time. To reflect realistic cheater strategies, we introduce a structured cheater model that dynamically switches between cheating and non-cheating behaviors based on detection risk. Experiments demonstrate that our framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion. This work provides a controllable and extensible platform for studying adaptive cheating behaviors and developing effective cheat detectors.

[469] On the limitation of evaluating machine unlearning using only a single training seed

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

Main category: cs.LG

TL;DR: The paper warns that common empirical evaluation practices for machine unlearning algorithms can produce misleading results due to sensitivity to random training seeds, especially for deterministic MU methods.

Motivation: Most machine unlearning algorithms are approximate and require empirical evaluation, but current evaluation practices may produce non-representative results because they don't account for variability across different model training seeds.

Method: The authors demonstrate through analysis that deterministic MU methods are particularly sensitive to the random seed used during initial model training, showing that running MU multiple times from the same trained model can give misleading results.

Result: The paper shows that empirical comparisons of MU algorithms can be highly non-representative when only considering multiple runs from the same trained model, as this fails to capture the variability introduced by different training seeds.

Conclusion: Researchers should evaluate MU algorithms across different model training seeds to get representative empirical comparisons, rather than just running multiple trials from the same trained model.

Abstract: Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because – even for the same architecture and same dataset – some MU methods can be highly sensitive to the choice of random number seed used for model training. We illustrate that this is particularly relevant for MU methods that are deterministic, i.e., which always produce the same result when started from the same trained model. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
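
The recommended protocol fits in a few lines. In the toy sketch below, `train`, `unlearn`, and `evaluate` are trivial stand-ins; the point is only that the outer loop varies the training seed instead of re-running MU from a single trained model.

```python
# Evaluate a (deterministic) MU method across training seeds and report
# the spread, rather than repeating runs from one fixed trained model.
import numpy as np

def train(seed):
    rng = np.random.default_rng(seed)
    return rng.standard_normal(10)            # stand-in "trained weights"

def unlearn(model, forget_set):
    return model - 0.01 * forget_set          # stand-in deterministic MU

def evaluate(model):
    return float(np.linalg.norm(model))       # stand-in utility score

forget = np.ones(10)
scores = [evaluate(unlearn(train(s), forget)) for s in range(10)]
print(np.mean(scores), np.std(scores))        # variability across seeds
```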

[470] Optimal Approximation – Smoothness Tradeoffs for Soft-Max Functions

Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, Manolis Zampetakis

Main category: cs.LG

TL;DR: The paper analyzes optimal tradeoffs between approximation and smoothness in soft-max functions, introducing novel mechanisms with different optimality properties for various applications.

Motivation: To identify optimal approximation-smoothness tradeoffs for soft-max functions, as different applications require different balances between how well they approximate the maximum function and how sensitive they are to input changes.

Method: Introduces three novel soft-max functions: (1) exponential mechanism (optimal for expected additive approximation with Rényi Divergence smoothness), (2) piecewise linear soft-max (optimal for worst-case additive approximation with ℓq-norm smoothness), and (3) power mechanism (optimal for expected multiplicative approximation with Rényi Divergence smoothness).

Result: The piecewise linear mechanism enforces sparsity in output (important for ML applications) and outperforms exponential mechanism in Mechanism Design/Game Theory. The power mechanism provides improved results in differentially private submodular optimization.

Conclusion: Different soft-max functions are optimal for different applications based on specific approximation-smoothness tradeoffs, with each novel mechanism addressing limitations of existing approaches for particular use cases.

Abstract: A soft-max function has two main efficiency measures: (1) approximation - which corresponds to how well it approximates the maximum function, (2) smoothness - which shows how sensitive it is to changes of its input. Our goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. This leads to novel soft-max functions, each of which is optimal for a different application. The most commonly used soft-max function, called exponential mechanism, has optimal tradeoff between approximation measured in terms of expected additive approximation and smoothness measured with respect to Rényi Divergence. We introduce a soft-max function, called “piecewise linear soft-max”, with optimal tradeoff between approximation, measured in terms of worst-case additive approximation and smoothness, measured with respect to $\ell_q$-norm. The worst-case approximation guarantee of the piecewise linear mechanism enforces sparsity in the output of our soft-max function, a property that is known to be important in Machine Learning applications [Martins et al. ‘16, Laha et al. ‘18] and is not satisfied by the exponential mechanism. Moreover, the $\ell_q$-smoothness is suitable for applications in Mechanism Design and Game Theory where the piecewise linear mechanism outperforms the exponential mechanism. Finally, we investigate another soft-max function, called power mechanism, with optimal tradeoff between expected \textit{multiplicative} approximation and smoothness with respect to the Rényi Divergence, which provides improved theoretical and practical results in differentially private submodular optimization.
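
For contrast with the exponential mechanism (the ordinary softmax), the sketch below implements sparsemax (Martins et al. '16, cited in the abstract) as a concrete piecewise-linear map with sparse outputs; no claim is made that it coincides with the paper's piecewise linear soft-max.

```python
# Softmax (exponential mechanism) vs. a piecewise-linear sparse map
# (sparsemax: Euclidean projection of the scores onto the simplex).
import numpy as np

def softmax(z, temp=1.0):
    e = np.exp((z - z.max()) / temp)
    return e / e.sum()

def sparsemax(z):
    zs = np.sort(z)[::-1]                    # scores sorted descending
    css = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    k = ks[1 + ks * zs > css][-1]            # largest feasible support size
    tau = (css[k - 1] - 1) / k               # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.0, 0.1, -1.0])
print(softmax(z))     # dense: every entry strictly positive
print(sparsemax(z))   # sparse: low scores mapped to exactly zero
```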

[471] Can machines think efficiently?

Adam Winchell

Main category: cs.LG

TL;DR: The paper proposes updating the Turing Test by adding an energy efficiency constraint to evaluate intelligence through resource consumption, addressing ethical and environmental concerns.

Motivation: The original Turing Test is inadequate because advanced AI systems can already pass it, leading to serious ethical and environmental concerns. There's an urgent need to update the test to account for real-world resource constraints.

Method: The work expands the original Turing Test by adding an energy efficiency constraint - measuring the energy spent answering questions. This forces intelligence evaluation through the lens of resource efficiency.

Result: The proposed new test creates a measurable, practical finish line for evaluating intelligence that the original test lacks, connecting abstract thinking to concrete resource limitations.

Conclusion: The energy-constrained Turing Test compels society to weigh the time savings of AI against its total resource cost, providing a more realistic and practical framework for evaluating machine intelligence.

Abstract: The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.

[472] Active Learning with Neural Networks: Insights from Nonparametric Statistics

Yinglun Zhu, Robert Nowak

Main category: cs.LG

TL;DR: Deep active learning with neural networks achieves near-optimal label complexity under low noise conditions, with polylog(1/ε) complexity when using abstention.

Motivation: Deep neural networks require large labeled datasets, creating a gap between empirical successes of deep active learning and theoretical guarantees. The paper aims to provide rigorous label complexity guarantees for deep active learning to bridge this theory-practice gap.

Method: Studies deep active learning from nonparametric classification perspective. Uses neural networks under standard low noise conditions, and develops an efficient algorithm with abstention option. Extends analysis beyond Sobolev/Hölder spaces to Radon BV² spaces.

Result: Shows active learning with neural networks achieves minimax label complexity up to disagreement coefficient and logarithmic terms. With abstention, achieves polylog(1/ε) label complexity without low noise assumptions. Provides guarantees for learning in Radon BV² spaces.

Conclusion: The paper provides the first near-optimal label complexity guarantees for deep active learning, bridging theory and practice. Results show neural networks can achieve efficient label complexity under various conditions and function spaces.

Abstract: Deep neural networks have great representation power, but typically require large numbers of training examples. This motivates deep active learning methods that can significantly reduce the amount of labeled training data. Empirical successes of deep active learning have been recently reported in the literature; however, rigorous label complexity guarantees of deep active learning have remained elusive. This constitutes a significant gap between theory and practice. This paper tackles this gap by providing the first near-optimal label complexity guarantees for deep active learning. The key insight is to study deep active learning from the nonparametric classification perspective. Under standard low noise conditions, we show that active learning with neural networks can provably achieve the minimax label complexity, up to disagreement coefficient and other logarithmic terms. When equipped with an abstention option, we further develop an efficient deep active learning algorithm that achieves $\mathsf{polylog}(\frac{1}{\varepsilon})$ label complexity, without any low noise assumptions. We also provide extensions of our results beyond the commonly studied Sobolev/Hölder spaces and develop label complexity guarantees for learning in Radon $\mathsf{BV}^2$ spaces, which have recently been proposed as natural function spaces associated with neural networks.

[473] The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing

Xingyu Xu, Yandi Shen, Yuejie Chi, Cong Ma

Main category: cs.LG

TL;DR: ScaledGD(λ) is a preconditioned gradient descent method for low-rank matrix sensing that handles unknown rank and ill-conditioned matrices using overparameterized factor representations with damped preconditioning.

Motivation: The paper addresses the challenge of low-rank matrix sensing when the true rank is unknown and matrices may be ill-conditioned. Vanilla gradient descent suffers from slow convergence with polynomial dependency on condition number, especially with overparameterization which introduces bad curvatures.

Method: ScaledGD(λ) uses overparameterized factor representations with small random initialization. It employs gradient descent with a specific form of damped preconditioning to combat bad curvatures from overparameterization and ill-conditioning. The preconditioner adds light computational overhead but improves robustness.

Result: Under Gaussian design, ScaledGD(λ) converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with condition number and problem dimension. This significantly improves over vanilla GD’s polynomial dependency on condition number.

Conclusion: Preconditioning can accelerate convergence without hurting generalization in overparameterized learning. ScaledGD(λ) demonstrates remarkable robustness to ill-conditioning compared to vanilla GD, even with overparameterization.

Abstract: We propose $\textsf{ScaledGD}(\lambda)$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD}(\lambda)$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD}(\lambda)$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD}(\lambda)$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
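
The update itself is a single line: a gradient step right-multiplied by the damped preconditioner $(X^\top X + \lambda I)^{-1}$. The sketch below substitutes full-observation least squares for the Gaussian sensing operator, so it illustrates the overparameterized dynamics rather than the paper's sensing setting.

```python
# ScaledGD(lambda)-style iteration for M ~ X X^T with overparameterized
# rank and small random initialization (full observations as a stand-in).
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r_over = 30, 2, 5
U = rng.standard_normal((n, r_true))
M = U @ U.T                                    # ground-truth low-rank matrix

X = 1e-3 * rng.standard_normal((n, r_over))    # small random init
eta, lam = 0.1, 1e-2
for _ in range(500):
    grad = (X @ X.T - M) @ X                   # grad of (1/4)||XX^T - M||_F^2
    precond = np.linalg.inv(X.T @ X + lam * np.eye(r_over))
    X -= eta * grad @ precond                  # damped preconditioned step

print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))   # small residual
```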

[474] HiGen: Hierarchical Graph Generative Networks

Mahdi Karami

Main category: cs.LG

TL;DR: HiGen: A hierarchical graph generative network that captures graph hierarchy and generates graphs in coarse-to-fine fashion, achieving state-of-the-art performance.

Motivation: Most real-world graphs have hierarchical structures that existing graph generation methods overlook, limiting their ability to capture complex graph patterns.

Method: Proposes a hierarchical generative model that generates communities in parallel at each level, then predicts cross-edges between communities using separate neural networks. Uses multinomial distribution for edge weights and recursive factorization for autoregressive generation.

Result: Empirical studies show state-of-the-art performance in graph quality across various benchmark datasets, with scalable generation for large and complex graphs.

Conclusion: The proposed hierarchical approach effectively captures graph structure, enables scalable generation, and outperforms existing methods in generating realistic graphs.

Abstract: Most real-world graphs exhibit a hierarchical structure, which is often overlooked by existing graph generation methods. To address this limitation, we propose a novel graph generative network that captures the hierarchical nature of graphs and successively generates the graph sub-structures in a coarse-to-fine fashion. At each level of hierarchy, this model generates communities in parallel, followed by the prediction of cross-edges between communities using separate neural networks. This modular approach enables scalable graph generation for large and complex graphs. Moreover, we model the output distribution of edges in the hierarchical graph with a multinomial distribution and derive a recursive factorization for this distribution. This enables us to generate community graphs with integer-valued edge weights in an autoregressive manner. Empirical studies demonstrate the effectiveness and scalability of our proposed generative model, achieving state-of-the-art performance in terms of graph quality across various benchmark datasets. The code is available at https://github.com/Karami-m/HiGen_main.

[475] Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling

Mahdi Karami, Ali Ghodsi

Main category: cs.LG

TL;DR: Orchid introduces a novel architecture with data-dependent global convolution to replace quadratic attention, achieving high expressivity with quasilinear scaling for long sequences.

Motivation: Address the quadratic complexity of traditional attention mechanisms while maintaining ability to capture long-range dependencies and in-context learning.

Method: Uses data-dependent global convolution layer with conditioning neural networks that maintain shift equivariance, enabling dynamic kernel adaptation based on input sequences.

Result: Outperforms BERT and Vision Transformers with smaller model sizes, extends feasible sequence length beyond dense attention limitations across language modeling and image classification tasks.

Conclusion: Orchid represents a significant step toward more efficient and scalable deep learning models for sequence modeling by replacing quadratic attention with quasilinear data-dependent convolution.

Abstract: In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling. The code is available at https://github.com/Karami-m/orchid.
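
A data-dependent global convolution is easy to sketch: some shift-equivariant function of the input produces the kernel, and the convolution itself runs in O(L log L) via FFT. The conditioning rule below (the input's circular autocorrelation, squashed by tanh) is only a stand-in that preserves shift equivariance; Orchid's conditioning networks are learned.

```python
# Data-dependent global circular convolution via FFT, with a toy
# shift-equivariant conditioning rule generating the kernel from x.
import numpy as np

def data_dependent_conv(x):
    spec = np.fft.fft(x)
    kernel = np.tanh(np.fft.ifft(spec * np.conj(spec)).real)  # autocorr of x
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)).real

x = np.random.randn(64)
y = data_dependent_conv(x)
# Shift equivariance: shifting the input shifts the output identically.
assert np.allclose(np.roll(y, 5), data_dependent_conv(np.roll(x, 5)))
```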

[476] Spectral Convolutional Conditional Neural Processes

Peiman Mohseni, Nick Duffield

Main category: cs.LG

TL;DR: Spectral ConvCNPs (SConvCNPs) enhance neural processes by performing global convolution in the frequency domain to capture long-range dependencies efficiently, overcoming limitations of local CNN kernels.

Motivation: Neural Processes need to model infinite-dimensional stochastic processes, but early versions used finite-dimensional global representations via aggregation. ConvCNPs addressed this with functional embeddings but CNNs with local kernels struggle with long-range dependencies without expensive large kernels.

Method: Propose Spectral ConvCNPs (SConvCNPs) that perform global convolution in the frequency domain, inspired by Fourier neural operators. Directly parameterize convolution kernels in the frequency domain to leverage compact Fourier representations of natural signals.

Result: Validated effectiveness on both synthetic and real-world datasets, demonstrating improved capabilities for capturing long-range dependencies while maintaining computational efficiency.

Conclusion: SConvCNPs advance neural process capabilities by applying operator learning ideas, specifically frequency-domain global convolution, to overcome limitations of local spatial kernels in capturing long-range dependencies.

Abstract: Neural Processes (NPs) are meta-learning models that learn to map sets of observations to approximations of the corresponding posterior predictive distributions. By accommodating variable-sized, unstructured collections of observations and enabling probabilistic predictions at arbitrary query points, NPs provide a flexible framework for modeling functions over continuous domains. Since their introduction, numerous variants have emerged; however, early formulations shared a fundamental limitation: they compressed the observed data into finite-dimensional global representations via aggregation operations such as mean pooling. This strategy induces an intrinsic mismatch with the infinite-dimensional nature of the stochastic processes that NPs intend to model. Convolutional conditional neural processes (ConvCNPs) address this limitation by constructing infinite-dimensional functional embeddings processed through convolutional neural networks (CNNs) to enforce translation equivariance. Yet CNNs with local spatial kernels struggle to capture long-range dependencies without resorting to large kernels, which impose significant computational costs. To overcome this limitation, we propose spectral ConvCNPs (SConvCNPs), which perform global convolution in the frequency domain. Inspired by Fourier neural operators (FNOs) for learning solution operators of partial differential equations (PDEs), our approach directly parameterizes convolution kernels in the frequency domain, leveraging the relatively compact yet global Fourier representation of many natural signals. We validate the effectiveness of SConvCNPs on both synthetic and real-world datasets, demonstrating how ideas from operator learning can advance the capabilities of NPs.

[477] Jacobian-Enhanced Neural Networks

Steven H. Berguin

Main category: cs.LG

TL;DR: JENN improves neural network accuracy with fewer training points by predicting partial derivatives, making it superior for surrogate-based optimization where gradient accuracy is critical.

Motivation: In computer-aided design, there's a need to replace expensive physics-based models with fast surrogate models. Gradient-enhanced methods are particularly valuable for surrogate-based optimization, but standard neural networks don't accurately predict partial derivatives, limiting their effectiveness in optimization applications.

Method: JENN modifies the training process of densely connected multi-layer perceptrons to accurately predict partial derivatives, creating a Jacobian-Enhanced Neural Network that learns both function values and their derivatives.

Result: JENN achieves better accuracy with fewer training points compared to standard neural networks, and provides accurate partial derivatives that are critical for surrogate-based optimization applications.

Conclusion: JENN demonstrates superiority over standard neural networks for surrogate-based optimization by providing accurate gradient information, making it particularly valuable in computer-aided design and other fields requiring fast, accurate surrogate models for optimization.

Abstract: Jacobian-Enhanced Neural Networks (JENN) are densely connected multi-layer perceptrons, whose training process is modified to predict partial derivatives accurately. Their main benefit is better accuracy with fewer training points compared to standard neural networks. These attributes are particularly desirable in the field of computer-aided design, where there is often the need to replace computationally expensive, physics-based models with fast running approximations, known as surrogate models or meta-models. Since a surrogate emulates the original model accurately in near-real time, it yields a speed benefit that can be used to carry out orders of magnitude more function calls quickly. However, in the special case of gradient-enhanced methods, there is the additional value proposition that partial derivatives are accurate, which is a critical property for one important use-case: surrogate-based optimization. This work derives the complete theory and exemplifies its superiority over standard neural nets for surrogate-based optimization.
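
The modified training objective amounts to a derivative-matching term added to the usual regression loss, with the network's input-derivatives supplied by automatic differentiation. A minimal 1-D sketch with a hypothetical target function:

```python
# Jacobian-enhanced training: penalize errors in both values and partials.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(-1, 1, 20).unsqueeze(1)        # few training points
y, dy = torch.sin(3 * x), 3 * torch.cos(3 * x)    # values and derivatives

for _ in range(500):
    xr = x.clone().requires_grad_(True)
    pred = net(xr)
    dpred = torch.autograd.grad(pred.sum(), xr, create_graph=True)[0]
    loss = ((pred - y) ** 2).mean() + ((dpred - dy) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```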

[478] UnPaSt: unsupervised patient stratification by biclustering of omics data

Michael Hartung, Andreas Maier, Yuliya Burankova, Fernando Delgado-Chaves, Olga I. Isaeva, Alexey Savchik, Fábio Malta de Sá Patroni, Jens J. G. Lohmann, Daniel He, Casey Shannon, Jan-Ole Schulze, Katharina Kaufmann, Zoe Chervontseva, Farzaneh Firoozbakht, Anne Hartebrodt, Niklas Probul, Olga Tsoy, Alexandra Abisheva, Evgenia Zotova, Kavya Singh, Kristel Van Steen, Malte Kuehl, Victor G. Puelles, David B. Blumenthal, Martin Ester, Tanja Laske, Jan Baumbach, Olga Zolotareva

Main category: cs.LG

TL;DR: UnPaSt is a novel biclustering algorithm for unsupervised patient stratification that outperforms existing methods in identifying disease subtypes, especially for non-mutually exclusive subtypes or those with few biomarkers.

Motivation: Current unsupervised patient stratification methods are primarily benchmarked on cancers with mutually exclusive, well-differentiated subtypes, but they perform poorly for non-oncological diseases with non-mutually exclusive subtypes or subtypes discriminated by few biomarkers.

Method: Developed UnPaSt, a biclustering algorithm based on differentially expressed biclusters. Evaluated 22 unsupervised methods (clustering and biclustering) using simulated and real transcriptomics data before developing the new approach.

Result: UnPaSt outperformed widely used patient stratification methods in identifying known subtypes of breast cancer and asthma. It detected biologically insightful patterns across diverse data types including bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, and multi-omics datasets.

Conclusion: UnPaSt provides a more nuanced and interpretable view of high-throughput data heterogeneity than traditional methods, advancing precision medicine by enabling better disease subtype discovery for complex, non-oncological diseases.

Abstract: Unsupervised patient stratification is essential for disease subtype discovery, yet, despite growing evidence of molecular heterogeneity of non-oncological diseases, popular methods are benchmarked primarily using cancers with mutually exclusive molecular subtypes well-differentiated by numerous biomarkers. Evaluating 22 unsupervised methods, including clustering and biclustering, using simulated and real transcriptomics data revealed their inefficiency in scenarios with non-mutually exclusive subtypes or subtypes discriminated only by few biomarkers. To address these limitations and advance precision medicine, we developed UnPaSt, a novel biclustering algorithm for unsupervised patient stratification based on differentially expressed biclusters. UnPaSt outperformed widely used patient stratification approaches in the de novo identification of known subtypes of breast cancer and asthma. In addition, it detected many biologically insightful patterns across bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, and multi-omics datasets, enabling a more nuanced and interpretable view of high-throughput data heterogeneity than traditionally used methods.

[479] OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction

Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh

Main category: cs.LG

TL;DR: OPTIMA: A practical one-shot post-training pruning method that uses row-wise Quadratic Programs with shared Hessian structure for optimal weight updates, achieving better accuracy-efficiency trade-offs than existing methods.

Motivation: Current post-training pruning methods face a trade-off: simple heuristics are fast but degrade accuracy, while principled optimization methods recover accuracy but are computationally infeasible at modern scale. There's a need for a method that balances accuracy and scalability.

Method: OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. It implements an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling efficient one-shot pruning without fine-tuning.

Result: OPTIMA achieves up to 3.97% absolute accuracy improvement across multiple LLM families and sparsity regimes. It prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory on an NVIDIA H100, setting new state-of-the-art accuracy-efficiency trade-offs.

Conclusion: OPTIMA provides a practical solution to the accuracy-scalability trade-off in post-training pruning, offering globally optimal weight updates through efficient QP solving while maintaining computational feasibility at modern model scales.

Abstract: Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set new state-of-the-art accuracy-efficiency trade-offs for one-shot post-training pruning.
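
The per-row QP has a standard closed form: with a shared Hessian H = XᵀX and the pruned coordinates pinned to zero, the kept coordinates solve a small linear system. The sketch below handles a single row with a magnitude mask; the batched accelerator solver is the paper's contribution and is not reproduced here.

```python
# Closed-form row-wise reconstruction: minimize (v-w)^T H (v-w) subject
# to v[prune] = 0, giving v[keep] = w[keep] + H_kk^{-1} H_kp w[prune].
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))        # calibration inputs to the layer
H = X.T @ X                               # shared layer Hessian
w = rng.standard_normal(16)               # one original weight row
keep = np.argsort(np.abs(w))[8:]          # magnitude mask: keep top half
prune = np.setdiff1d(np.arange(16), keep)

v = np.zeros(16)
v[keep] = w[keep] + np.linalg.solve(H[np.ix_(keep, keep)],
                                    H[np.ix_(keep, prune)] @ w[prune])

naive = np.where(np.isin(np.arange(16), keep), w, 0.0)  # just zero weights
err = lambda u: np.linalg.norm(X @ (u - w))
print(err(v), err(naive))                 # optimal update has lower error
```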

[480] Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching

Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman

Main category: cs.LG

TL;DR: Discrete flow matching for categorical data with dynamic-optimal-transport minimization reduces state transitions 8x while maintaining performance, plus new perplexity bounds and Multimask Flows that outperform masked flows.

Motivation: Discrete flow matching shows competitive performance but faces challenges: rectification strategy can't be applied due to stochastic discrete paths, and path nondeterminism prevents precise probability estimation like continuous flows.

Method: Proposes dynamic-optimal-transport-like minimization objective with Kantorovich formulation for discrete flows with convex interpolants. Introduces two upper bounds on perplexity for principled training/evaluation. Develops Multimask Flows that outperform masked flows.

Result: For BoW-sourced flows, reduces transitions up to 8 times (1024 to 128) to reach same generative perplexity without compromising diversity. Multimask Flows with minibatch Optimal Transport outperform masked flows in perplexity while maintaining diversity.

Conclusion: The proposed methods enable efficient discrete flow matching with reduced state transitions, principled training via perplexity bounds, and improved performance with Multimask Flows, advancing categorical data modeling beyond autoregressive approaches.

Abstract: Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. In the case of bag-of-words (BoW) sourced flows, we show that such methods can reduce the number of transitions up to 8 times (1024 to 128) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation and model comparison. Finally, we introduce Multimask Flows which outperform masked flows in generative perplexity, particularly when utilizing minibatch Optimal Transport, without sacrificing diversity.
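
The minibatch coupling can be sketched directly: within a batch, pair source and target sequences so the total inter-state cost (Hamming distance here) is minimized. SciPy's Hungarian solver stands in for whatever assignment method the paper uses.

```python
# Minibatch optimal pairing of discrete sequences by Hamming cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
B, L, V = 16, 32, 100
source = rng.integers(0, V, size=(B, L))   # e.g. BoW-sourced sequences
target = rng.integers(0, V, size=(B, L))   # data sequences

# Pairwise cost: positions that would have to transition along the path.
cost = (source[:, None, :] != target[None, :, :]).sum(-1)
rows, cols = linear_sum_assignment(cost)
print(cost[rows, cols].sum(), cost.trace())  # coupled cost <= naive pairing
```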

[481] The Generalization Error of Supervised Machine Learning Algorithms

Samir M. Perlaza, Xinying Zou

Main category: cs.LG

TL;DR: Introduces “method of gaps” for deriving closed-form expressions of generalization error using information measures, distinguishing algorithm-driven and data-driven gaps.

Motivation: To develop a unified framework for deriving exact expressions of generalization error in supervised machine learning using information-theoretic measures, connecting existing results and enabling new insights.

Method: Uses “gaps” concept measuring variation of expected empirical risk when either model or dataset is fixed, resulting in algorithm-driven gaps (fixed dataset) and data-driven gaps (fixed model). Both types have closed-form expressions in terms of relative entropies involving Gibbs probability measures.

Result: Method can reproduce all existing exact expressions for generalization error and yields numerous new exact expressions, establishing connections with other statistical areas.

Conclusion: The method of gaps provides a systematic approach to analyze generalization error, revealing Gibbs measures as natural references for supervised learning algorithms and worst-case data-generating distributions.

Abstract: In this paper, the method of gaps, a technique for deriving closed-form expressions in terms of information measures for the generalization error of supervised machine learning algorithms, is introduced. The method relies on the notion of \emph{gaps}, which characterize the variation of the expected empirical risk (when either the model or dataset is kept fixed) with respect to changes in the probability measure on the varying parameter (either the dataset or the model, respectively). This distinction results in two classes of gaps: Algorithm-driven gaps (fixed dataset) and data-driven gaps (fixed model). In general, the method relies on two central observations: $(i)$ The generalization error is the expectation of an algorithm-driven gap or a data-driven gap. In the first case, the expectation is with respect to a measure on the datasets; and in the second case, with respect to a measure on the models. $(ii)$ Both algorithm-driven gaps and data-driven gaps exhibit closed-form expressions in terms of relative entropies. In particular, algorithm-driven gaps involve a Gibbs probability measure on the set of models, which represents a supervised Gibbs algorithm. Alternatively, data-driven gaps involve a worst-case data-generating (WCDG) probability measure on the set of data points, which is also a Gibbs probability measure. Interestingly, such Gibbs measures, which are exogenous to the analysis of generalization, place both the supervised Gibbs algorithm and the WCDG probability measure as natural references for the analysis of supervised learning algorithms. All existing exact expressions for the generalization error of supervised machine learning algorithms can be obtained with the proposed method. The method also yields numerous new exact expressions, establishing connections with other areas of statistics.

[482] Private Linear Regression with Differential Privacy and PAC Privacy

Hillary Yang, Yuntao Du

Main category: cs.LG

TL;DR: Systematic comparison of linear regression models trained with differential privacy vs. PAC privacy across three real-world datasets, revealing key findings about performance trade-offs.

Motivation: Linear regression is fundamental for statistical analysis, and while differential privacy has been well-established for privacy-preserving linear regression, the newly proposed PAC Privacy framework hasn't been explored in this context. The paper aims to fill this gap by comparing both approaches.

Method: The authors systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets. They evaluate the performance of both privacy-preserving approaches in practical applications.

Result: The study observes several key findings that impact the performance of privacy-preserving linear regression, likely involving trade-offs among privacy guarantees, model accuracy, and computational efficiency across the two approaches.

Conclusion: The comparison between differential privacy and PAC privacy for linear regression provides important insights for practitioners choosing privacy-preserving methods, highlighting the relative strengths and limitations of each approach in real-world applications.

Abstract: Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.
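
For orientation, one standard differentially private baseline for this task perturbs the sufficient statistics before solving the normal equations. The sketch below is schematic: the noise scale is a placeholder rather than a calibrated (ε, δ) analysis, and no claim is made that it matches the mechanisms compared in the paper.

```python
# Sufficient-statistics perturbation for private linear regression
# (schematic noise scale; features and labels clipped to bound sensitivity).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.uniform(-1, 1, size=(n, d))            # bounded features
beta_true = rng.standard_normal(d)
y = np.clip(X @ beta_true + 0.1 * rng.standard_normal(n), -5.0, 5.0)

sigma = 1.0                                    # placeholder noise scale
XtX = X.T @ X + sigma * rng.standard_normal((d, d))
XtX = (XtX + XtX.T) / 2                        # keep the statistic symmetric
Xty = X.T @ y + sigma * rng.standard_normal(d)
beta_dp = np.linalg.solve(XtX + 0.1 * np.eye(d), Xty)   # ridge for stability
print(np.linalg.norm(beta_dp - beta_true))
```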

[483] Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models

Zong Ke, Shicheng Zhou, Yining Zhou, Chia Hong Chang, Rong Zhang

Main category: cs.LG

TL;DR: This paper proposes a GAN-based model to detect AI deepfakes and fraudulent activities in online payment systems, achieving over 95% detection accuracy for distinguishing legitimate transactions from manipulated images.

Motivation: The growing prevalence of deepfake technology that manipulates facial features in images/videos has escalated fraud potential in online transactions, while traditional security systems struggle to identify these sophisticated fraud forms.

Method: The research proposes a novel GAN-based model trained on a dataset of real-world online payment images and deepfake images generated using advanced GAN architectures like StyleGAN and DeepFake, focusing on identifying subtle manipulations in payment images.

Result: The proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%, significantly improving payment system robustness against AI-driven fraud.

Conclusion: This approach contributes to digital security by demonstrating the effective application of GANs for fraud detection in financial services, offering enhanced protection against sophisticated AI-driven fraudulent activities in online payment systems.

Abstract: This study explores the use of Generative Adversarial Networks (GANs) to detect AI deepfakes and fraudulent activities in online payment systems. With the growing prevalence of deepfake technology, which can manipulate facial features in images and videos, the potential for fraud in online transactions has escalated. Traditional security systems struggle to identify these sophisticated forms of fraud. This research proposes a novel GAN-based model that enhances online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset consisting of real-world online payment images and deepfake images generated using advanced GAN architectures, such as StyleGAN and DeepFake. The results demonstrate that the proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%. This approach significantly improves the robustness of payment systems against AI-driven fraud. The paper contributes to the growing field of digital security, offering insights into the application of GANs for fraud detection in financial services. Keywords: Payment Security, Image Recognition, Generative Adversarial Networks, AI Deepfake, Fraudulent Activities

[484] Knowledge-Driven Federated Graph Learning on Model Heterogeneity

Zhengyu Wu, Guang Zeng, Huilin Lai, Daohan Su, Jishuo Jia, Yinlin Zhu, Xunkai Li, Rong-Hua Li, Guoren Wang, Chenghu Zhou

Main category: cs.LG

TL;DR: FedGKC framework addresses model-centric heterogeneous federated graph learning by introducing client copilot models for knowledge exchange and server-side knowledge-aware aggregation, achieving 3.88% average accuracy gain.

DetailsMotivation: Existing federated graph learning approaches assume homogeneous client models, but real-world scenarios involve organizations using GNNs of different scales and architectures (model-centric heterogeneous FGL). This architectural diversity undermines server-side aggregation and complicates knowledge transfer across clients.

Method: FedGKC introduces lightweight Copilot Models on each client to facilitate knowledge exchange despite heterogeneous architectures. It employs: 1) Client-side Self-Mutual Knowledge Distillation - bidirectional distillation between local and copilot models with multi-view perturbation; 2) Server-side Knowledge-Aware Model Aggregation - dynamically assigns weights based on client-provided knowledge.
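
A minimal sketch of the client-side self-mutual distillation idea: the heterogeneous local model and the lightweight copilot teach each other through symmetric soft-label KL terms on shared data. The temperature, the loss weighting, and the omission of multi-view perturbation are simplifying assumptions, not FedGKC's exact loss.

```python
# Minimal sketch (PyTorch) of bidirectional (self-mutual) knowledge distillation
# between a heterogeneous local GNN and a lightweight copilot model.
import torch.nn.functional as F

def bidirectional_kd_loss(local_logits, copilot_logits, labels, T=2.0, alpha=0.5):
    ce_local = F.cross_entropy(local_logits, labels)
    ce_copilot = F.cross_entropy(copilot_logits, labels)
    # Local model learns from the copilot's softened predictions ...
    kd_l = F.kl_div(F.log_softmax(local_logits / T, dim=-1),
                    F.softmax(copilot_logits.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    # ... and the copilot learns from the local model (knowledge flows both ways).
    kd_c = F.kl_div(F.log_softmax(copilot_logits / T, dim=-1),
                    F.softmax(local_logits.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    return ce_local + ce_copilot + alpha * (kd_l + kd_c)
```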

Result: Extensive experiments on eight benchmark datasets show FedGKC achieves average accuracy gain of 3.88% over baselines in model-centric heterogeneous FGL scenarios, while maintaining excellent performance in homogeneous settings.

Conclusion: FedGKC effectively addresses model-centric heterogeneous federated graph learning by enabling knowledge collaboration across architecturally diverse clients through copilot models and knowledge-aware aggregation, demonstrating significant performance improvements.

Abstract: Federated graph learning (FGL) has emerged as a promising paradigm for collaborative graph representation learning, enabling multiple parties to jointly train models while preserving data privacy. However, most existing approaches assume homogeneous client models and largely overlook the challenge of model-centric heterogeneous FGL (MHtFGL), which frequently arises in practice when organizations employ graph neural networks (GNNs) of different scales and architectures. Such architectural diversity not only undermines smooth server-side aggregation, which presupposes a unified representation space shared across clients’ updates, but also further complicates the transfer and integration of structural knowledge across clients. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework. FedGKC introduces a lightweight Copilot Model on each client to facilitate knowledge exchange while local architectures are heterogeneous across clients, and employs two complementary mechanisms: Client-side Self-Mutual Knowledge Distillation, which transfers effective knowledge between local and copilot models through bidirectional distillation with multi-view perturbation; and Server-side Knowledge-Aware Model Aggregation, which dynamically assigns aggregation weights based on knowledge provided by clients. Extensive experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy gain of 3.88% over baselines in MHtFGL scenarios, while maintaining excellent performance in homogeneous settings.

[485] Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton

Main category: cs.LG

TL;DR: TMO is a local-cloud LLM inference system with three-M offloading (multi-modal, multi-task, multi-dialogue) that uses reinforcement learning to optimize where to process tasks between local and cloud LLMs, balancing response quality, latency, and cost.

DetailsMotivation: LLMs present deployment challenges: local deployment faces computational/memory/energy constraints, while cloud deployment lacks real-time guarantees and incurs communication costs. Need a hybrid approach that leverages both local and cloud resources efficiently.

Method: TMO combines lightweight local LLM for simple tasks and large cloud LLM for multi-modal data. Uses resource-constrained reinforcement learning (RCRL) to optimize inference location (local vs cloud) and multi-modal data sources per task/dialogue, maximizing long-term reward (quality, latency, cost) under constraints.
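
As a loose illustration of the offloading objective (not TMO's actual learned policy), the sketch below scores a local versus cloud option with a quality/latency/cost trade-off and respects a remaining cloud budget; all weights and the greedy decision rule are assumptions.

```python
# Minimal sketch of the offloading decision signal in a TMO-style system:
# the objective trades response quality against latency and usage cost.
# TMO learns this with constrained reinforcement learning; the greedy rule
# and weights here are illustrative only.
def step_reward(quality: float, latency: float, cost: float,
                w_latency: float = 0.1, w_cost: float = 0.05) -> float:
    return quality - w_latency * latency - w_cost * cost

def choose_location(est_local: dict, est_cloud: dict, budget_left: float) -> str:
    """Pick local vs. cloud for one task/dialogue turn."""
    # Respect the resource constraint: fall back to the local LLM when the
    # remaining cloud budget cannot cover the estimated usage cost.
    if est_cloud["cost"] > budget_left:
        return "local"
    r_local = step_reward(**est_local)
    r_cloud = step_reward(**est_cloud)
    return "cloud" if r_cloud > r_local else "local"

print(choose_location({"quality": 0.6, "latency": 0.2, "cost": 0.0},
                      {"quality": 0.9, "latency": 1.5, "cost": 0.8},
                      budget_left=1.0))
```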

Result: TMO outperforms exploration-decision and LLM-as-Agent baselines with significant improvements in latency, cost, and response quality. Authors also created M4A1 dataset containing reward/cost metrics across multiple modality, task, dialogue, and LLM configurations.

Conclusion: TMO effectively addresses LLM deployment challenges through intelligent local-cloud offloading, demonstrating that hybrid systems with optimized resource allocation can deliver better performance than pure local or cloud approaches.

Abstract: Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.

[486] Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs

Pascal Passigan, Kevin Zhu, Angelina Ning

Main category: cs.LG

TL;DR: A graph neural network framework predicts drug-induced gene expression changes by integrating biomedical knowledge graphs with multimodal embeddings, outperforming baseline models and demonstrating the value of graph structure for mechanistic drug modeling.

DetailsMotivation: Current deep learning frameworks focus on link prediction and binary drug-disease associations rather than gene perturbation prediction, which could reveal mechanistic transcriptomic effects of drugs for understanding drug mechanisms, predicting off-target effects, and identifying repurposing opportunities.

Method: Constructed a merged biomedical graph integrating PrimeKG++ (augmented with semantic embeddings) and LINCS L1000 drug/cell line nodes with multimodal embeddings from foundation models (MolFormerXL, BioBERT). Trained a graph attention network (GAT) with prediction head to learn delta expression profiles of 978 landmark genes for drug-cell pairs.
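
A minimal PyTorch Geometric sketch of the modeling pattern described above: a GAT encoder over the merged graph, with a head that maps the concatenated drug and cell-line node embeddings to a 978-dimensional delta-expression vector. Input and hidden sizes are illustrative assumptions.

```python
# Minimal sketch (PyTorch Geometric): GAT encoder plus a prediction head for
# 978-gene delta expression. Feature sizes and depth are assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class PerturbationGAT(nn.Module):
    def __init__(self, in_dim=768, hidden=256, heads=4, n_genes=978):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)  # multi-head attention
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        # Head consumes the concatenated drug and cell-line node embeddings.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, n_genes))

    def forward(self, x, edge_index, drug_idx, cell_idx):
        h = torch.relu(self.gat1(x, edge_index))
        h = self.gat2(h, edge_index)
        pair = torch.cat([h[drug_idx], h[cell_idx]], dim=-1)
        return self.head(pair)  # predicted delta expression per landmark gene
```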

Result: The framework outperforms MLP baselines for differentially expressed genes prediction under scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization show that biomedical knowledge graph edges enhance perturbation-level prediction.

Conclusion: The framework provides a path toward mechanistic drug modeling by moving beyond binary drug-disease associations to predict granular transcriptional effects of therapeutic interventions, demonstrating the value of integrating biomedical knowledge graphs with multimodal embeddings.

Abstract: Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of over 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) – which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression – under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.

[487] Revisiting Agnostic Boosting

Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice, Yuxin Sun

Main category: cs.LG

TL;DR: The paper proposes a new agnostic boosting algorithm with significantly improved sample complexity and provides nearly-matching lower bounds, settling the sample complexity of agnostic boosting up to logarithmic factors.

DetailsMotivation: While boosting is well-studied in the realizable case (where the labels are consistent with some hypothesis in the class), the statistical properties of weak-to-strong learning remain less understood in the agnostic setting where there are no assumptions on the label distribution. The authors aim to address this gap by developing better agnostic boosting algorithms.

Method: The approach uses a reduction to the realizable case followed by margin-based filtering of high-quality hypotheses. This allows the algorithm to identify and leverage the most reliable weak learners in the agnostic setting.

Result: The proposed algorithm achieves substantially improved sample complexity compared to prior works under very general assumptions. The authors also show a nearly-matching lower bound, establishing the optimal sample complexity of agnostic boosting up to logarithmic factors.

Conclusion: The work settles the sample complexity of agnostic boosting, providing both an efficient algorithm with improved performance and matching lower bounds that establish the fundamental limits of what can be achieved in this setting.

Abstract: Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remain less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering of high-quality hypotheses. Furthermore, we show a nearly-matching lower bound, settling the sample complexity of agnostic boosting up to logarithmic factors.

[488] Adjusted Count Quantification Learning on Graphs

Clemens Damke, Eyke Hüllermeier

Main category: cs.LG

TL;DR: Extends Adjusted Classify & Count (ACC) to graphs, proposes structural importance sampling (SIS) for covariate shift, and Neighborhood-aware ACC for non-homophilic edges.

DetailsMotivation: Quantification learning for graph-structured data has only been addressed via node clustering methods, and the prior probability shift assumption of ACC is often not applicable to graph quantification problems.

Method: Extends ACC to graphs, proposes structural importance sampling (SIS) for structural covariate shift, and develops Neighborhood-aware ACC to handle non-homophilic edges.
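
For reference, the classical binary ACC adjustment that the paper generalizes to graphs: invert the classifier's confusion behavior to turn a raw predicted-positive rate into a prevalence estimate. (The paper's graph-specific SIS and neighborhood-aware corrections are not shown.)

```python
# Classic binary Adjusted Classify & Count: the observed positive rate mixes
# true and false positives, so solve  pred = tpr * p + fpr * (1 - p)  for p.
import numpy as np

def acc_prevalence(pred_pos_rate: float, tpr: float, fpr: float) -> float:
    p = (pred_pos_rate - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))  # clip to a valid prevalence

print(acc_prevalence(pred_pos_rate=0.35, tpr=0.8, fpr=0.1))  # ~0.357
```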

Result: Shows effectiveness of proposed techniques on multiple graph quantification tasks, addressing limitations of previous approaches.

Conclusion: Introduces first graph quantification method applicable under structural covariate shift and improves quantification in presence of non-homophilic edges.

Abstract: Quantification learning is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify & Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not applicable to graph quantification problems. To address this issue, we propose structural importance sampling (SIS), the first graph quantification method that is applicable under (structural) covariate shift. Additionally, we propose Neighborhood-aware ACC, which improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.

[489] ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding

Santosh Rajagopalan, Jonathan Vronsky, Songbai Yan, S. Alireza Golestaneh, Shubhra Chandra, Min Zhou

Main category: cs.LG

TL;DR: ALF is a multi-modal transformer model for understanding advertiser behavior across text, image, video, and structured data, achieving SOTA performance on fraud detection, policy violation identification, and advertiser similarity matching.

DetailsMotivation: To create a unified understanding of advertiser behavior and intent across multiple data modalities (text, image, video, structured data) for improved advertising platform safety and effectiveness.

Method: Multi-modal transformer architecture using contrastive learning and multi-task optimization with novel components including multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
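
A minimal sketch of the contrastive piece of such a multi-modal setup: a symmetric InfoNCE loss aligning two modality towers (say, text and image embeddings of the same advertiser). ALF's inter-sample attention, spectral normalization, and calibration are not shown; the temperature is an assumption.

```python
# Minimal sketch (PyTorch) of symmetric InfoNCE for cross-modal alignment:
# matched advertiser pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def info_nce(text_emb, image_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(len(text_emb))            # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```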

Result: State-of-the-art performance on critical tasks: a recall improvement of over 40 percentage points on one critical policy and 99.8% precision on another, with simultaneous gains in both precision and recall in production deployment.

Conclusion: ALF effectively creates unified advertiser representations that capture both content and behavioral patterns, demonstrating significant real-world impact through its novel architectural components and multi-modal approach.

Abstract: We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video, and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF demonstrates significant real-world impact by delivering simultaneous gains in both precision and recall, for instance boosting recall by over 40 percentage points on one critical policy and increasing precision to 99.8% on another. The architecture’s effectiveness stems from its novel combination of multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.

[490] Not All Tokens Are Meant to Be Forgotten

Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Douglas Zytko, Prashant Khanduri, Dongxiao Zhu

Main category: cs.LG

TL;DR: TIF framework improves LLM unlearning by selectively targeting unwanted information while preserving general knowledge, reducing over-forgetting.

DetailsMotivation: LLMs memorize private/copyrighted content, raising privacy/legal concerns. Existing unlearning methods cause over-forgetting, degrading model utility.

Method: Targeted Information Forgetting (TIF) framework: (1) targeted information identifier to distinguish unwanted vs general words, (2) Targeted Preference Optimization with Logit Preference Loss for unlearning and Preservation Loss for retaining general information.
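
The token-selective idea can be sketched with a generic stand-in loss: a standard language-model loss on general-word tokens, plus an unlikelihood-style penalty that pushes down the probability of unwanted-word tokens. This is not the paper's Logit Preference Loss, only an illustration of selective forgetting; it assumes both token types appear in the batch.

```python
# Minimal sketch (PyTorch) of token-selective unlearning: keep GW tokens with
# a normal LM loss, suppress UW tokens with an unlikelihood term -log(1 - p).
import torch
import torch.nn.functional as F

def tif_style_loss(logits, targets, uw_mask, beta=1.0):
    # logits: (seq, vocab); targets: (seq,); uw_mask: (seq,) True for UW tokens.
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    preserve = -tok_logp[~uw_mask].mean()                    # retain GW tokens
    p_uw = tok_logp[uw_mask].exp()
    forget = -torch.log1p(-p_uw.clamp(max=1 - 1e-6)).mean()  # push UW prob down
    return preserve + beta * forget
```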

Result: Extensive experiments on TOFU and MUSE benchmarks show TIF enhances unlearning effectiveness while preserving model utility, achieving state-of-the-art results.

Conclusion: TIF framework effectively addresses over-forgetting in LLM unlearning by selectively targeting unwanted information, balancing privacy protection with model utility preservation.

Abstract: Large Language Models (LLMs), pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.

[491] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kelu Yao, Sixu Lin, Litao Liu, Changqing Zou

Main category: cs.LG

TL;DR: BiTrajDiff introduces a bidirectional trajectory diffusion framework for offline RL that generates both future and history trajectories from intermediate states, addressing distribution bias through enhanced data augmentation.

DetailsMotivation: Current offline RL methods suffer from distribution bias in static datasets, limiting generalizability. Existing data augmentation techniques only reconstruct future trajectories from given states, ignoring history transitions that could reveal diverse behavior patterns and critical states leading to high-reward outcomes.

Method: BiTrajDiff decomposes trajectory generation into two independent diffusion processes: forward diffusion for predicting future dynamics and backward diffusion for tracing essential history transitions. The framework uses critical states as anchors to expand into underexplored regions of the state space.

Result: Extensive experiments on the D4RL benchmark demonstrate that BiTrajDiff achieves superior performance compared to other advanced data augmentation methods across various offline RL backbones.

Conclusion: The bidirectional approach to trajectory generation effectively addresses dataset diversity limitations in offline RL, enabling better exploration of valuable state-space regions and improving policy learning through more comprehensive data augmentation.

Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions.BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.

[492] Mathematical artificial data for operator learning

Heng Wu, Benzhuo Lu

Main category: cs.LG

TL;DR: MAD framework integrates physical laws with data-driven learning for differential equation operator discovery using physics-embedded synthetic data, eliminating need for experimental/simulated training data.

DetailsMotivation: Current machine learning approaches for solving differential equations have limitations: data-driven methods require expensive labeled datasets, while model-driven methods face efficiency-accuracy trade-offs. There's a need for a framework that combines mathematical rigor with computational efficiency.

Method: The Mathematical Artificial Data (MAD) framework exploits differential equations’ intrinsic mathematical structure to generate physics-embedded analytical solutions and associated synthetic data. This enables operator learning across multi-parameter systems without relying on experimental or simulated training data.
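
The core trick is easy to show in one dimension: sample an analytical solution directly, then read off its source term in closed form, so every (input, solution) training pair is exact and solver-free. The sine basis and coefficient distribution below are illustrative assumptions for a 1D Poisson problem.

```python
# Minimal sketch of MAD-style artificial data for -u''(x) = f(x) on [0, 1]
# with u(0) = u(1) = 0: pick u as a random sine series, then f = -u'' follows
# analytically, giving exact (f, u) pairs with no numerical solver.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 128)

def sample_pair(n_modes: int = 8):
    a = rng.normal(size=n_modes)                # random coefficients
    k = np.arange(1, n_modes + 1)
    basis = np.sin(np.outer(k, np.pi * x))      # sin(k * pi * x), shape (n, 128)
    u = (a[:, None] * basis).sum(0)                         # solution
    f = (a[:, None] * (k[:, None] * np.pi) ** 2 * basis).sum(0)  # -u'' exactly
    return f, u  # input/output pair for operator learning f -> u

dataset = [sample_pair() for _ in range(1000)]
```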

Result: MAD demonstrates generalizability and superior efficiency/accuracy across various differential equation scenarios, particularly in 2D parametric problems where both boundary values and source terms are functions. The framework handles complex parameter spaces effectively.

Conclusion: MAD represents a physics-embedded-data-driven framework with potential to become a universal paradigm for physics-informed machine intelligence in scientific computing, combining mathematical rigor with data-driven learning for efficient operator discovery.

Abstract: Machine learning has emerged as a transformative tool for solving differential equations (DEs), yet prevailing methodologies remain constrained by dual limitations: data-driven methods demand costly labeled datasets while model-driven techniques face efficiency-accuracy trade-offs. We present the Mathematical Artificial Data (MAD) framework, a new paradigm that integrates physical laws with data-driven learning to facilitate large-scale operator discovery. By exploiting DEs’ intrinsic mathematical structure to generate physics-embedded analytical solutions and associated synthetic data, MAD fundamentally eliminates dependence on experimental or simulated training data. This enables computationally efficient operator learning across multi-parameter systems while maintaining mathematical rigor. Through numerical demonstrations spanning 2D parametric problems where both the boundary values and source term are functions, we showcase MAD’s generalizability and superior efficiency/accuracy across various DE scenarios. This physics-embedded, data-driven framework, with its capacity to handle complex parameter spaces, has the potential to become a universal paradigm for physics-informed machine intelligence in scientific computing.

[493] Sampling from Gaussian Processes: A Tutorial and Applications in Global Sensitivity Analysis and Optimization

Bach Do, Nafeezat A. Ajenifuja, Taiwo A. Adebiyi, Ruda Zhang

Main category: cs.LG

TL;DR: The paper presents efficient sampling methods for Gaussian processes to enable global sensitivity analysis and optimization when high-fidelity simulations are too expensive.

DetailsMotivation: High-fidelity simulations and physical experiments are too expensive for global sensitivity analysis (GSA) and optimization tasks. While Gaussian processes serve as proxy models, direct sampling from them is computationally inefficient due to infinite-dimensional nature and large covariance matrix operations.

Method: The paper formulates and implements two sampling methods: random Fourier features and pathwise conditioning for generating posterior samples from Gaussian processes at reduced computational cost. Alternative approaches are also briefly described.
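
The random Fourier features construction is standard and compact enough to show directly: approximate an RBF-kernel GP prior sample as a finite cosine-feature expansion with Gaussian frequencies, so drawing a function costs a matrix-vector product instead of an n-by-n Cholesky factorization. (Pathwise conditioning for posterior samples builds on this but is not shown.)

```python
# Random Fourier features for an RBF kernel (Rahimi & Recht):
# f(x) ~ theta . phi(x), phi(x) = sqrt(2/D) cos(w x + b),
# w ~ N(0, 1/lengthscale^2), b ~ U(0, 2*pi), theta ~ N(0, I).
import numpy as np

rng = np.random.default_rng(1)
D, lengthscale = 512, 0.2
w = rng.normal(0.0, 1.0 / lengthscale, size=D)  # spectral frequencies (1D inputs)
b = rng.uniform(0.0, 2 * np.pi, size=D)
theta = rng.normal(size=D)                      # one prior sample = one theta

def sample_f(x):
    phi = np.sqrt(2.0 / D) * np.cos(np.outer(x, w) + b)  # (n, D) features
    return phi @ theta                                   # no n x n covariance work

x = np.linspace(0, 1, 200)
f = sample_f(x)  # an approximate draw from GP(0, RBF(lengthscale))
```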

Result: The paper demonstrates successful applications of these sampling methods through a series of numerical examples, showing how generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization.

Conclusion: Efficient sampling from Gaussian processes enables practical implementation of global sensitivity analysis and optimization tasks that would otherwise be prohibitively expensive with high-fidelity simulations, bridging a gap between machine learning methods and engineering optimization applications.

Abstract: High-fidelity simulations and physical experiments are essential for engineering analysis and design, yet their high cost often makes two critical tasks–global sensitivity analysis (GSA) and optimization–prohibitively expensive. This limitation motivates the common use of Gaussian processes (GPs) as proxy regression models that provide uncertainty-aware predictions from a limited number of high-quality observations. GPs naturally enable efficient sampling strategies that support informed decision-making under uncertainty by extracting information from a subset of possible functions for the model of interest. However, direct sampling from GPs is inefficient due to their infinite-dimensional nature and the high cost associated with large covariance matrix operations. Despite their popularity in machine learning and statistics communities, sampling from GPs has received little attention in the community of engineering optimization. In this paper, we present the formulation and detailed implementation of two notable sampling methods–random Fourier features and pathwise conditioning–for generating posterior samples from GPs at reduced computational cost. Alternative approaches are briefly described. Importantly, we detail how the generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization. We show successful applications of these sampling methods through a series of numerical examples.

[494] Learning Network Dismantling Without Handcrafted Inputs

Haozhe Tian, Pietro Ferraro, Robert Shorten, Mahdi Jalili, Homayoun Hamedmoghadam

Main category: cs.LG

TL;DR: MIND: Message Iteration Network Dismantler eliminates need for handcrafted structural features using attention and message-iteration profiles, trained on synthetic networks, and outperforms state-of-the-art methods on large real networks.

DetailsMotivation: Current message-passing GNNs rely on handcrafted structural features, which increases computational cost and introduces bias into otherwise purely data-driven network representations. The paper aims to eliminate this dependency while solving the NP-hard Network Dismantling problem.

Method: Introduces attention mechanism and message-iteration profiles, plus an algorithmic approach to generate structurally diverse training sets of small synthetic networks. Builds an expressive message-passing framework called MIND (Message Iteration Network Dismantler).

Result: MIND trained solely on synthetic networks generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods. The model shows increased efficiency and generalizability.

Conclusion: The proposed MIND framework eliminates the need for handcrafted features while effectively solving NP-hard Network Dismantling. Its efficiency and generalizability can be leveraged beyond dismantling to other complex network problems.

Abstract: The application of message-passing Graph Neural Networks has been a breakthrough for important network science problems. However, the competitive performance often relies on using handcrafted structural features as inputs, which increases computational cost and introduces bias into the otherwise purely data-driven network representations. Here, we eliminate the need for handcrafted features by introducing an attention mechanism and utilizing message-iteration profiles, in addition to an effective algorithmic approach to generate a structurally diverse training set of small synthetic networks. Thereby, we build an expressive message-passing framework and use it to efficiently solve the NP-hard problem of Network Dismantling, virtually equivalent to vital node identification, with significant real-world applications. Trained solely on diversified synthetic networks, our proposed model – MIND: Message Iteration Network Dismantler – generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods. Increased efficiency and generalizability of the proposed model can be leveraged beyond dismantling in a range of complex network problems.

[495] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

Ali Taheri, Alireza Taban, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han

Main category: cs.LG

TL;DR: The paper proposes categorizing tokens in SFT data as positive or negative, with negative tokens being explicitly forgotten to improve model performance by focusing learning on useful information.

DetailsMotivation: SFT effectiveness depends heavily on data quality and volume; poor quality data can lead to limited performance gains or even degradation. Current approaches don't adequately handle unhelpful or misleading tokens in training data.

Method: Token categorization approach that divides tokens into positive (useful for performance) and negative (lacking essential semantics or misleading). Positive tokens are trained normally, while negative tokens are explicitly forgotten through a forgetting mechanism that shapes a knowledge boundary.

Result: Experiments across diverse benchmarks using various model architectures demonstrate that the forgetting mechanism enhances model performance by helping models learn more precisely what information to focus on.

Conclusion: The proposed token categorization and forgetting approach mitigates SFT’s reliance on data quality, improves learning efficiency by focusing on informative tokens, and establishes knowledge boundaries for more precise learning.

Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise, it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts – positive and negative tokens – based on whether they are useful to improve model performance. Positive tokens can be trained in the usual way, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization keeps the model from learning less informative content, and the forgetting process shapes a knowledge boundary to guide the model on what information to learn more precisely. We conduct experiments across diverse and well-established benchmarks using various model architectures, demonstrating that this forgetting mechanism enhances model performance.

[496] Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications

Zijian Liu

Main category: cs.LG

TL;DR: The paper analyzes Online Convex Optimization algorithms in heavy-tailed gradient settings, showing classical methods achieve optimal regret without modifications like gradient clipping.

DetailsMotivation: Limited results exist for OCO when gradient estimates have heavy tails (finite p-th moment for p∈(1,2]), unlike the well-studied finite variance case. The paper aims to examine classical OCO algorithms in this challenging setting.

Method: Theoretical analysis of existing OCO algorithms (e.g., Online Gradient Descent) under heavy-tailed gradients without algorithmic modifications. Extends analysis to broader settings like smooth OCO and optimistic algorithms.
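
For concreteness, this is the kind of unmodified classical algorithm the analysis covers: projected Online Gradient Descent on a bounded domain, with no gradient clipping anywhere. The 1/sqrt(t) step size is the usual illustrative choice.

```python
# Classical projected Online Gradient Descent on an L2 ball of radius R.
# The gradient oracle may return heavy-tailed stochastic gradients; note
# that no clipping or other modification is applied.
import numpy as np

def ogd(grad_oracle, d, T, R=1.0, eta0=0.1):
    x = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        g = grad_oracle(x)                  # possibly heavy-tailed gradient
        x = x - eta0 / np.sqrt(t) * g       # standard 1/sqrt(t) step
        norm = np.linalg.norm(x)
        if norm > R:                        # project back onto the ball
            x = x * (R / norm)
        iterates.append(x.copy())
    return iterates
```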

Result: Establishes new optimal regret bounds for classical OCO methods under heavy-tailed gradients without requiring gradient clipping. These bounds are optimal in all parameters and don’t require knowledge of p.

Conclusion: OCO with heavy tails can be solved effectively using classical algorithms without extra operations like gradient clipping. The results enable first provable optimal convergence for nonsmooth nonconvex optimization under heavy-tailed noise.

Abstract: In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by it, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable and optimal convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping. Furthermore, we explore broader settings (e.g., smooth OCO) and extend our ideas to optimistic algorithms to handle different cases simultaneously.

[497] CrystalDiT: A Diffusion Transformer for Crystal Generation

Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao

Main category: cs.LG

TL;DR: CrystalDiT is a simple diffusion transformer for crystal structure generation that outperforms complex methods by treating lattice and atomic properties as a unified system, achieving state-of-the-art 8.78% SUN rate on MP-20.

DetailsMotivation: To challenge the trend of architectural complexity in crystal structure generation by showing that carefully designed simple architectures can outperform sophisticated alternatives, especially in data-limited scientific domains where complex models are prone to overfitting.

Method: Uses a unified diffusion transformer with a powerful inductive bias that treats lattice and atomic properties as a single interdependent system, combined with periodic table-based atomic representation and balanced training strategy.

Result: Achieves 8.78% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming FlowMM (4.21%) and MatterGen (3.66%). Generates 63.28% unique and novel structures while maintaining comparable stability rates.

Conclusion: Architectural simplicity can be more effective than complexity for materials discovery, especially in data-limited scientific domains where simple, carefully designed architectures outperform sophisticated alternatives prone to overfitting.

Abstract: We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 8.78% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.21%) and MatterGen (3.66%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.

[498] Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature

V. S. Usatyuk, D. A. Sapoznikov, S. I. Egorov

Main category: cs.LG

TL;DR: Physics-inspired graph-based classifier using Ising spins on QC-LDPC graphs achieves high accuracy with extreme feature compression for multi-class image classification.

DetailsMotivation: Modern CNN feature vectors are computationally expensive and obscure data geometry, while conventional graph-based classifiers degrade on natural multi-class images due to failure to preserve separability on complex feature manifolds.

Method: Treat frozen MobileNetV2 embeddings as Ising spins on sparse Multi-Edge Type QC-LDPC graphs forming Random Bond Ising Model. System tuned to Nishimori temperature where smallest Bethe-Hessian eigenvalue vanishes. Two innovations: spectral-topological correspondence linking graph trapping sets to invariants via Ihara-Bass zeta function, and quadratic-Newton estimator for Nishimori temperature.
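
The tuning criterion can be sketched with the standard unweighted Bethe-Hessian H(r) = (r^2 - 1) I + D - r A: locate the r at which its smallest eigenvalue crosses zero. The paper works with a weighted RBIM variant and a much faster quadratic-Newton/Arnoldi estimator; plain bisection over dense eigensolves is shown here only to make the criterion concrete.

```python
# Vanishing-eigenvalue criterion on the standard (unweighted) Bethe-Hessian.
# A is a dense symmetric adjacency matrix; the paper's weighted RBIM form and
# its quadratic-Newton estimator are not reproduced here.
import numpy as np

def bethe_hessian(A, r):
    D = np.diag(A.sum(axis=1))
    n = A.shape[0]
    return (r * r - 1.0) * np.eye(n) + D - r * A

def smallest_eig(A, r):
    return np.linalg.eigvalsh(bethe_hessian(A, r)).min()

def find_r_star(A, lo=1.0 + 1e-6, hi=10.0, tol=1e-6):
    # Assumes smallest_eig changes sign on (lo, hi); bisect to the crossing.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if smallest_eig(A, mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```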

Result: Achieves 98.7% top-1 accuracy on ImageNet-10 and 84.92% on ImageNet-100 with three-graph soft ensemble. Compresses 1280-dimensional MobileNetV2 features to 32 dimensions for ImageNet-10 and 64 for ImageNet-100. Hard ensemble increases top-1 by 0.1% while cutting FLOPs by 2.67× vs MobileNetV2; soft ensemble reduces FLOPs by 29× vs ResNet50 with only 1.09% top-1 drop.

Conclusion: Topology-guided LDPC embedding produces highly compressed, accurate classifiers for resource-constrained deployment, with novel contributions in linking trapping sets to topological defects, efficient Nishimori temperature estimation, and demonstrating practical compression-performance tradeoffs.

Abstract: Modern multi-class image classification relies on high-dimensional CNN feature vectors, which are computationally expensive and obscure the underlying data geometry. Conventional graph-based classifiers degrade on natural multi-class images because typical graphs fail to preserve separability on feature manifolds with complex topology. We address this with a physics-inspired pipeline: frozen MobileNetV2 embeddings are treated as Ising spins on a sparse Multi-Edge Type QC-LDPC graph, forming a Random Bond Ising Model. The system is tuned to its Nishimori temperature, identified where the smallest Bethe-Hessian eigenvalue vanishes. Our method rests on two innovations: we prove a spectral-topological correspondence linking graph trapping sets to invariants via the Ihara-Bass zeta function (removing these structures boosts top-1 accuracy over four-fold in multi-class settings), and we develop a quadratic-Newton estimator for the Nishimori temperature that converges in around 9 Arnoldi iterations for a 6-times speedup, enabling spectral embedding at scales like ImageNet-100. The resulting graphs compress 1280-dimensional MobileNetV2 features to 32 dimensions for ImageNet-10 and 64 for ImageNet-100. We achieve 98.7% top-1 accuracy on ImageNet-10 and 84.92% on ImageNet-100 with a three-graph soft ensemble. Versus MobileNetV2, our hard ensemble increases top-1 by 0.1% while cutting FLOPs by 2.67 times; compared to ResNet50, the soft ensemble drops top-1 by only 1.09% yet reduces FLOPs by 29 times. The novelty lies in (a) rigorously linking trapping sets to topological defects, (b) an efficient Nishimori temperature estimator, and (c) demonstrating that topology-guided LDPC embedding produces highly compressed, accurate classifiers for resource-constrained deployment.

[499] Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking

Xingchen Wang, Feijie Wu, Chenglin Miao, Tianchun Li, Haoyu Hu, Qiming Cao, Jing Gao, Lu Su

Main category: cs.LG

TL;DR: PM-SFL is a privacy-preserving Split Federated Learning framework that uses probabilistic mask training to protect against data reconstruction attacks while maintaining model performance, with personalized masks for data heterogeneity and layer-wise knowledge compensation for system heterogeneity.

DetailsMotivation: Traditional Split Federated Learning (SFL) reduces client computation but introduces privacy risks from exchanging intermediate activations and model updates, especially data reconstruction attacks. Existing noise-based defenses degrade model performance, creating a need for a solution that protects privacy without sacrificing utility.

Method: PM-SFL incorporates Probabilistic Mask training to add structured randomness without explicit noise injection. It uses personalized mask learning to adapt submodel structures to each client’s local data for data heterogeneity, and a layer-wise knowledge compensation mechanism for system heterogeneity, enabling adaptive model splitting for clients with varying resources.
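
A minimal sketch of probabilistic mask training on one linear layer: each weight carries a learnable score, a binary mask is sampled from Bernoulli(sigmoid(score)), and a straight-through estimator lets gradients reach the scores. PM-SFL's split-learning protocol, personalized masks, and knowledge compensation are not shown.

```python
# Minimal sketch (PyTorch) of a probabilistic-mask linear layer: structured
# randomness comes from sampling the mask, not from injected noise.
import torch
import torch.nn as nn

class ProbMaskLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.scores = nn.Parameter(torch.zeros(out_f, in_f))  # mask logits

    def forward(self, x):
        p = torch.sigmoid(self.scores)
        m = torch.bernoulli(p)      # sampled binary mask (structured randomness)
        m = m + p - p.detach()      # straight-through: forward uses m, grad uses p
        return x @ (self.weight * m).t()
```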

Result: Theoretical analysis confirms privacy protection, and experiments on image and wireless sensing tasks show PM-SFL consistently improves accuracy, communication efficiency, and robustness to privacy attacks, with particularly strong performance under data and system heterogeneity conditions.

Conclusion: PM-SFL provides an effective solution to privacy risks in SFL while maintaining model utility, addressing both data and system heterogeneity challenges through its probabilistic mask training, personalized mask learning, and layer-wise knowledge compensation mechanisms.

Abstract: Split Federated Learning (SFL) has emerged as an efficient alternative to traditional Federated Learning (FL) by reducing client-side computation through model partitioning. However, the exchange of intermediate activations and model updates introduces significant privacy risks, especially from data reconstruction attacks that recover original inputs from intermediate representations. Existing defenses using noise injection often degrade model performance. To overcome these challenges, we present PM-SFL, a scalable and privacy-preserving SFL framework that incorporates Probabilistic Mask training to add structured randomness without relying on explicit noise. This mitigates data reconstruction risks while maintaining model utility. To address data heterogeneity, PM-SFL employs personalized mask learning that tailors submodel structures to each client’s local data. For system heterogeneity, we introduce a layer-wise knowledge compensation mechanism, enabling clients with varying resources to participate effectively under adaptive model splitting. Theoretical analysis confirms its privacy protection, and experiments on image and wireless sensing tasks demonstrate that PM-SFL consistently improves accuracy, communication efficiency, and robustness to privacy attacks, with particularly strong performance under data and system heterogeneity.

[500] Personalized Enhanced Federated Multi-View Clustering via Heat-Kernel Tensor Decomposition

Kristina P. Sinaga

Main category: cs.LG

TL;DR: This paper introduces novel mathematical frameworks for multi-view clustering in federated learning, using quantum-inspired heat-kernel metrics and tensor decomposition methods to address data heterogeneity and privacy concerns.

DetailsMotivation: The motivation is to address challenges in multi-view clustering within federated learning environments, specifically dealing with data heterogeneity, privacy preservation, and efficient communication while maintaining clustering efficacy.

Method: The method integrates optimization techniques using heat-kernel coefficients as quantum-inspired distance metrics, combined with advanced tensor decomposition methods (PARAFAC2 and Tucker decomposition) to represent high-dimensional multi-view data while preserving inter-view relationships.

Result: The research developed four novel algorithms: E-FKMVC, FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person (Personalized FedHK-PARAFAC2), with theoretical analyses providing convergence guarantees, privacy bounds, and complexity validation.

Conclusion: The paper makes significant contributions to federated multi-view clustering through innovative mathematical modeling and algorithm design, addressing critical challenges of data heterogeneity and privacy concerns for enhanced data management and analytics.

Abstract: This paper introduces mathematical frameworks that address the challenges of multi-view clustering in federated learning environments. The objective is to integrate optimization techniques based on new objective functions employing heat-kernel coefficients to replace conventional distance metrics with quantum-inspired measures. The proposed frameworks utilize advanced tensor decomposition methods, specifically PARAFAC2 and Tucker decomposition, to efficiently represent high-dimensional, multi-view data while preserving inter-view relationships. The research has yielded the development of four novel algorithms: an efficient federated kernel multi-view clustering (E-FKMVC) model, FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person with PARAFAC2 Decomposition (Personalized FedHK-PARAFAC2). The primary objective of these algorithms is to enhance the efficacy of clustering processes while ensuring confidentiality and efficient communication in federated learning environments. Theoretical analyses of convergence guarantees, privacy bounds, and complexity are provided to validate the effectiveness of the proposed methods. In essence, this paper makes a significant academic contribution to the field of federated multi-view clustering through its innovative integration of mathematical modeling and algorithm design. This approach addresses the critical challenges of data heterogeneity and privacy concerns, paving the way for enhanced data management and analytics in various contexts.

[501] Graph Learning is Suboptimal in Causal Bandits

Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash

Main category: cs.LG

TL;DR: Learning causal parent sets is suboptimal for regret minimization in causal bandits; parent identification conflicts with regret minimization, and bypassing graph recovery leads to better performance.

DetailsMotivation: Previous work on causal bandits focused on identifying reward's parents or jointly learning parents while minimizing regret. The paper investigates whether these strategies are optimal, questioning if parent identification is necessary for effective regret minimization.

Method: Proves theoretical results showing regret minimization and parent identification are fundamentally conflicting objectives. Analyzes both known and unknown parent set size regimes, establishes novel regret lower bounds capturing combinatorial structure of action space. Proposes nearly optimal algorithms that bypass graph and parent recovery entirely.

Result: Shows learning parent set is suboptimal for regret minimization. Demonstrates existence of instances where regret minimization and parent identification conflict. Establishes novel regret lower bounds. Experimental results confirm large performance gap between proposed method and existing baselines across various environments.

Conclusion: Parent identification is unnecessary for regret minimization in causal bandits. Bypassing graph and parent recovery leads to nearly optimal algorithms with superior performance compared to existing approaches that focus on parent learning.

Abstract: We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward’s parents and then applying classic bandit methods to them, or jointly learning the parents while minimizing regret. We investigate whether such strategies are optimal. Somewhat counterintuitively, our results show that learning the parent set is suboptimal. We do so by proving that there exist instances where regret minimization and parent identification are fundamentally conflicting objectives. We further analyze both the known and unknown parent set size regimes, establish novel regret lower bounds that capture the combinatorial structure of the action space. Building on these insights, we propose nearly optimal algorithms that bypass graph and parent recovery, demonstrating that parent identification is indeed unnecessary for regret minimization. Experiments confirm that there exists a large performance gap between our method and existing baselines in various environments.

[502] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

Yoonho Lee, Joseph Boen, Chelsea Finn

Main category: cs.LG

TL;DR: Feedback Descent is a framework that optimizes text artifacts (prompts, code, molecules) using structured textual feedback instead of scalar rewards, enabling directed optimization in text space without modifying model weights.

DetailsMotivation: Traditional preference learning methods compress detailed critiques into binary preferences or scalar rewards, creating an information bottleneck. The authors aim to preserve rich textual feedback to enable more effective optimization of text artifacts.

Method: The framework uses evaluators that provide structured textual feedback alongside comparisons, which in-context learning transforms into gradient-like directional information for targeted edits. Optimization occurs purely at inference time without weight updates, making it task-agnostic.
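
The inference-time loop itself is simple enough to sketch. Here `generate` and `evaluate` are hypothetical stand-ins for LLM calls (an editor that applies a critique, and a judge that returns a preference plus textual feedback); no model weights are touched anywhere in the loop.

```python
# Minimal sketch of the Feedback Descent loop. Both callables are hypothetical:
#   generate(artifact, critique) -> edited artifact (in-context targeted edit)
#   evaluate(old, new) -> ("old" | "new", textual critique of the loser)
def feedback_descent(initial: str, generate, evaluate, steps: int = 10) -> str:
    best, critique = initial, ""
    for _ in range(steps):
        candidate = generate(best, critique)   # critique acts like a gradient step
        preferred, critique = evaluate(best, candidate)
        if preferred == "new":
            best = candidate                   # keep improvement, carry feedback
    return best
```

The key design point is that the critique string, not a scalar score, carries the directional information from one iteration to the next.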

Result: Feedback Descent outperforms state-of-the-art methods including GEPA (prompt optimization), GRPO/REINVENT (reinforcement learning), and specialized graph-based molecular optimizers. On the DOCKSTRING benchmark, it discovers novel drug-like molecules exceeding the 99.9th percentile of a 260,000+ compound database across six protein targets.

Conclusion: Preserving structured textual feedback rather than compressing it to binary preferences enables more effective optimization of text artifacts, demonstrating superior performance across diverse domains including prompt optimization, code generation, and molecular discovery.

Abstract: We introduce \textit{Feedback Descent}, a framework that optimizes text artifacts – prompts, code, and molecules – through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the $99.9$th percentile of a database with more than $260{,}000$ compounds across six protein targets.

[503] Complex variational autoencoders admit Kähler structure

Andrew Gracyk

Main category: cs.LG

TL;DR: Complex VAEs reveal Kähler geometric structure in latent space, with efficient computation of Fisher information metric via Kähler potential derivatives, enabling regularization and smoother representations.

DetailsMotivation: To extend Riemannian structure arguments from latent-Euclidean VAEs to complex VAEs with complex latent stages, revealing Kähler geometric structure and enabling efficient computation of Fisher information metrics for decoder geometry.

Method: Derive Fisher information metric for complex VAEs under latent complex Gaussian with trivial relation matrix, propose Kähler potential derivative of complex Gaussian mixtures as efficient proxy to Fisher metric, leverage law of total covariance to bridge potential and metric behavior, and use decoder geometry for latent space regularization with weighted complex volume element sampling.
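
Two standard identities anchor this construction, stated here in illustrative notation (the indices and symbols are ours, not the paper's): the Fisher information metric as the Hessian of the KL divergence at coincident parameters, and the Kähler metric as mixed second derivatives of a potential.

```latex
% Fisher information as the KL Hessian at coincident points:
g_{ij}(\theta) \;=\; \left.\frac{\partial^{2}}{\partial \theta'_i\,\partial \theta'_j}\,
D_{\mathrm{KL}}\!\left(p_{\theta}\,\middle\|\,p_{\theta'}\right)\right|_{\theta'=\theta}
% Kaehler metric from a potential K on complex latent coordinates z:
g_{i\bar{j}} \;=\; \frac{\partial^{2} K}{\partial z_i\,\partial \bar{z}_j}
```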

Result: Complex VAEs demonstrate Kähler geometric structure, efficient computation of Fisher metric via Kähler potential (avoiding large-scale automatic differentiation), and regularization strategies yield consistently smoother representations with fewer semantic outliers at the cost of sample variation.

Conclusion: Complex VAEs naturally admit Kähler geometry, enabling efficient geometric regularization of latent space through decoder geometry and weighted sampling, leading to improved representation quality despite reduced sample variation.

Abstract: It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments to complex VAEs with a complex latent stage. We show that complex VAEs reveal, to some degree, Kähler geometric structure. Our methods are tailored to decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that acts as a rough proxy to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and, because our potential is valid as a plurisubharmonic (PSH) function, the large-scale computational burden of automatic differentiation is displaced to small scale. Our methods leverage the law of total covariance to bridge behavior between our potential and the Fisher metric. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate that these strategies, at the cost of sample variation, yield consistently smoother representations and fewer semantic outliers.

[504] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Malyaban Bal, Abhronil Sengupta

Main category: cs.LG

TL;DR: GRASP is a lightweight PEFT method that groups token representations and learns shared scaling/shifting vectors, reducing parameters significantly. StochGRASP adds probabilistic parameterization for hardware noise robustness.

DetailsMotivation: Parameter-efficient fine-tuning (PEFT) needs to be more scalable and hardware-aware for deployment on edge-based AI hardware with non-ideal inference conditions and hardware-level variability.

Method: GRASP partitions D-dimensional token representations into K ≪ D groups and learns shared scaling/shifting vectors for each group. StochGRASP extends this with Gaussian distributions as perturbations to pre-trained weights and a noise-aware loss function.

Result: GRASP matches or exceeds established PEFT methods (LoRA, BitFit) while achieving order-of-magnitude parameter reduction. StochGRASP consistently outperforms deterministic variants under varying noise levels.

Conclusion: GRASP provides highly parameter-efficient fine-tuning, while StochGRASP enables robust deployment on energy-efficient, noise-prone hardware platforms through probabilistic parameterization.

Abstract: Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP - GRouped Activation Shared Parameterization - a lightweight PEFT framework that partitions the D-dimensional token representations of selected layers into K ≪ D groups and learns a shared scaling and shifting vector for each group. This grouped modulation reduces the number of trainable parameters significantly while preserving the ability of the model to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization, along with a noise-aware loss function formulation, enables modelling hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions, an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base & RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
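
A minimal sketch of the grouped modulation idea, under our own assumptions about shapes and placement (the paper's exact initialization and choice of layers may differ):

```python
import torch
import torch.nn as nn

class GRASPModulation(nn.Module):
    """Grouped scaling/shifting as summarized above (illustrative sketch).

    Partitions a D-dimensional token representation into K groups and
    learns one shared (scale, shift) pair per group, so only 2*K
    parameters are trained instead of 2*D.
    """

    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        assert dim % num_groups == 0, "dim must be divisible by num_groups"
        self.group_size = dim // num_groups
        self.scale = nn.Parameter(torch.ones(num_groups))   # shared per group
        self.shift = nn.Parameter(torch.zeros(num_groups))  # shared per group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); broadcast each group's scalar pair
        scale = self.scale.repeat_interleave(self.group_size)
        shift = self.shift.repeat_interleave(self.group_size)
        return x * scale + shift

mod = GRASPModulation(dim=768, num_groups=16)
out = mod(torch.randn(2, 10, 768))
```

With D = 768 and K = 16 this layer trains 32 parameters, versus 1,536 for a full per-dimension scale-and-shift, which is where the order-of-magnitude reduction comes from.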

[505] A Context-Aware Temporal Modeling through Unified Multi-Scale Temporal Encoding and Hierarchical Sequence Learning for Single-Channel EEG Sleep Staging

Amirali Vakili, Salar Jahanshiri, Armin Salimi-Badr

Main category: cs.LG

TL;DR: Proposed context-aware interpretable framework for single-channel EEG sleep staging achieves 89.72% accuracy with improved N1 stage detection (61.7% F1-score) using multi-scale features, temporal modeling, and imbalance handling.

DetailsMotivation: Sleep disorders are globally prevalent, requiring automated sleep staging. Single-channel EEG is practical but existing methods suffer from class imbalance (especially N1 stage), limited receptive-field modeling, and lack of interpretability in black-box models.

Method: Combines compact multi-scale feature extraction with temporal modeling to capture local and long-range dependencies. Uses class-weighted loss functions and data augmentation for imbalance handling. Segments EEG into sub-epoch chunks and averages softmax probabilities for contextual representation and robustness.

Result: Achieves 89.72% overall accuracy and 85.46% macro-average F1-score on SleepEDF datasets. Notably achieves 61.7% F1-score for challenging N1 stage, showing substantial improvement over previous methods.

Conclusion: The proposed context-aware interpretable framework effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications, with particular success in detecting the difficult N1 stage.

Abstract: Automatic sleep staging is a critical task in healthcare due to the global prevalence of sleep disorders. This study focuses on single-channel electroencephalography (EEG), a practical and widely available signal for automatic sleep staging. Existing approaches face challenges such as class imbalance, limited receptive-field modeling, and insufficient interpretability. This work proposes a context-aware and interpretable framework for single-channel EEG sleep staging, with particular emphasis on improving detection of the N1 stage. Many prior models operate as black boxes with stacked layers, lacking clearly defined and interpretable feature extraction roles. The proposed model combines compact multi-scale feature extraction with temporal modeling to capture both local and long-range dependencies. To address data imbalance, especially in the N1 stage, class-weighted loss functions and data augmentation are applied. EEG signals are segmented into sub-epoch chunks, and final predictions are obtained by averaging softmax probabilities across chunks, enhancing contextual representation and robustness. The proposed framework achieves an overall accuracy of 89.72% and a macro-average F1-score of 85.46%. Notably, it attains an F1-score of 61.7% for the challenging N1 stage, demonstrating a substantial improvement over previous methods on the SleepEDF datasets. These results indicate that the proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications.
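
A sketch of the chunk-and-average inference step described above; the chunk count, tensor shapes, and classifier interface are illustrative assumptions, not the paper's exact settings:

```python
import torch

def predict_epoch(model, epoch_signal: torch.Tensor, num_chunks: int = 5):
    """Split a single-channel EEG epoch into sub-epoch chunks, classify
    each, and average the softmax probabilities for a contextual,
    more robust stage prediction."""
    chunks = torch.chunk(epoch_signal, num_chunks, dim=-1)   # along time axis
    probs = [torch.softmax(model(c), dim=-1) for c in chunks]
    avg = torch.stack(probs).mean(dim=0)                     # averaged distribution
    return avg.argmax(dim=-1)                                # predicted sleep stage
```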

[506] Evaluating Parameter Efficient Methods for RLVR

Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu

Main category: cs.LG

TL;DR: PEFT methods evaluation in RLVR shows structural variants outperform standard LoRA, SVD-based methods fail due to spectral collapse, and extreme parameter reduction hurts reasoning.

DetailsMotivation: While PEFT methods like LoRA are commonly used in RLVR (Reinforcement Learning with Verifiable Rewards), the optimal PEFT architecture for RLVR remains unknown, creating a need for systematic evaluation to identify the best approaches.

Method: Comprehensive evaluation of over 12 PEFT methodologies on DeepSeek-R1-Distill models using mathematical reasoning benchmarks, including structural variants (DoRA, AdaLoRA, MiSS) and SVD-informed methods (PiSSA, MiLoRA), with ablation studies and scaling experiments.

Result: Three key findings: 1) Structural variants consistently outperform standard LoRA; 2) SVD-informed methods fail due to spectral collapse from misalignment between principal-component updates and RL optimization; 3) Extreme parameter reduction (VeRA, Rank-1) severely bottlenecks reasoning capacity.

Conclusion: The work challenges default adoption of standard LoRA in RLVR and provides a definitive guide advocating for more exploration of parameter-efficient RL methods, highlighting that structural variants are superior while SVD-based approaches and extreme parameter reduction are problematic.

Abstract: We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide and advocates further exploration of parameter-efficient RL methods.

[507] ISOPO: Proximal policy gradients without pi-old

Nilin Abrahamsen

Main category: cs.LG

TL;DR: ISOPO is an efficient single-step method for approximating natural policy gradient, contrasting with multi-step proximal policy methods like GRPO/CISPO that use importance ratio clipping.

DetailsMotivation: Existing natural policy gradient approximations require multiple gradient steps with importance ratio clipping, which can be computationally expensive. ISOPO aims to provide an efficient single-step alternative.

Method: ISOPO normalizes log-probability gradients in the Fisher metric before contracting with advantages. A variant transforms microbatch advantages using neural tangent kernel in each layer, applied layer-wise in a single backward pass.

Result: ISOPO achieves natural policy gradient approximation with negligible computational overhead compared to vanilla REINFORCE, requiring only a single gradient step.

Conclusion: ISOPO provides an efficient single-step method for approximating natural policy gradients, offering computational advantages over existing multi-step proximal policy optimization methods.

Abstract: This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
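
A minimal sketch of ISOPO's simplest form as summarized above, in our own reading rather than the author's code: each sequence's log-probability gradient is normalized before being contracted with its advantage. We use the Euclidean norm as a cheap stand-in for the Fisher-metric normalization, and `policy.log_prob` is an assumed helper:

```python
import torch

def isopo_update(policy, sequences, advantages, lr: float = 1e-5):
    """Single-step proximal-style policy update via normalized gradients."""
    params = tuple(policy.parameters())
    total = [torch.zeros_like(p) for p in params]
    for seq, adv in zip(sequences, advantages):
        logp = policy.log_prob(seq)                      # scalar log pi(seq); assumed helper
        grads = torch.autograd.grad(logp, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-8
        for t, g in zip(total, grads):
            t.add_(adv * g / norm)                       # unit-length contribution per sequence
    with torch.no_grad():
        for p, t in zip(params, total):
            p.add_(lr * t)                               # a single gradient step, no pi-old
    return policy
```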

[508] End-to-End Test-Time Training for Long Context

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun

Main category: cs.LG

TL;DR: TTT-E2E: A test-time training approach for long-context language modeling that uses standard Transformers with sliding-window attention, compressing context into model weights through continual learning at inference time.

DetailsMotivation: The paper addresses long-context language modeling by reframing it as a continual learning problem rather than an architecture design challenge. Current approaches often require specialized architectures, but this work aims to achieve long-context capabilities using standard Transformers.

Method: Uses a standard Transformer with sliding-window attention that continues learning at test time via next-token prediction, compressing context into model weights. Improves initialization via meta-learning during training. The approach is end-to-end both at test time (next-token prediction) and training time (meta-learning).

Result: For 3B models trained with 164B tokens, TTT-E2E scales with context length similarly to Transformers with full attention, outperforming Mamba 2 and Gated DeltaNet. It achieves constant inference latency regardless of context length (like RNNs), making it 2.7× faster than full attention for 128K context.

Conclusion: Long-context language modeling can be effectively approached as a continual learning problem using standard architectures with test-time training. The TTT-E2E method combines the scaling benefits of full attention with the efficiency of sliding-window approaches, offering a promising direction for efficient long-context modeling.

Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
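
A sketch of the test-time side of the method as we read it: the model keeps doing next-token prediction on the context it is given, window by window, before answering. The windowing, loss, and update schedule here are illustrative assumptions, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, context_tokens: torch.Tensor, optimizer, window: int = 1024):
    """Continue learning on the given context via next-token prediction,
    compressing what the model reads into its weights."""
    model.train()
    for start in range(0, context_tokens.size(1) - 1, window):
        chunk = context_tokens[:, start : start + window + 1]
        logits = model(chunk[:, :-1])                 # sliding-window attention inside
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunk[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()                               # one inner update per window
        optimizer.step()
    model.eval()                                      # now answer queries on the adapted weights
```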

cs.MA

[509] Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models

Shaurya Mallampati, Rashed Shelim, Walid Saad, Naren Ramakrishnan

Main category: cs.MA

TL;DR: LLM agents with game-theoretic reasoning and real-time adaptation achieve 26% better performance than PPO in dynamic multi-agent cooperative environments.

DetailsMotivation: While LLMs show strong reasoning in static tasks, their capabilities in dynamic, real-time multi-agent scenarios like cooperative gameplay remain unexplored. The paper aims to bridge this gap by combining LLM-driven agents with strategic reasoning in continuously adapting environments.

Method: Proposes a framework combining LLM-driven agents with game-theoretic principles (belief consistency, Nash equilibrium) for dynamic multi-agent coordination. Includes real-time strategy refinement and adaptive feedback mechanisms that allow agents to continuously adjust policies based on immediate contextual interactions.

Result: Achieves up to 26% improvement in return over PPO baselines in high-noise environments while maintaining real-time latency under 1.05 milliseconds. Improves collaboration efficiency, task completion rates, and flexibility in dynamic cooperative settings.

Conclusion: Game-theoretic guidance integrated with real-time feedback enhances LLM performance in dynamic multi-agent systems, fostering more resilient and flexible strategic agents capable of real-time adaptation in cooperative environments.

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities across mathematical, strategic, and linguistic tasks, yet little is known about how well they reason in dynamic, real-time, multi-agent scenarios, such as collaborative environments in which agents continuously adapt to each other’s behavior, as in cooperative gameplay settings. In this paper, we bridge this gap by combining LLM-driven agents with strategic reasoning and real-time adaptation in cooperative, multi-agent environments grounded in game-theoretic principles such as belief consistency and Nash equilibrium. The proposed framework applies broadly to dynamic scenarios in which agents coordinate, communicate, and make decisions in response to continuously changing conditions. We provide real-time strategy refinement and adaptive feedback mechanisms that enable agents to dynamically adjust policies based on immediate contextual interactions, in contrast to previous efforts that evaluate LLM capabilities in static or turn-based settings. Empirical results show that our method achieves up to a 26% improvement in return over PPO baselines in high-noise environments, while maintaining real-time latency under 1.05 milliseconds. Our approach improves collaboration efficiency, task completion rates, and flexibility, illustrating that game-theoretic guidance integrated with real-time feedback enhances LLM performance, ultimately fostering more resilient and flexible strategic multi-agent systems.

cs.MM

[510] GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, Hongtao Xie

Main category: cs.MM

TL;DR: GestureHYDRA: A hybrid-modality diffusion transformer system for generating co-speech gestures with semantically explicit hand gestures, using a cascaded retrieval-augmented generation strategy for better semantic gesture activation.

DetailsMotivation: Most previous co-speech gesture synthesis works neglect hand gestures with explicit and essential semantics that deliver instructional information. The paper aims to generate co-speech gestures with specific hand gesture activation that carries more meaningful information than common body movements.

Method: 1) Build a high-quality 3D human body movement dataset with semantically explicit hand gestures from live streamers. 2) Develop GestureHYDRA - a hybrid-modality diffusion transformer architecture with motion-style injective transformer layers. 3) Implement cascaded retrieval-augmented generation using a semantic gesture repository and adaptive audio-gesture synchronization mechanism.

Result: Quantitative and qualitative experiments show superior performance over all counterparts. The system achieves better semantic gesture activation and production efficiency through the proposed architecture and strategies.

Conclusion: The proposed GestureHYDRA system effectively generates co-speech gestures with semantically explicit hand gestures, addressing the gap in previous works that neglected meaningful hand gesture semantics. The hybrid-modality approach and retrieval-augmented strategy significantly improve gesture modeling and activation capabilities.

Abstract: While increasing attention has been paid to co-speech gesture synthesis, most previous works neglect to investigate hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system GestureHYDRA built upon a hybrid-modality diffusion transformer architecture with novelly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all the counterparts. The project page can be found at https://mumuwei.github.io/GestureHYDRA/.

eess.AS

[511] Regularized autoregressive modeling and its application to audio signal reconstruction

Ondřej Mokrý, Pavel Rajmic

Main category: eess.AS

TL;DR: Proposes a generic framework for regularizing/constraining autoregressive models in signal processing, with applications to audio declipping and dequantization.

DetailsMotivation: Existing AR modeling approaches lack an encompassing framework for regularization/constraints, despite attempts to incorporate prior information or achieve numerical stabilization.

Method: Develops a comprehensive modeling framework with related optimization problem and algorithm, discusses computational demands and convergence improvements.

Result: Method shows competitiveness in musical signal declipping and superiority in speech declipping compared to state-of-the-art, including a heuristic GLP algorithm.

Conclusion: Proposed framework successfully addresses limitations of existing AR regularization approaches and demonstrates practical effectiveness in audio restoration tasks.

Abstract: Autoregressive (AR) modeling is invaluable in signal processing, in particular in speech and audio fields. Attempts in the literature can be found that regularize or constrain either the time-domain signal values or the AR coefficients, which is done for various reasons, including the incorporation of prior information or numerical stabilization. Although these attempts are appealing, an encompassing and generic modeling framework is still missing. We propose such a framework and the related optimization problem and algorithm. We discuss the computational demands of the algorithm and explore the effects of various improvements on its convergence speed. In the experimental part, we demonstrate the usefulness of our approach on the audio declipping and dequantization problems. We compare its performance against state-of-the-art methods and demonstrate the competitiveness of the proposed method in declipping musical signals, and its superiority in declipping speech. The evaluation includes a heuristic algorithm of generalized linear prediction (GLP), a strong competitor which has only been presented as a patent and is new in the scientific community.
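
As a toy instance of the kind of constraint this framework generalizes, here is ridge-regularized least-squares estimation of AR coefficients (our example; the paper's framework admits far more general penalties on both the time-domain signal and the coefficients):

```python
import numpy as np

def fit_ar_ridge(y: np.ndarray, order: int, lam: float = 1e-2) -> np.ndarray:
    """Fit AR coefficients a by solving
         min_a ||Y - X a||^2 + lam * ||a||^2,
    where row t of X holds the `order` past samples preceding y[t]."""
    X = np.column_stack(
        [y[order - k - 1 : len(y) - k - 1] for k in range(order)]
    )
    Y = y[order:]
    A = X.T @ X + lam * np.eye(order)   # ridge term stabilizes the normal equations
    return np.linalg.solve(A, X.T @ Y)

a = fit_ar_ridge(np.random.randn(1000), order=8)
```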

eess.IV

[512] Leveraging Machine Learning for Early Detection of Lung Diseases

Bahareh Rahmani, Harsha Reddy Bindela, Rama Kanth Reddy Gosula, Krishna Yedubati, Mohammad Amir Salari, Leslie Hinyard, Payam Norouzzadeh, Eli Snir, Martin Schoen

Main category: eess.IV

TL;DR: Deep learning models (CNNs, VGG16, InceptionV3, EfficientNetB0) achieve high accuracy for diagnosing respiratory diseases from chest X-rays, enabling rapid, non-invasive diagnostics in resource-limited areas.

DetailsMotivation: To create rapid, accurate, non-invasive diagnostic solutions for respiratory diseases (COVID-19, lung cancer, pneumonia) that can improve patient outcomes, especially in areas with limited access to radiologists and healthcare resources.

Method: Combined traditional image processing with advanced neural networks; trained and validated multiple deep learning models including CNNs, VGG16, InceptionV3, and EfficientNetB0 on chest X-ray images.

Result: Models achieved high accuracy, precision, recall, and F1 scores, demonstrating reliability and potential for real-world diagnostic applications.

Conclusion: Deep learning methods can significantly enhance respiratory disease diagnosis from chest X-rays, establishing a predictive and preventive healthcare paradigm with practical applications in resource-limited settings.

Abstract: A combination of traditional image processing methods with advanced neural networks underpins a predictive and preventive healthcare paradigm. This study offers rapid, accurate, and non-invasive diagnostic solutions that can significantly impact patient outcomes, particularly in areas with limited access to radiologists and healthcare resources. In this project, deep learning methods are applied to enhance the diagnosis of respiratory diseases such as COVID-19, lung cancer, and pneumonia from chest X-rays. We trained and validated various neural network models, including CNNs, VGG16, InceptionV3, and EfficientNetB0, with high accuracy, precision, recall, and F1 scores that highlight the models’ reliability and potential in real-world diagnostic applications.
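
A hedged sketch of the transfer-learning setup the summary describes: start from an ImageNet-pretrained backbone and swap the classifier head for the X-ray classes. The class list and training details are our placeholders, and the paper also evaluates VGG16, InceptionV3, and a plain CNN:

```python
import torch
import torchvision

num_classes = 4  # e.g. normal / COVID-19 / pneumonia / lung cancer (assumed labels)
model = torchvision.models.efficientnet_b0(weights="IMAGENET1K_V1")
model.classifier[1] = torch.nn.Linear(
    model.classifier[1].in_features, num_classes  # replace the ImageNet head
)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()
# ...standard supervised loop over labelled chest X-ray batches...
```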

[513] Targeted Semantic Segmentation of Himalayan Glacial Lakes Using Time-Series SAR: Towards Automated GLOF Early Warning

Pawan Adhikari, Satish Raj Regmi, Hari Ram Shrestha

Main category: eess.IV

TL;DR: Automated deep learning pipeline using Sentinel-1 SAR time-series for targeted monitoring of high-risk Himalayan glacial lakes, achieving 0.9130 IoU with temporal-first training strategy and operational Docker architecture.

DetailsMotivation: Existing glacial lake monitoring approaches either prioritize spatial coverage for generalistic models or rely on optical imagery hampered by cloud coverage, creating limitations for effective early warning systems.

Method: End-to-end automated pipeline using Sentinel-1 SAR time-series with “temporal-first” training strategy on U-Net with EfficientNet-B3 backbone, trained on curated dataset of 4 Himalayan lakes, plus Dockerized operational architecture with ASF Search API and RESTful endpoints.

Result: Model achieves IoU of 0.9130, validating the success of the temporal-first strategy, and provides operational engineering architecture for automated monitoring.

Conclusion: The system shifts paradigm from static mapping to dynamic automated early warning, providing scalable foundation for future Early Warning Systems development.

Abstract: Glacial Lake Outburst Floods (GLOFs) are one of the most devastating climate change induced hazards. Existing remote monitoring approaches often prioritise maximising spatial coverage to train generalistic models or rely on optical imagery hampered by persistent cloud coverage. This paper presents an end-to-end, automated deep learning pipeline for the targeted monitoring of high-risk Himalayan glacial lakes using time-series Sentinel-1 SAR. We introduce a “temporal-first” training strategy, utilising a U-Net with an EfficientNet-B3 backbone trained on a curated dataset of a cohort of 4 lakes (Tsho Rolpa, Chamlang Tsho, Tilicho and Gokyo Lake). The model achieves an IoU of 0.9130, validating the success and efficacy of the “temporal-first” strategy required for transitioning to Early Warning Systems. Beyond the model, we propose an operational engineering architecture: a Dockerised pipeline that automates data ingestion via the ASF Search API and exposes inference results via a RESTful endpoint. This system shifts the paradigm from static mapping to dynamic and automated early warning, providing a scalable architectural foundation for future development in Early Warning Systems.
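
For concreteness, the named architecture can be instantiated with the common segmentation_models_pytorch package; the library choice and the two-channel SAR input are our assumptions, as the paper does not state its implementation:

```python
import segmentation_models_pytorch as smp
import torch

model = smp.Unet(
    encoder_name="efficientnet-b3",   # EfficientNet-B3 backbone, as in the paper
    encoder_weights="imagenet",
    in_channels=2,                    # e.g. Sentinel-1 VV+VH polarizations (assumed)
    classes=1,                        # binary lake mask
)
mask_logits = model(torch.randn(1, 2, 512, 512))  # one SAR tile -> lake logits
```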

[514] The OCR-PT-CT Project: Semi-Automatic Recognition of Ancient Egyptian Hieroglyphs Based on Metric Learning

David Fuentes-Jimenez, Daniel Pizarro, Álvaro Hernández, Adin Bartoli, César Guerra Méndez, Laura de Diego-Otón, Sira Palazuelos-Cagigas, Carlos Gracia Zamacona

Main category: eess.IV

TL;DR: The OCR-PT-CT project develops a hieroglyph recognition system using Deep Metric Learning that achieves 97.70% accuracy, outperforming Mobilenet (93.87%), especially for underrepresented classes. The system transcribes hieroglyphs into Gardiner’s codes and integrates with the MORTEXVAR dataset.

DetailsMotivation: To transform Egyptology through digital humanities by creating an automated system for recognizing and transcribing ancient Egyptian hieroglyphs from Coffin Texts and Pyramid Texts, addressing the challenges of manual analysis and enabling better integration with existing datasets.

Method: Two approaches were tested: 1) Mobilenet neural network trained on 140 hieroglyph classes, and 2) a novel Deep Metric Learning approach designed for better flexibility with new or data-limited signs. The system processes images from de Buck’s Coffin Texts and Allen’s Pyramid Texts, identifies hieroglyphs, and transcribes them into Gardiner’s codes.

Result: Deep Metric Learning achieved superior performance with 97.70% accuracy compared to Mobilenet’s 93.87%. It showed better handling of class imbalance and underrepresented classes, recognized more hieroglyphs, and demonstrated greater adaptability for new signs with limited training data.

Conclusion: The final system adopts Deep Metric Learning as the default classifier due to its superior accuracy, better performance under class imbalance, and greater flexibility for recognizing new or underrepresented hieroglyph classes. The web tool organizes recognized hieroglyphs by spells and witnesses, storing data in CSV format for integration with the MORTEXVAR dataset.

Abstract: Digital humanities are significantly transforming how Egyptologists study ancient Egyptian texts. The OCR-PT-CT project proposes a recognition method for hieroglyphs based on images of Coffin Texts (CT) from Adriaan de Buck (1935-1961) and Pyramid Texts (PT) from Middle Kingdom coffins (James Allen, 2006). The system identifies hieroglyphs and transcribes them into Gardiner’s codes. A web tool organizes them by spells and witnesses, storing the data in CSV format for integration with the MORTEXVAR dataset, which collects Coffin Texts with metadata, transliterations, and translations for research. Recognition has been addressed in two ways: a Mobilenet neural network trained on 140 hieroglyph classes achieved 93.87% accuracy but struggled with underrepresented classes. A novel Deep Metric Learning approach improves flexibility for new or data-limited signs, achieving 97.70% accuracy and recognizing more hieroglyphs. Due to its superior performance under class imbalance and adaptability, the final system adopts Deep Metric Learning as the default classifier.
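
A sketch of why metric learning helps with new or rare signs: an embedding network trained so that distance reflects glyph similarity lets a new class be added from a few reference images, with no classifier retraining. The function names here are ours, not the project's:

```python
import torch
import torch.nn.functional as F

def classify_by_prototype(embed, query_img: torch.Tensor,
                          class_prototypes: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype classification in a learned embedding space.

    `embed` is an embedding network (trained e.g. with a triplet loss);
    `class_prototypes` is a (num_classes, d) tensor of averaged reference
    embeddings, one row per Gardiner code."""
    q = F.normalize(embed(query_img), dim=-1)         # (1, d) query embedding
    protos = F.normalize(class_prototypes, dim=-1)    # (num_classes, d)
    sims = q @ protos.T                               # cosine similarities
    return sims.argmax(dim=-1)                        # index of nearest code
```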

[515] Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission

Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, Xuelong Li

Main category: eess.IV

TL;DR: GVC achieves extreme video compression (0.02%) using generative models, shifting computation burden to receiver for bandwidth-constrained applications.

DetailsMotivation: To achieve extreme video compression rates (as low as 0.01%) for bandwidth-constrained environments like emergency rescue, remote surveillance, and mobile edge computing, while maintaining perception-centric communication.

Method: Generative Video Compression (GVC) framework that encodes video into extremely compact representations and uses generative priors at the receiver to synthesize high-quality video from minimal transmitted information, with compression-computation trade-off strategies for practical deployment.

Result: Achieves compression rates as low as 0.02% in some cases, enabling video communication in bandwidth-constrained environments while maintaining practical deployment on consumer-grade GPUs.

Conclusion: GVC provides a viable path toward an effective, efficient, scalable, and practical video communication paradigm for extreme bandwidth-constrained applications through generative model-based compression.

Abstract: Whether a video can be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve the compression rate as 0.02% at some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon-Weaver model. Besides, How we trade computation for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression-computation trade-off strategy, enabling fast inference on consume-grade GPUs. Within the AI Flow framework, GVC opens new possibility for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward a new effective, efficient, scalable, and practical video communication paradigm.

[516] Automated Classification of First-Trimester Fetal Heart Views Using Ultrasound-Specific Self-Supervised Learning

Youssef Megahed, Aylin Erman, Robin Ducharme, Mark C. Walker, Steven Hawken, Adrian D. C. Chan

Main category: eess.IV

TL;DR: Self-supervised ultrasound foundation model (USF-MAE) outperforms supervised baselines for first-trimester fetal heart view classification, achieving 90.57% accuracy without complex preprocessing.

DetailsMotivation: Congenital heart disease is a major cause of neonatal morbidity/mortality. Early detection via first-trimester fetal echocardiography is challenging due to small structures, low SNR, and operator variability, requiring automated analysis solutions.

Method: USF-MAE pretrained using masked autoencoding on 370K+ unlabelled ultrasound images across 40+ anatomical regions, then fine-tuned on 6,720 first-trimester fetal echocardiography images to classify 5 view categories. Compared against ResNet-18, ResNet-50, and ImageNet-pretrained ViT-B/16 baselines.

Result: USF-MAE achieved best performance: 90.57% accuracy, 91.15% precision, 90.57% recall, 90.71% F1-score. Outperformed strongest baseline (ResNet-18) by +2.03% accuracy and +1.98% F1-score. Showed robust performance without aggressive preprocessing and better discrimination of non-diagnostic frames.

Conclusion: Self-supervised ultrasound foundation models offer superior performance for early fetal heart view classification compared to supervised CNNs and natural image-pretrained transformers, enabling more reliable automated analysis for congenital heart disease screening.

Abstract: Congenital heart disease remains the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. Although first-trimester fetal echocardiography offers an opportunity for earlier detection, automated analysis at this stage is challenging due to small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability. In this work, we evaluate a self-supervised ultrasound foundation model, USF-MAE, for first-trimester fetal heart view classification. USF-MAE is pretrained using masked autoencoding on more than 370,000 unlabelled ultrasound images spanning over 40 anatomical regions and is subsequently fine-tuned for downstream classification. As a proof of concept, the pretrained Vision Transformer encoder was fine-tuned on an open-source dataset of 6,720 first-trimester fetal echocardiography images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. Model performance was benchmarked against supervised convolutional neural network baselines (ResNet-18 and ResNet-50) and a Vision Transformer (ViT-B/16) model pretrained on natural images (ImageNet-1k). All models were trained and evaluated using identical preprocessing, data splits, and optimization protocols. On an independent test set, USF-MAE achieved the highest performance across all evaluation metrics, with 90.57% accuracy, 91.15% precision, 90.57% recall, and 90.71% F1-score. This represents an improvement of +2.03% in accuracy and +1.98% in F1-score compared with the strongest baseline, ResNet-18. The proposed approach demonstrated robust performance without reliance on aggressive image preprocessing or region-of-interest cropping and showed improved discrimination of non-diagnostic frames.
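
A sketch of the masked-autoencoding pretext task behind the pretraining: hide most image patches, encode only the visible ones, and have a decoder reconstruct the rest. The 0.75 ratio is the common MAE default and an assumption here, not a stated USF-MAE value:

```python
import torch

def random_patch_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Select a random subset of patches to keep visible.

    patches: (batch, num_patches, dim). Returns the visible patches for the
    encoder plus the kept indices, so the decoder can place reconstructions."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]       # random subset per image
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    return visible, keep_idx
```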

[517] An Adaptive, Disentangled Representation for Multidimensional MRI Reconstruction

Ruiyang Zhao, Fan Lam

Main category: eess.IV

TL;DR: A novel MRI reconstruction method using disentangled feature representation with latent diffusion models achieves improved performance without task-specific training.

DetailsMotivation: To address the challenge of reconstructing multidimensional MRI data when limited task-specific training data is available, by developing a representation that can exploit feature correlations and incorporate pre-learned priors.

Method: Uses encoder-decoder network with style-based decoder for feature disentanglement, latent diffusion models for feature space constraints, zero-shot self-supervised learning adaptation, and subspace modeling integrated into new reconstruction formulations.

Result: Achieved improved performance over state-of-the-art methods on accelerated T1 and T2 parameter mapping, without requiring task-specific supervised training or fine-tuning.

Conclusion: Provides a new strategy for learning-based multidimensional image reconstruction that works effectively with limited task-specific training data through disentangled feature representation and pre-learned priors.

Abstract: We present a new approach for representing and reconstructing multidimensional magnetic resonance imaging (MRI) data. Our method builds on a novel, learned feature-based image representation that disentangles different types of features, such as geometry and contrast, into distinct low-dimensional latent spaces, enabling better exploitation of feature correlations in multidimensional images and incorporation of pre-learned priors specific to different feature types for reconstruction. More specifically, the disentanglement was achieved via an encoder-decoder network and image transfer training using large public data, enhanced by a style-based decoder design. A latent diffusion model was introduced to impose stronger constraints on distinct feature spaces. New reconstruction formulations and algorithms were developed to integrate the learned representation with a zero-shot self-supervised learning adaptation and subspace modeling. The proposed method has been evaluated on accelerated T1 and T2 parameter mapping, achieving improved performance over state-of-the-art reconstruction methods, without task-specific supervised training or fine-tuning. This work offers a new strategy for learning-based multidimensional image reconstruction where only limited data are available for problem-specific or task-specific training.

[518] Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge

Yudi Sang, Yanzhen Liu, Sutuke Yibulayimu, Yunning Wang, Benjamin D. Killeen, Mingxu Liu, Ping-Cheng Ku, Ole Johannsen, Karol Gotkowski, Maximilian Zenk, Klaus Maier-Hein, Fabian Isensee, Peiyan Yue, Yi Wang, Haidong Yu, Zhaohong Pan, Yutong He, Xiaokun Liang, Daiqi Liu, Fuxin Fan, Artur Jurgas, Andrzej Skalski, Yuxi Ma, Jing Yang, Szymon Płotka, Rafał Litka, Gang Zhu, Yingchun Song, Mathias Unberath, Mehran Armand, Dan Ruan, S. Kevin Zhou, Qiyong Cao, Chunpeng Zhao, Xinbao Wu, Yu Wang

Main category: eess.IV

TL;DR: The PENGWIN challenge benchmarked pelvic fracture segmentation algorithms on CT and X-ray images, with top CT models achieving 0.930 IoU but X-ray models only reaching 0.774 IoU, highlighting challenges in projection imaging and the need for interactive approaches.

DetailsMotivation: Pelvic fracture segmentation is crucial for trauma diagnosis and surgical planning, but remains challenging due to complex anatomy and imaging limitations. The PENGWIN challenge aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms.

Method: Organized as a MICCAI 2024 satellite event, the challenge used a diverse dataset of 150 CT scans from multiple centers and simulated X-ray images generated via DeepDRR. 16 teams submitted algorithms evaluated under a rigorous multi-metric testing scheme.

Result: Top CT algorithm achieved 0.930 average fragment-wise IoU (satisfactory accuracy), while best X-ray algorithm reached only 0.774 IoU (promising but insufficient for intra-operative use). The challenge revealed methodological diversity in algorithm design and uncertainties in fragment definition.

Conclusion: While CT segmentation shows satisfactory performance, X-ray segmentation remains challenging due to fragment overlap. Interactive segmentation approaches integrating human decision-making may be essential for improving model reliability and clinical applicability.

Abstract: The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm achieved an IoU of 0.774, which is promising but not yet sufficient for intra-operative decision-making, reflecting the inherent challenges of fragment overlap in projection imaging. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
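
A sketch of the headline metric, fragment-wise IoU over a labelled fragment map. Greedy best-overlap matching is our simplification; the challenge's official matching protocol may differ:

```python
import numpy as np

def fragment_wise_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average IoU over ground-truth fragments.

    pred/gt are integer label maps (0 = background, k = fragment id);
    each ground-truth fragment is matched to its best-overlapping
    predicted fragment."""
    ious = []
    for g in np.unique(gt):
        if g == 0:
            continue
        gt_mask = gt == g
        best = 0.0
        for p in np.unique(pred[gt_mask]):   # only predictions that touch this fragment
            if p == 0:
                continue
            pred_mask = pred == p
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            best = max(best, inter / union)
        ious.append(best)
    return float(np.mean(ious)) if ious else 0.0
```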

[519] Hybrid Learning: A Novel Combination of Self-Supervised and Supervised Learning for Joint MRI Reconstruction and Denoising in Low-Field MRI

Haoyang Pei, Nikola Janjuvsevic, Renqing Luo, Ding Xia, Xiang Xu, William Moore, Yao Wang, Hersh Chandarana, Li Feng

Main category: eess.IV

TL;DR: Hybrid learning combines self-supervised and supervised learning for MRI reconstruction when only low-SNR training references are available, improving image quality over existing methods.

DetailsMotivation: Conventional supervised learning requires high-quality, high-SNR references that are difficult to obtain, especially in low-field MRI. Self-supervised learning doesn't need references but performs poorly with low baseline SNR. There's a need for effective reconstruction methods when only low-SNR training data is available.

Method: Two-stage training framework: 1) Self-supervised learning on fully sampled low-SNR data to generate higher-quality pseudo-references, 2) Supervised learning using these pseudo-references as targets to reconstruct and denoise undersampled noisy data. Works for both Cartesian and non-Cartesian acquisitions.

Result: Hybrid learning consistently improved image quality over standard self-supervised learning and supervised learning with noisy references across different acceleration rates, noise levels, and field strengths. Achieved higher SSIM and lower NMSE in simulated and real low-field MRI experiments in lung and brain.

Conclusion: Hybrid learning provides an effective solution for training deep MRI reconstruction models without high-SNR references. It improves image quality in low-SNR settings, particularly for low-field MRI, and could enable broader clinical adoption of deep learning-based reconstruction methods.

Abstract: Deep learning has demonstrated strong potential for MRI reconstruction. However, conventional supervised learning requires high-quality, high-SNR references for network training, which are often difficult or impossible to obtain in different scenarios, particularly in low-field MRI. Self-supervised learning provides an alternative by removing the need for training references, but its reconstruction performance can degrade when the baseline SNR is low. To address these limitations, we propose hybrid learning, a two-stage training framework that integrates self-supervised and supervised learning for joint MRI reconstruction and denoising when only low-SNR training references are available. Hybrid learning is implemented in two sequential stages. In the first stage, self-supervised learning is applied to fully sampled low-SNR data to generate higher-quality pseudo-references. In the second stage, these pseudo-references are used as targets for supervised learning to reconstruct and denoise undersampled noisy data. The proposed technique was evaluated in multiple experiments involving simulated and real low-field MRI in the lung and brain at different field strengths. Hybrid learning consistently improved image quality over both standard self-supervised learning and supervised learning with noisy training references at different acceleration rates, noise levels, and field strengths, achieving higher SSIM and lower NMSE. The hybrid learning approach is effective for both Cartesian and non-Cartesian acquisitions. Hybrid learning provides an effective solution for training deep MRI reconstruction models in the absence of high-SNR references. By improving image quality in low-SNR settings, particularly for low-field MRI, it holds promise for broader clinical adoption of deep learning-based reconstruction methods.

[520] GroundGazer: Camera-based indoor localization of mobile robots with millimeter accuracy at low cost

Sven Hinderer, Jakob Hüsken, Bohan Sun, Bin Yang

Main category: eess.IV

TL;DR: GroundGazer: A low-cost, high-accuracy indoor localization system for autonomous mobile robots using monocular camera and chessboard floor pattern

DetailsMotivation: Existing high-accuracy indoor localization systems (LiDAR, tachymeters, motion capture) are very expensive, creating a need for affordable alternatives with millimeter positioning accuracy

Method: Uses a monocular fisheye camera, chessboard floor pattern, and optional laser diode to estimate robot position and heading through visual pattern recognition

Result: Achieves millimeter positioning accuracy and sub-degree heading accuracy for autonomous mobile robots

Conclusion: GroundGazer provides a simple, low-cost, portable, robust, and scalable solution for indoor localization that is easy to set up and potentially extendable to 3D position and orientation estimation

Abstract: Highly accurate indoor localization systems with mm positioning accuracy are currently very expensive. They include range finders (such as LiDAR), tachymeters, and motion capture systems relying on multiple high-end cameras. In this work, we introduce a high-accuracy, planar indoor localization system named GroundGazer (GG) for autonomous mobile robots (AMRs). GG estimates the AMR’s position with mm and its heading with sub-degree accuracy. The system requires only a monocular (fisheye) camera, a chessboard floor, and an optional laser diode. Our system is simple and low-cost, easy to set up, portable, robust, scalable to large areas and robot swarms, and potentially extendable to 3D position and orientation estimation.
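
A hedged sketch of the core estimation step the summary describes: detect chessboard corners on the floor and recover the camera pose with a PnP solve. Board size, square size, and intrinsics are placeholders, and the actual GG pipeline (fisheye model, laser diode, heading extraction) is more involved:

```python
import cv2
import numpy as np

pattern = (9, 6)                      # inner corners of the floor pattern (assumed)
square = 0.10                         # square edge length in metres (assumed)
obj_pts = np.zeros((pattern[0] * pattern[1], 3), np.float32)
obj_pts[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
found, corners = cv2.findChessboardCorners(img, pattern)
if found:
    K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])  # intrinsics (placeholder)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, corners, K, None)
    # tvec gives the board-to-camera translation (invert for robot position);
    # the in-plane component of rvec gives the heading.
```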

[521] Single-View Tomographic Reconstruction Using Learned Primal Dual

Sean Breckling, Matthew Swan, Keith D. Tan, Derek Wingard, Brandon Baldonado, Yoohwan Kim, Ju-Yeon Jo, Evan Scott, Jordan Pillow

Main category: eess.IV

TL;DR: The paper investigates the Learned Primal Dual method’s performance for single-view tomographic reconstruction of axially-symmetric targets, comparing it against traditional numerical inversion methods.

DetailsMotivation: To explore how the LPD method performs in extreme tomographic scenarios - specifically single-view reconstructions of axially-symmetric targets, where traditional methods face significant challenges due to limited data.

Method: The study uses two modalities: low-divergence/parallel X-rays and cone-beam X-ray imaging. Training data is generated using closed-form integral transforms or physics-based ray-tracing software, then corrupted with blur and noise. Performance is evaluated against common numerical inversion methodologies.

Result: The paper presents results comparing LPD performance against traditional numerical inversion methods for single-view tomographic reconstruction, though specific quantitative results are not provided in the abstract.

Conclusion: The study investigates LPD’s potential for extreme tomographic scenarios, suggesting it may offer advantages over traditional methods for single-view reconstructions of axially-symmetric targets, though full conclusions would require examining the complete paper results.

Abstract: The Learned Primal Dual (LPD) method has shown promising results in various tomographic reconstruction modalities, particularly under challenging acquisition restrictions such as limited viewing angles or a limited number of views. We investigate the performance of LPD in a more extreme case: single-view tomographic reconstructions of axially-symmetric targets. This study considers two modalities: the first assumes low-divergence or parallel X-rays. The second models a cone-beam X-ray imaging testbed. For both modalities, training data is generated using closed-form integral transforms, or physics-based ray-tracing software, then corrupted with blur and noise. Our results are then compared against common numerical inversion methodologies.
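
For readers unfamiliar with LPD, here is a minimal rendering of the standard Adler-Öktem iteration the paper builds on (our sketch, not the authors' code): small CNNs replace the proximal steps, and `A`/`At` are abstract callables for the forward projection and its adjoint:

```python
import torch
import torch.nn as nn

class LearnedPrimalDual(nn.Module):
    """Unrolled primal-dual iterations with learned update networks."""

    def __init__(self, A, At, n_iter: int = 10, ch: int = 32):
        super().__init__()
        self.A, self.At, self.n_iter = A, At, n_iter
        self.dual_nets = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(ch, 1, 3, padding=1)) for _ in range(n_iter)]
        )
        self.primal_nets = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(ch, 1, 3, padding=1)) for _ in range(n_iter)]
        )

    def forward(self, y):
        h = torch.zeros_like(y)        # dual variable, measurement (sinogram) space
        f = self.At(y)                 # primal variable, image space (warm start)
        for k in range(self.n_iter):
            h = h + self.dual_nets[k](torch.cat([h, self.A(f), y], dim=1))
            f = f + self.primal_nets[k](torch.cat([f, self.At(h)], dim=1))
        return f
```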

Last updated: 2026-01-21