Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 110]
- cs.CV [Total: 182]
- cs.AI [Total: 48]
- cs.SD [Total: 14]
- cs.LG [Total: 154]
- cs.MA [Total: 1]
- cs.MM [Total: 1]
- eess.AS [Total: 1]
- eess.IV [Total: 10]
cs.CL
[1] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration
Zahra Abedi, Richard M. K. van Dijk, Gijs Wijnholds, Tessa Verhoef
Main category: cs.CL
TL;DR: Researchers developed an automated pipeline combining OCR, LLM-based interpretation, and database linking to digitize and harmonize historical professor records from Leiden University books with existing database records.
Details
Motivation: To create an automated system for digitizing and analyzing historical biographic data from Leiden University professor records (1575-1815) and harmonizing it with existing high-quality database records, addressing challenges in digital humanities research.
Method: Used OCR techniques, generative AI with decoding constraints for structured data extraction, and database linkage methods to process typewritten historical records into digital format and link them with existing database records.
Result: OCR achieved CER 1.08% and WER 5.06%; JSON extraction accuracy was 63% from OCR text and 65% from annotated OCR; record linkage achieved 94% accuracy with annotated JSON and 81% with OCR-derived JSON, showing generative AI can correct low OCR performance.
Conclusion: The study contributes to digital humanities by providing an automated pipeline for interpreting historical documents, addressing layout variability and terminology differences, and demonstrating the applicability of advanced generative AI models for historical data harmonization.
Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
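For readers who want to reproduce the error metrics quoted above, CER and WER are simply edit distance normalized by reference length at the character and word level. A minimal, dependency-free sketch (a generic illustration, not the authors' evaluation code):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[-1] + 1,             # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: character edits per reference character."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: word edits per reference word."""
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("hoogleraren", "hoogleraren"))              # 0.0
print(wer("professor of law", "professor of laws"))   # 1 substitution / 3 words
```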
[2] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations
Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano
Main category: cs.CL
TL;DR: CAT is a framework for evaluating LLMs by visualizing the interplay between accuracy and response consistency using CAR curves and CORE index metrics.
Details
Motivation: Current LLM evaluation focuses mainly on accuracy/benchmark scores, but consistency is crucial for real-world applications. Both dimensions should be evaluated independently and interdependently for nuanced assessment.
Method: Introduces CAR curves (Consistency-Accuracy Relation) showing how accuracy varies with increasing consistency requirements via the Minimum-Consistency Accuracy metric. Also proposes the CORE index (Consistency-Oriented Robustness Estimate), combining the area and shape of the CAR curve to quantify the accuracy-consistency trade-off.
Result: Demonstrated framework across diverse generalist and domain-specific LLMs on multiple MC benchmarks. Framework can be extended beyond MC tasks to long-form, open-ended evaluations through adaptable scoring functions.
Conclusion: CAT provides a comprehensive framework for evaluating LLMs by considering both accuracy and consistency interdependently, offering visualization tools and quantitative metrics for better assessment of model reliability in real-world applications.
Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.
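To make the MCA idea concrete, here is a toy sketch of one plausible reading of the metric (our interpretation of the summary above, not the paper's exact definition): keep only questions whose repeated answers under input variations agree at least a fraction c of the time, score accuracy on those, and sweep c to trace a CAR-style curve.

```python
from collections import Counter

def consistency(answers):
    """Majority answer and the fraction of repeats that agree with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers), answer

def car_curve(items, thresholds=(0.0, 0.5, 0.8, 1.0)):
    """Accuracy restricted to items whose consistency is at least c,
    for each threshold c (an assumed reading of the MCA metric)."""
    points = []
    for c in thresholds:
        correct = []
        for gold, answers in items:
            score, majority = consistency(answers)
            if score >= c:
                correct.append(majority == gold)
        points.append((c, sum(correct) / len(correct) if correct else None))
    return points

# Each item: (gold answer, answers sampled under controlled input variations).
items = [("B", ["B", "B", "B"]), ("A", ["A", "C", "A"]), ("D", ["C", "C", "B"])]
print(car_curve(items))  # accuracy typically shifts as c increases
```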
[3] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability
Guanghui Wang, Jinze Yu, Xing Zhang, Dayuan Jiang, Yin Song, Tomal Deb, Xuefeng Liu, Peiyang He
Main category: cs.CL
TL;DR: A framework for evaluating and improving consistency in LLM-generated structured outputs using STED (Semantic Tree Edit Distance) and consistency scoring.
Details
Motivation: LLMs are increasingly used for structured data generation, but output consistency is critical for production applications. Current evaluation metrics don't adequately balance semantic flexibility with structural strictness for JSON outputs.
Method: Combines: (1) STED - a novel similarity metric balancing semantic flexibility with structural strictness for comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability.
Result: STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics. Benchmarking six LLMs reveals significant variations: Claude-3.7-Sonnet shows exceptional consistency, while others like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning.
Conclusion: The framework enables practical applications including targeted model selection, iterative prompt refinement, and diagnostic analysis. Provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T = 0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
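The paper's STED algorithm is not reproduced here, but the following toy similarity conveys the target behavior described above: near-full credit when two JSON trees share structure and semantically equivalent leaves, zero on structural breaks. Naive string normalization stands in for real semantic matching.

```python
def sim(a, b):
    """Toy structural/semantic similarity for JSON-like values in [0, 1]."""
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        # A key present on only one side scores 0: a structural break there.
        return sum(sim(a[k], b[k]) if k in a and k in b else 0.0
                   for k in keys) / len(keys)
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return 0.0
        return sum(map(sim, a, b)) / len(a) if a else 1.0
    if isinstance(a, (dict, list)) or isinstance(b, (dict, list)):
        return 0.0  # type mismatch = structural break
    # Leaf comparison: naive normalization as a placeholder for semantics.
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0

print(sim({"name": "Ada", "age": 36}, {"name": "ada ", "age": "36"}))  # 1.0
print(sim({"name": "Ada"}, ["Ada"]))                                   # 0.0
```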
[4] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents
Jahidul Islam, Md Ataullha, Saiful Azad
Main category: cs.CL
TL;DR: BanglaCodeAct: An agent-based framework using multilingual LLMs with Thought-Code-Observation loop for Bangla-to-Python code generation, achieving state-of-the-art performance without task-specific fine-tuning.
Details
Motivation: LLMs excel at English-to-code generation but this progress hasn't extended to low-resource languages like Bangla, creating a gap in accessibility for non-English speakers.
Method: BanglaCodeAct framework uses multi-agent prompting and iterative self-correction with open-source multilingual LLMs in a Thought-Code-Observation loop for dynamic code generation, testing, and refinement from Bangla instructions.
Result: Qwen3-8B with BanglaCodeAct achieves best performance: 94.0% pass@1 on development set and 71.6% on blind test set of mHumanEval dataset for Bangla NL2Code, establishing new benchmark.
Conclusion: Agent-based reasoning enables reliable code generation in low-resource languages without task-specific fine-tuning, establishing new state-of-the-art for Bangla-to-Python translation.
Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at github.com/jahidulzaid/PyBanglaCodeActAgent.
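A minimal sketch of a Thought-Code-Observation loop of the kind described above, with a hypothetical llm() client standing in for the Qwen3-8B backend (an illustration, not the authors' implementation; the unsandboxed exec is for brevity only):

```python
import traceback

def llm(prompt: str) -> str:
    """Hypothetical call to a multilingual code LLM (plug in a real client)."""
    raise NotImplementedError

def banglacodeact_loop(bangla_instruction, tests, max_rounds=3):
    """Iteratively generate, execute, and refine code from a Bangla prompt."""
    observation = ""
    for _ in range(max_rounds):
        prompt = (f"Instruction (Bangla): {bangla_instruction}\n"
                  f"Previous observation: {observation}\n"
                  "Think step by step, then output only Python code.")
        code = llm(prompt)                    # the Thought + Code steps
        try:
            namespace = {}
            exec(code, namespace)             # run the candidate solution
            exec(tests, namespace)            # run unit tests against it
            return code                       # all tests passed
        except Exception:
            observation = traceback.format_exc()  # the Observation step
    return None  # failed to self-correct within the round budget
```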
[5] MiMo-Audio: Audio Language Models are Few-Shot Learners
Xiaomi LLM-Core Team: Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang
Main category: cs.CL
TL;DR: MiMo-Audio scales audio language model pretraining to 100M+ hours, achieving emergent few-shot learning and SOTA performance across diverse audio tasks without task-specific fine-tuning.
Details
Motivation: Current audio models require task-specific fine-tuning, unlike humans who generalize from few examples. The paper aims to bring GPT-3's scaling paradigm to audio for better generalization.
Method: Scale pretraining data to 100M+ hours using next-token prediction, then apply instruction-tuning with thinking mechanisms for both audio understanding and generation tasks.
Result: MiMo-Audio-7B-Base achieves SOTA among open-source models on speech/audio benchmarks, generalizes to unseen tasks (voice conversion, style transfer), and generates realistic speech continuations. MiMo-Audio-7B-Instruct reaches SOTA on multiple benchmarks, approaching closed-source models.
Conclusion: Scaling audio pretraining enables emergent few-shot learning and strong generalization, similar to text models. The approach bridges the gap between specialized audio models and human-like generalization capabilities.
Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio’s pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
[6] PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents
Tingwei Xie, Tianyi Zhou, Yonghong Song
Main category: cs.CL
TL;DR: PharmaShip is a Chinese pharmaceutical shipping document dataset for testing text-layout models on noisy OCR and diverse templates, with three tasks (SER, RE, ROP) and entity-centric evaluation.
Details
Motivation: To create a controlled benchmark for safety-critical document understanding in pharmaceutical domain that stresses text-layout models with real-world challenges like noisy OCR and heterogeneous templates.
Method: Created PharmaShip dataset with scanned pharmaceutical shipping documents, designed three complementary tasks (SER, RE, ROP), used entity-centric evaluation protocol, benchmarked five representative baselines (LiLT, LayoutLMv3-base, GeoLayoutLM and RORE variants), standardized preprocessing and optimization.
Result: Pixels and explicit geometry provide complementary inductive biases but neither alone is sufficient; reading-order-oriented regularization improves SER and EL; longer positional coverage stabilizes late-page predictions; ROP is accurate at word level but challenging at segment level.
Conclusion: PharmaShip establishes a reproducible benchmark for pharmaceutical document understanding and shows sequence-aware constraints are transferable bias for structure modeling.
Abstract: We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks: sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP). It adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at https://github.com/KevinYuLei/PharmaShip.
[7] Noise-Driven Persona Formation in Reflexive Neural Language Generation
Toshiyuki Shigemura
Main category: cs.CL
TL;DR: LN-RP is a computational framework that uses stochastic noise injection to study how personas emerge in LLMs, revealing three stable persona modes with distinct entropy patterns.
Details
Motivation: To develop a reproducible method for studying noise-driven persona emergence, reflexive generation dynamics, and long-range linguistic coherence in large language models.
Method: Inject stochastic noise seeds into the initial generation state and analyze linguistic behavior across 152 generation cycles to observe nonlinear transitions and phase transitions.
Result: Revealed three stable persona modes with distinct entropy signatures, demonstrated reliable induction of phase transitions via external noise, and confirmed consistent persona retention with significant differences across modes (p < 0.01).
Conclusion: LN-RP provides a reproducible framework for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs through controlled noise injection.
Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.
[8] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate
Shenzhe Zhu
Main category: cs.CL
TL;DR: HarmTransform is a multi-agent debate framework that systematically transforms harmful queries into stealthier forms to improve LLM safety training data, outperforming baselines but revealing debate’s dual nature as both beneficial and potentially problematic.
Details
Motivation: Current LLM safety mechanisms focus on overtly dangerous content but fail to detect subtle threats where users disguise harmful intent through covert rephrasing, creating a significant gap in safety training data.
Method: HarmTransform uses a multi-agent debate framework with iterative critique and refinement among multiple agents to systematically transform harmful queries into stealthier forms while preserving underlying harmful intent.
Result: HarmTransform significantly outperforms standard baselines in producing effective query transformations, but analysis reveals debate acts as a double-edged sword - improving stealth while potentially introducing topic shifts and unnecessary complexity.
Conclusion: Multi-agent debate shows promise for generating comprehensive safety training data but has limitations, highlighting the need for balanced approaches that leverage debate’s benefits while mitigating its drawbacks.
Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.
[9] Emergent World Beliefs: Exploring Transformers in Stochastic Games
Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu
Main category: cs.CL
TL;DR: LLMs trained on poker hand history data learn to represent both deterministic game structure and stochastic features like equity, developing internal representations that correlate with theoretical belief states in partially observable environments.
Details
Motivation: To investigate whether LLMs can develop emergent world models in domains of incomplete information (POMDPs), extending prior work on perfect information games, using poker as a canonical test case.
Method: Pretrained a GPT-style model on Poker Hand History (PHH) data and probed its internal activations using primarily nonlinear probes to analyze learned representations.
Result: The model learns both deterministic structure (hand ranks) and stochastic features (equity) without explicit instruction, and these representations are decodeable and correlate with theoretical belief states.
Conclusion: LLMs can learn their own representation of stochastic environments in partially observable domains like poker, suggesting they develop internal world models even in incomplete information settings.
Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrated that these representations are decodeable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold’em Poker.
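For readers new to probing, a minimal sketch of the general technique: train a nonlinear probe to predict a scalar target, such as equity, from layer activations, and read its held-out fit as evidence the layer encodes that feature. The data below is synthetic, not the paper's poker features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins: layer activations (n_positions x d_model) and a
# scalar target such as hand equity in [0, 1].
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 256))
w = rng.normal(size=256)
equity = 1 / (1 + np.exp(-(acts @ w) / 16))   # hidden signal the probe must find

X_tr, X_te, y_tr, y_te = train_test_split(acts, equity, random_state=0)
probe = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
probe.fit(X_tr, y_tr)
# A high held-out R^2 suggests the activations encode the target feature.
print("probe R^2:", round(probe.score(X_te, y_te), 3))
```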
[10] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts
Hengli Li, Zhaoxin Yu, Qi Shen, Chenxi Li, Mengmeng Wang, Tinglang Wu, Yipeng Kang, Yuxuan Wang, Song-Chun Zhu, Zixia Jia, Zilong Zheng
Main category: cs.CL
TL;DR: BEDA framework formalizes strategic dialogue acts (Adversarial/Alignment) via probabilistic constraints on generation, using belief estimation to guide dialogue acts, achieving state-of-the-art performance across adversarial, cooperative, and negotiation settings.
Details
Motivation: Prior work accurately estimates beliefs in strategic dialogue but lacks principled mechanisms to use those beliefs during generation, creating a gap between belief estimation and strategic dialogue act execution.
Method: BEDA framework formalizes Adversarial and Alignment dialogue acts via probabilistic constraints on generation. It consists of: (1) world set, (2) belief estimator for belief estimation, and (3) conditional generator that selects acts and generates utterances consistent with inferred beliefs.
Result: BEDA consistently outperforms strong baselines across three settings: Conditional Keeper Burglar (CKBG, adversarial) - improves success rate by at least 5.0 points across backbones and 20.6 points with GPT-4.1-nano; Mutual Friends (cooperative) - average improvement of 9.3 points; CaSiNo (negotiation) - achieves optimal deal relative to all baselines.
Conclusion: Casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue, bridging the gap between belief estimation and strategic act execution.
Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.
[11] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection
Anwar Alajmi, Gabriele Pergola
Main category: cs.CL
TL;DR: A two-stage framework combining targeted training with reasoning-based inference to detect subtle, context-dependent sexist content that evades traditional methods.
Details
Motivation: Sexist content online is becoming more subtle and context-dependent, making it hard to detect with traditional methods. Challenges include overlapping linguistic, psychological, legal, and cultural dimensions that create mixed signals, label scarcity, class imbalance, and conceptual ambiguity in both data and model predictions.
Method: Two-stage framework: (1) Training stage uses class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to handle label imbalance and noisy supervision. (2) Inference stage uses dynamic routing - high-confidence cases are classified directly, while uncertain instances are escalated to a Collaborative Expert Judgment (CEJ) module that prompts multiple personas and consolidates their reasoning through a judge model.
Result: Achieves state-of-the-art results: +2.72% improvement in F1 on EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on EDOS Tasks A and B respectively.
Conclusion: The proposed framework effectively addresses the combined challenges of underrepresentation, noise, and conceptual ambiguity in detecting subtle sexist content, outperforming existing methods through its unified approach of targeted training and reasoning-based inference.
Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on the EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on the EDOS Tasks A and B, respectively.
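For reference, a minimal PyTorch sketch of the class-balanced focal loss named in the training stage, in its standard formulation (the alpha weights and gamma below are illustrative, not the authors' values):

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss with per-class weights: alpha_c * (1 - p_t)^gamma * CE.

    logits:  (batch, num_classes) raw scores
    targets: (batch,) integer class labels
    alpha:   (num_classes,) weights, e.g. inverse effective class frequency
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -alpha[targets] * (1 - pt) ** gamma * log_pt
    return loss.mean()

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
alpha = torch.tensor([0.25, 0.75])  # upweight the rarer class (illustrative)
print(class_balanced_focal_loss(logits, targets, alpha))
```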
[12] Break Out the Silverware – Semantic Understanding of Stored Household Items
Michaela Levi-Richter, Reuth Mirsky, Oren Glickman
Main category: cs.CL
TL;DR: The paper introduces the Stored Household Item Challenge benchmark for evaluating robots’ ability to predict where everyday items are stored in homes, and presents NOAM, a hybrid vision-language model that approaches human-level performance on this task.
Details
Motivation: Domestic service robots lack commonsense reasoning to find everyday items stored out of sight in drawers, cabinets, or closets, despite advances in vision and manipulation. A benchmark is needed to evaluate cognitive capabilities for this fundamental household task.
Method: Introduces NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline combining structured scene understanding with LLM inference. It converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer hidden storage locations.
Result: NOAM significantly improves prediction accuracy compared to baselines (random selection, vision-language pipelines, multimodal models) and approaches human-level performance. The benchmark includes two datasets: 100 real-world item-image pairs and 6,500 development pairs with storage polygons.
Conclusion: The Stored Household Item Challenge provides a realistic benchmark for evaluating service robots’ cognitive capabilities. NOAM demonstrates emergent commonsense reasoning and best practices for deploying cognitively capable agents in domestic environments through integrated vision-language approaches.
Abstract: “Bring me a plate.” For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots’ cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants’ kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.
[13] Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization
Yun Tang, Cindy Tseng
Main category: cs.CL
TL;DR: Chunk SSL: A unified self-supervised learning method for both streaming and offline speech pre-training using chunk-based masked prediction with finite scalar quantization and group masking.
Details
Motivation: Current self-supervised learning algorithms assume full utterances, requiring compromises for streaming applications with partial utterances. There's a need for a unified solution that works for both streaming and offline speech pre-training.
Method: Proposes chunk-based SSL with masked prediction loss where acoustic encoder restores masked speech frames using unmasked frames in same and preceding chunks. Uses copy-and-append data augmentation, finite scalar quantization (FSQ) with large codebooks (millions of entries), and group masked prediction to handle computational costs.
Result: Achieves competitive results on Librispeech and Must-C datasets for speech recognition and speech translation tasks in both streaming and offline modes.
Conclusion: Chunk SSL provides an effective unified solution for speech pre-training that works equally well for streaming and offline applications, with large FSQ codebooks improving knowledge transfer to downstream tasks.
Abstract: Low latency speech human-machine communication is becoming increasingly necessary as speech technology advances quickly in the last decade. One of the primary factors behind the advancement of speech technology is self-supervised learning. Most self-supervised learning algorithms are designed with full utterance assumption and compromises have to be made if partial utterances are presented, which are common in streaming applications. In this work, we propose a chunk based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with the masked prediction loss and an acoustic encoder is encouraged to restore indices of those masked speech frames with help from unmasked frames in the same chunk and preceding chunks. A copy and append data augmentation approach is proposed to conduct efficient chunk based pre-training. Chunk SSL utilizes a finite scalar quantization (FSQ) module to discretize input speech features and our study shows a high resolution FSQ codebook, i.e., a codebook with vocabulary size up to a few millions, is beneficial to transfer knowledge from the pre-training task to the downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is examined in two speech to text tasks, i.e., speech recognition and speech translation. Experimental results on the Librispeech and Must-C datasets show that the proposed method could achieve very competitive results for speech to text tasks at both streaming and offline modes.
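A minimal sketch of finite scalar quantization in its usual formulation: bound each dimension, round to a few levels, and read off a mixed-radix code index. The chunking, grouping, and loss details of Chunk SSL are not reproduced here, and the level counts are illustrative.

```python
import torch

def fsq(z, levels):
    """Finite scalar quantization: bounded per-dimension rounding.

    z:      (..., d) real-valued features
    levels: d odd ints; the implicit codebook size is their product,
            e.g. [7] * 8 gives 7**8 ~ 5.8M entries ("a few million").
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2                 # odd levels keep the grid integer-centered
    bounded = torch.tanh(z) * half     # squash each dim into (-half, half)
    quantized = torch.round(bounded)
    # Straight-through estimator: gradients bypass the rounding.
    quantized = bounded + (quantized - bounded).detach()
    # Mixed-radix integer index into the implicit codebook.
    digits = (quantized + half).long()
    index = torch.zeros(z.shape[:-1], dtype=torch.long, device=z.device)
    for i, l in enumerate(levels):
        index = index * l + digits[..., i]
    return quantized, index

z = torch.randn(2, 250, 8)             # (batch, 25Hz frames, dims)
q, idx = fsq(z, [7] * 8)
print(q.shape, idx.shape, int(idx.max()))  # indices < 5,764,801
```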
[14] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
Tiancheng Su, Meicong Zhang, Guoxiu He
Main category: cs.CL
TL;DR: EASD is a training-free speculative decoding enhancement that uses entropy-based penalties to reject low-confidence draft tokens, enabling performance that can surpass the target LLM while maintaining efficiency comparable to standard SD.
Details
Motivation: Standard speculative decoding is limited by excessive alignment between draft and target models, constraining performance to that of the target LLM. The authors aim to overcome this limitation by allowing the draft model to potentially surpass the target model's performance.
Method: EASD builds on standard speculative decoding by adding a dynamic entropy-based penalty. It uses the entropy of sampling distributions to quantify model uncertainty, rejecting tokens when both models show high entropy with substantial overlap in top-N predictions, then re-sampling with the target LLM.
Result: Experiments across multiple reasoning benchmarks show EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. The efficiency of EASD is proven to be comparable to standard SD.
Conclusion: EASD successfully addresses the limitation of standard speculative decoding by incorporating draft-model verification through entropy-based penalties, enabling performance that can exceed the target LLM while maintaining computational efficiency.
Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model’s inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
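A minimal sketch of the entropy-based rejection test as we read it from the summary above (the thresholds and the top-N overlap measure are illustrative assumptions, not the paper's values):

```python
import torch

def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return float(-(probs * probs.clamp_min(1e-12).log()).sum())

def should_reject(draft_probs, target_probs, n=5, h_min=2.0, overlap_min=0.6):
    """EASD-style test: reject the drafted token when BOTH models are
    uncertain (high entropy) AND their top-N candidates largely overlap."""
    both_uncertain = entropy(draft_probs) > h_min and entropy(target_probs) > h_min
    top_d = set(torch.topk(draft_probs, n).indices.tolist())
    top_t = set(torch.topk(target_probs, n).indices.tolist())
    return both_uncertain and len(top_d & top_t) / n >= overlap_min

vocab = 1000
draft = torch.softmax(torch.randn(vocab), dim=-1)
target = torch.softmax(torch.randn(vocab), dim=-1)
if should_reject(draft, target):
    token = torch.multinomial(target, 1)  # re-sample from the target LLM
    print("rejected; re-sampled token", int(token))
```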
[15] Fun-Audio-Chat Technical Report
Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou
Main category: cs.CL
TL;DR: Fun-Audio-Chat is a Large Audio Language Model that addresses temporal resolution mismatch and catastrophic forgetting in joint speech-text models through dual-resolution processing and core-cocktail training.
Details
Motivation: Existing joint speech-text models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge.
Method: 1) Dual-Resolution Speech Representations (DRSR): Shared LLM processes audio at efficient 5Hz via token grouping, while Speech Refined Head generates high-quality tokens at 25Hz. 2) Core-Cocktail Training: Two-stage fine-tuning with intermediate merging to mitigate catastrophic forgetting. 3) Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy.
Result: Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They show competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy.
Conclusion: Fun-Audio-Chat successfully addresses key limitations of existing models by balancing efficiency and quality while retaining text LLM knowledge, achieving strong performance across multiple audio-language tasks without requiring large-scale audio-text pre-training.
Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .
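A minimal sketch of the 25Hz-to-5Hz token grouping implied by DRSR: concatenate each run of five consecutive frame embeddings and project to the LLM width (the dimensions below are illustrative assumptions, not the model's actual sizes):

```python
import torch
import torch.nn as nn

class TokenGrouper(nn.Module):
    """Concatenate each group of 5 consecutive 25Hz frame embeddings and
    project to the LLM hidden size, yielding an effective 5Hz input rate."""
    def __init__(self, d_audio=512, group=5, d_model=4096):
        super().__init__()
        self.group = group
        self.proj = nn.Linear(d_audio * group, d_model)

    def forward(self, x):                  # x: (batch, frames, d_audio)
        b, t, d = x.shape
        t = t - t % self.group             # drop any ragged tail frames
        x = x[:, :t].reshape(b, t // self.group, d * self.group)
        return self.proj(x)                # (batch, frames/5, d_model)

frames_25hz = torch.randn(2, 250, 512)     # 10 s of audio at 25Hz
print(TokenGrouper()(frames_25hz).shape)   # torch.Size([2, 50, 4096])
```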
[16] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection
Amal Alqahtani, Efsun Kayi, Mona Diab
Main category: cs.CL
TL;DR: StressRoBERTa: Cross-condition transfer learning approach for detecting self-reported chronic stress in English tweets, achieving 82% F1-score by training on stress-related disorders (depression, anxiety, PTSD) and outperforming previous systems.
Details
Motivation: Chronic stress is a major public health issue, and social media platforms like Twitter provide valuable data where people share their experiences. There's a need for automated detection methods to identify self-reported chronic stress in social media posts.
Method: StressRoBERTa uses cross-condition transfer learning: RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD), then fine-tuned on the SMM4H 2022 Task 8 dataset for chronic stress detection.
Result: Achieves 82% F1-score on the SMM4H 2022 Task 8 dataset, outperforming the best shared task system (79% F1) by 3 percentage points. Also achieves 81% F1 on Dreaddit dataset, demonstrating transfer from clinical mental health contexts to situational stress discussions.
Conclusion: Cross-condition transfer learning from stress-related disorders (depression, anxiety, PTSD) provides stronger representations for chronic stress detection than general language models or broad mental health models, enabling effective transfer from clinical to situational stress contexts.
Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.
[17] Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms
Himel Ghosh
Main category: cs.CL
TL;DR: Comparative interpretability study of transformer-based bias detection models using SHAP explanations reveals architectural differences affect reliability, with domain-adaptive models producing 63% fewer false positives.
Details
Motivation: Automated bias detection models are widely used but poorly understood in how they make decisions or why they fail, creating a need for interpretability studies to improve reliability and deployment suitability.
Method: Comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on BABE dataset, using SHAP-based explanations to analyze word-level attributions across correct and incorrect predictions.
Result: Both models attend to similar evaluative language categories but differ in signal integration. The bias detector model assigns stronger internal evidence to false positives than true positives, leading to systematic over-flagging of neutral content. The domain-adaptive model shows better alignment between attribution patterns and prediction outcomes, producing 63% fewer false positives. Errors arise from discourse-level ambiguity rather than explicit bias cues.
Conclusion: Interpretability-aware evaluation is crucial for bias detection systems, and architectural/training choices critically affect model reliability and deployment suitability in journalistic contexts, with domain-adaptive approaches showing superior performance.
Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.
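For context, this is roughly how word-level SHAP attributions are produced for a transformer classifier via the shap library's standard transformers integration. The checkpoint name and the label string below are hypothetical stand-ins for the fine-tuned BABE models, not the authors' artifacts.

```python
import shap
from transformers import pipeline

# Hypothetical checkpoint standing in for a BABE-fine-tuned bias detector.
clf = pipeline("text-classification",
               model="your-org/roberta-babe-bias-detector",  # placeholder
               top_k=None)                   # return scores for every class

explainer = shap.Explainer(clf)              # text masker inferred from the pipeline
shap_values = explainer([
    "The senator's reckless scheme predictably collapsed.",
    "The senator's proposal did not pass the committee vote.",
])

# Word-level attributions toward one class for the first sentence; the label
# string depends on the checkpoint's config ("BIASED" is assumed here).
print(shap_values[0, :, "BIASED"].values)
```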
[18] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?
Dingmin Wang, Ji Ma, Shankar Kumar
Main category: cs.CL
TL;DR: Adaptive prompting strategy splits retrieved information into chunks for LLM question answering, balancing relevant vs. irrelevant information while reducing token usage.
Details
Motivation: Longer context windows in LLMs introduce more irrelevant information that hinders generation and degrades performance in retrieval-augmented question answering.
Method: Design adaptive prompting strategy that splits retrieved information into smaller chunks and sequentially prompts LLM to answer using each chunk, with adjustable chunk size for trade-off between relevant and irrelevant information.
Result: Experimental results on three open-domain QA datasets show adaptive strategy matches standard prompting performance while using fewer tokens.
Conclusion: LLMs often generate incorrect answers instead of declining when faced with insufficient information, highlighting need for research to enhance LLMs’ ability to effectively decline requests with inadequate information.
Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model’s generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs’ ability to effectively decline requests when faced with inadequate information.
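A minimal sketch of the adaptive strategy described above: split the retrieved passages into chunks and query sequentially, stopping at the first confident answer. The llm() client and the UNKNOWN abstain convention are assumptions for illustration, not the paper's prompts.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM client; plug in a real API call."""
    raise NotImplementedError

def adaptive_qa(question, passages, chunk_size=2):
    """Prompt the model one chunk of retrieved passages at a time.

    Smaller chunks dilute less irrelevant context per call; larger chunks
    include more potentially relevant evidence per call (the trade-off
    discussed above)."""
    for i in range(0, len(passages), chunk_size):
        context = "\n\n".join(passages[i:i + chunk_size])
        answer = llm(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "Answer from the context, or reply exactly UNKNOWN if it is "
            "insufficient."
        )
        if answer.strip() != "UNKNOWN":
            return answer   # stop early: later chunks are never sent
    return None             # the model declined on every chunk
```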
[19] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
Kaustubh Dhole
Main category: cs.CL
TL;DR: The paper proposes using intermediate attention layer token distributions to generate adversarial examples for LLMs, showing they can degrade evaluation performance while maintaining semantic similarity.
Details
Motivation: Recent mechanistic interpretability research suggests attention layers encode token-level hypotheses that get refined. The authors want to exploit this property to create adversarial examples directly from model-internal token predictions, producing more plausible and internally consistent perturbations than prompt-based or gradient-based attacks.
Method: Extract tokens from intermediate attention layers to use as adversarial perturbations. Evaluate on argument quality assessment using ArgQuality dataset with LLaMA-3.1-Instruct-8B as both generator and evaluator. Test whether these attention-based perturbations can effectively degrade downstream evaluation performance.
Result: Attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to original inputs. However, substitutions from certain layers and token positions can introduce grammatical degradation, limiting practical effectiveness.
Conclusion: Intermediate-layer representations show promise as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines, but current limitations exist due to grammatical degradation issues from certain layer/position substitutions.
Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model’s own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.
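A logit-lens-style sketch of reading token hypotheses from intermediate layers (the paper taps attention-layer token distributions specifically; decoding residual-stream hidden states through the unembedding, as below, is a common approximation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in checkpoint; the paper uses LLaMA-3.1-Instruct-8B.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("The quality of this argument is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode each layer's last-position hidden state through the unembedding
# (skipping the final layer norm for brevity, a common logit-lens shortcut).
unembed = model.get_output_embeddings().weight     # (vocab, d_model)
for layer, h in enumerate(out.hidden_states):
    logits = h[0, -1] @ unembed.T
    # Each layer's top candidate is a potential substitution token.
    print(f"layer {layer:2d} ->", repr(tok.decode(int(logits.argmax()))))
```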
[20] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs
Yukun Zhang, Stefan Elbl Droguett, Samyak Jain
Main category: cs.CL
TL;DR: The paper proposes a multi-retriever RAG system for financial numerical QA, combining domain knowledge retrieval with LLMs to address financial reasoning challenges.
Details
Motivation: Financial numerical QA tasks are challenging due to lack of domain-specific financial knowledge in LLMs, requiring both financial expertise and complex multi-step numeric reasoning.
Method: Implemented a multi-retriever RAG system that retrieves both external domain knowledge and internal question contexts, using SecBERT encoder for domain-specific training and latest LLMs for generation.
Result: Domain-specific training with SecBERT surpassed baseline FinQA model; best prompt-based LLM achieved SOTA with >7% improvement but still below human expert performance. Larger models benefit more from external knowledge despite hallucination risks.
Conclusion: Domain-specific training enhances financial QA performance, with larger LLMs benefiting more from external knowledge. Latest LLMs show improved few-shot numerical reasoning capabilities, though gaps remain with human expertise.
Abstract: This research project addresses errors in financial numerical reasoning Question Answering (QA) tasks that arise from the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLMs to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper’s top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with a significant improvement (>7%), yet it is still below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLMs, optimized for few-shot learning.
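As a concrete picture of the two-retriever setup, here is a minimal sketch: one retriever over external financial domain knowledge and one over the question's own report context, fused into a single prompt. The encoder, corpora, and prompt format are illustrative stand-ins (the paper's SecBERT encoder is swapped here for a generic sentence encoder).

```python
# Minimal multi-retriever RAG sketch under stated assumptions; not the
# paper's implementation. The fusion step is a plain prompt concatenation.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder for SecBERT

def top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)
    docs = encoder.encode(corpus, normalize_embeddings=True)
    scores = (docs @ q.T).ravel()
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str, domain_kb: list[str], report_passages: list[str]) -> str:
    external = top_k(question, domain_kb)         # domain-knowledge retriever
    internal = top_k(question, report_passages)   # question-context retriever
    return (
        "Financial knowledge:\n" + "\n".join(external)
        + "\n\nReport context:\n" + "\n".join(internal)
        + "\n\nQuestion: " + question + "\nAnswer step by step:"
    )
```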
[21] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics
Conrad Borchers, Manit Patel, Seiyon M. Lee, Anthony F. Botelho
Main category: cs.CL
TL;DR: The paper presents an analytics framework that separates student content from teacher grading tendencies to make scoring judgments more transparent and auditable.
Details
Motivation: Automated scoring of open-ended responses often conflates what students wrote with how teachers grade, making it difficult to distinguish actual student understanding from rater tendencies.
Method: Using ASSISTments math responses, the authors model teacher histories as dynamic priors and derive text representations from sentence embeddings, using centering and residualization to mitigate prompt and teacher confounds (sketched after the abstract). They use temporally-validated linear models and projections to surface model disagreements.
Result: Teacher priors heavily influence grade predictions; combining priors with content embeddings yields the strongest results (AUC = 0.815), while content-only models are substantially weaker (AUC = 0.626). Adjusting for rater effects sharpens the content representation and reveals cases where semantic evidence supports understanding.
Conclusion: The framework transforms embeddings from mere features into learning analytics for reflection, enabling examination of where grading practices align or conflict with evidence of student reasoning and learning.
Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC = 0.815), while content-only models remain above chance but substantially weaker (AUC = 0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.
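To make the centering/residualization idea concrete, the synthetic sketch below removes per-teacher means from stand-in embeddings and compares prior-only, content-only, and combined grade predictors. All data are fabricated for illustration, and the paper's temporally-validated setup is replaced by an ordinary holdout split.

```python
# Illustrative residualization against rater confounds; synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 2000, 16
teacher = rng.integers(0, 20, n)       # rater id per response
X = rng.normal(size=(n, d))            # stand-in content embeddings
leniency = rng.normal(size=20)         # each teacher's grading tendency
prior = leniency[teacher]              # a constant-per-teacher "dynamic prior"
y = (X[:, 0] + prior + rng.normal(scale=0.5, size=n) > 0).astype(int)

def residualize(X, groups):
    """Center features within each group to strip group-level confounds."""
    Xr = X.copy()
    for g in np.unique(groups):
        Xr[groups == g] -= X[groups == g].mean(axis=0)
    return Xr

Xr = residualize(X, teacher)
for name, F in {
    "prior only": prior[:, None],
    "content only": Xr,
    "prior + content": np.hstack([prior[:, None], Xr]),
}.items():
    clf = LogisticRegression(max_iter=1000).fit(F[:1500], y[:1500])
    auc = roc_auc_score(y[1500:], clf.predict_proba(F[1500:])[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```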
[22] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling
Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu
Main category: cs.CL
TL;DR: HGMem introduces a hypergraph-based memory mechanism for multi-step RAG that captures high-order correlations between facts, enabling more integrated reasoning compared to traditional passive memory storage.
Details
Motivation: Existing RAG memory designs function as passive storage that accumulates isolated facts, overlooking crucial high-order correlations among primitive facts. This static nature limits representational strength and impact on multi-step reasoning, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts.
Method: HGMem uses a hypergraph-based memory mechanism where memory is represented as a hypergraph with hyperedges corresponding to distinct memory units (a toy data structure follows the abstract). This enables progressive formation of higher-order interactions within memory, connecting facts and thoughts around the focal problem to evolve into an integrated knowledge structure.
Result: Extensive experiments on challenging datasets for global sense-making show that HGMem consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
Conclusion: HGMem extends memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding, providing stronger propositions for deeper reasoning in subsequent steps of multi-step RAG systems.
Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
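The core data-structure idea, memory units as hyperedges over fact nodes, can be pictured with a toy container like the one below. HGMem's actual construction, update, and retrieval policies are not described at this level of detail, so everything here is illustrative.

```python
# Toy hypergraph memory: each memory unit is a hyperedge linking several
# fact nodes, making higher-order correlations among facts first-class.
from collections import defaultdict

class HypergraphMemory:
    def __init__(self):
        self.facts = {}                     # fact_id -> text
        self.hyperedges = {}                # edge_id -> {"facts": set, "summary": str}
        self.fact_to_edges = defaultdict(set)

    def add_fact(self, fact_id, text):
        self.facts[fact_id] = text

    def add_memory_unit(self, edge_id, fact_ids, summary):
        """A hyperedge groups any number of facts under one consolidated thought."""
        self.hyperedges[edge_id] = {"facts": set(fact_ids), "summary": summary}
        for f in fact_ids:
            self.fact_to_edges[f].add(edge_id)

    def related_units(self, fact_id):
        """All memory units a fact participates in (its high-order context)."""
        return [self.hyperedges[e]["summary"] for e in self.fact_to_edges[fact_id]]

mem = HypergraphMemory()
mem.add_fact("f1", "Company A acquired Company B in 2021.")
mem.add_fact("f2", "Company B's founder joined A's board.")
mem.add_fact("f3", "A's stock rose 12% after the deal.")
mem.add_memory_unit("e1", ["f1", "f2", "f3"],
                    "The acquisition reshaped A's leadership and valuation.")
print(mem.related_units("f2"))
```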
[23] Efficient Context Scaling with LongCat ZigZag Attention
Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai
Main category: cs.CL
TL;DR: LoZA is a sparse attention scheme that converts full-attention models to sparse versions with limited compute, enabling efficient processing of up to 1M tokens for long-context applications.
Details
Motivation: To enable efficient long-context processing (up to 1M tokens) with a limited computational budget, addressing the challenges of both prefill-intensive (retrieval-augmented generation) and decode-intensive (tool-integrated reasoning) scenarios.
Method: LongCat ZigZag Attention (LoZA), a sparse attention scheme that transforms existing full-attention models into sparse versions (an illustrative mask sketch follows the abstract). Applied to LongCat-Flash during mid-training to create LongCat-Flash-Exp.
Result: Significant speed-ups in long-context scenarios, enabling processing of up to 1 million tokens. LongCat-Flash-Exp serves as a long-context foundation model with efficient long-term reasoning and long-horizon agentic capabilities.
Conclusion: LoZA provides an effective sparse attention solution for converting full-attention models to handle long contexts efficiently, enabling practical applications requiring million-token processing with limited compute resources.
Abstract: We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
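The abstract does not disclose LoZA's exact sparsity pattern, so the sketch below only illustrates the general mechanism any such scheme relies on: a boolean mask that lets each query attend to a small fraction of keys (here, a causal local window plus strided anchor positions). Keeping that fraction small is what makes million-token prefill and decode tractable; the zigzag pattern itself is not reproduced here.

```python
# Generic sparse-attention mask for illustration only; NOT LoZA's pattern.
import torch

def sparse_attention_mask(seq_len: int, window: int = 256, stride: int = 1024) -> torch.Tensor:
    """True where a query token may attend to a key token (causal)."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    causal = k <= q
    local = (q - k) < window       # recent-context window
    strided = (k % stride) == 0    # a sparse set of anchor positions
    return causal & (local | strided)

mask = sparse_attention_mask(4096)
print(f"fraction of attended pairs: {mask.float().mean().item():.3%}")  # well below full causal attention's ~50%
```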
[24] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards
Zhiming Lin, Kai Zhao, Sophie Zhang, Peilai Yu, Canran Xiao
Main category: cs.CL
TL;DR: CEC-Zero is a zero-supervision reinforcement learning framework that enables LLMs to correct Chinese spelling errors without labeled data, outperforming supervised methods by 10-13 F1 points.
Details
Motivation: Existing Chinese spelling correction methods lack robustness to novel errors and rely on costly annotations. LLMs and supervised methods struggle with real-world text processing due to these limitations.
Method: A reinforcement learning framework that synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement (sketched after the abstract), and optimizes the policy with PPO (Proximal Policy Optimization).
Result: Outperforms supervised baselines by 10-13 F1 points and strong LLM fine-tunes by 5-8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence.
Conclusion: CEC-Zero establishes a label-free paradigm for robust, scalable Chinese spelling correction, unlocking LLM potential in noisy text pipelines without requiring costly annotations.
Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10–13 F$_1$ points and strong LLM fine-tunes by 5–8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
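A hedged sketch of the cluster-consensus reward: each sampled correction is rewarded for semantic fidelity to the noisy input and for agreement with its sibling candidates. The encoder, the mixing weight alpha, and the toy English strings are assumptions; the paper targets Chinese text and feeds such rewards into PPO.

```python
# Sketch of a cluster-consensus reward; weighting and encoder are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def consensus_rewards(noisy_input: str, candidates: list[str], alpha: float = 0.5):
    """Per-candidate reward = alpha * similarity-to-input
    + (1 - alpha) * mean agreement with the other candidates."""
    embs = encoder.encode([noisy_input] + candidates, normalize_embeddings=True)
    src, cand = embs[0], embs[1:]
    sim_to_src = cand @ src                                  # semantic fidelity
    pairwise = cand @ cand.T                                 # candidate agreement
    agreement = (pairwise.sum(axis=1) - 1.0) / (len(candidates) - 1)  # drop self-sim
    return alpha * sim_to_src + (1 - alpha) * agreement

# Toy illustration (the paper's inputs would be Chinese sentences):
print(consensus_rewards("I hvae a drem",
                        ["I have a dream", "I have a dram", "I have a dream"]))
```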
[25] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
Main category: cs.CL
TL;DR: RISE is an unsupervised framework using sparse auto-encoders to discover interpretable reasoning vectors in LLM activations, enabling behavior analysis and control without human supervision.
Details
Motivation: Current approaches to analyzing LLM reasoning rely on human-defined concepts at the word level, which is limited because it's infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space.
Method: Propose the RISE framework: segment chain-of-thought traces into sentence-level steps, then train sparse auto-encoders (SAEs) on step-level activations to discover reasoning vectors (directions in activation space encoding distinct reasoning behaviors); a minimal SAE sketch follows the abstract.
Result: SAEs uncover disentangled features corresponding to interpretable behaviors (reflection, backtracking) that occupy separable regions in decoder space. Interventions on SAE-derived vectors can controllably amplify/suppress reasoning behaviors. SAEs also capture structural properties (response length) and enable discovery of novel behaviors beyond human supervision, including confidence-related vectors.
Conclusion: Unsupervised latent discovery via SAEs shows strong potential for both interpreting and controllably steering reasoning in LLMs, moving beyond limitations of human-supervised approaches.
Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level ‘steps’ and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
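A minimal sparse auto-encoder of the kind trained on step-level activations is sketched below; the dimensions, L1 coefficient, and random stand-in activations are illustrative rather than the paper's settings. After training, the decoder columns serve as candidate reasoning vectors.

```python
# Minimal SAE sketch on stand-in activations; sizes are illustrative.
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))     # non-negative sparse codes
        return self.dec(z), z

sae = SparseAutoEncoder(d_model=4096, d_dict=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 4096)           # stand-in for step-level activations
for _ in range(10):
    recon, z = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()  # MSE + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
# Columns of sae.dec.weight are candidate "reasoning vectors".
```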
[26] WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury
Main category: cs.CL
TL;DR: WISE framework benchmarks lightweight transformer models for distinguishing fake news from satire, finding MiniLM achieves highest accuracy (87.58%) while RoBERTa-base has best ROC-AUC (95.42%).
Details
Motivation: Distinguishing fake news from satire/humor is challenging due to overlapping linguistic features but divergent intent, requiring effective detection systems for real-world misinformation scenarios.
Method: Developed the WISE framework benchmarking 8 lightweight transformer models plus 2 baselines on 20,000 samples from the Fakeddit dataset. Used stratified 5-fold cross-validation with comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC, PR-AUC, MCC, Brier score, ECE); a minimal evaluation harness is sketched after the abstract.
Result: MiniLM achieved highest accuracy (87.58%), RoBERTa-base highest ROC-AUC (95.42%) with strong accuracy (87.36%). DistilBERT offered best efficiency-accuracy trade-off (86.28% accuracy, 93.90% ROC-AUC). Statistical tests confirmed significant performance differences between models.
Conclusion: Lightweight models can match or exceed baseline performance, providing actionable insights for deploying effective misinformation detection systems in resource-constrained real-world settings.
Abstract: Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
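The evaluation protocol is straightforward to reproduce in outline. The harness below runs stratified 5-fold cross-validation with accuracy and ROC-AUC on synthetic data; the linear classifier is a stand-in for the paper's transformer fine-tunes, and the remaining metrics would be added the same way.

```python
# Stratified 5-fold evaluation harness; classifier and data are stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=32, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(skf.split(X, y)):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    prob = clf.predict_proba(X[te])[:, 1]
    print(f"fold {fold}: acc={accuracy_score(y[te], prob > 0.5):.3f} "
          f"auc={roc_auc_score(y[te], prob):.3f}")
```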
[27] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning
Sijia Chen, Di Niu
Main category: cs.CL
TL;DR: iCLP enables LLMs to generate compact latent plans for implicit reasoning, improving accuracy, efficiency, and cross-domain generalization while maintaining interpretability.
Details
Motivation: Current LLMs struggle with generating accurate textual plans due to hallucinations and task diversity. Inspired by human implicit cognition, the authors aim to develop a more efficient and reliable planning mechanism for LLMs.
Method: iCLP distills explicit plans from reasoning trajectories, learns discrete latent representations via a vector-quantized autoencoder with a codebook (see the sketch after the abstract), and fine-tunes LLMs on paired latent plans and reasoning steps.
Result: LLMs with iCLP can plan in latent space while reasoning in language space, showing significant improvements in accuracy and efficiency on mathematical reasoning and code generation tasks, with strong cross-domain generalization.
Conclusion: iCLP successfully enables LLMs to perform implicit planning similar to human subconscious cognition, achieving better performance while preserving the interpretability of chain-of-thought reasoning.
Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.
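The discretization step, a vector-quantized bottleneck with a codebook, follows a standard recipe: each plan embedding snaps to its nearest codebook entry, with a straight-through estimator to pass gradients. The sketch below shows that recipe with illustrative sizes; it is not the paper's exact architecture.

```python
# Standard VQ bottleneck sketch; codebook size and dim are illustrative.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                         # z: (batch, dim)
        d = torch.cdist(z, self.codebook.weight)  # distances to all codes
        idx = d.argmin(dim=1)                     # nearest code per plan
        q = self.codebook(idx)
        q = z + (q - z).detach()                  # straight-through estimator
        return q, idx

vq = VectorQuantizer(num_codes=512, dim=64)
plans = torch.randn(8, 64)                        # stand-in plan encodings
quantized, codes = vq(plans)
print(codes)                                      # discrete latent-plan ids
```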
[28] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models
Rohit Kumar Salla, Manoj Saravanan, Shrikar Reddy Kota
Main category: cs.CL
TL;DR: The paper introduces CRS, a unified framework that combines calibration, robustness, and uncertainty quantification into a single interpretable metric to comprehensively evaluate LLM reliability in critical domains.
Details
Motivation: LLMs are increasingly used in high-stakes domains (healthcare, law, finance) but suffer from overconfident errors, degradation under input shifts, and lack of clear uncertainty estimates. Existing evaluations are fragmented and only address isolated aspects of reliability.
Method: Introduces the Composite Reliability Score (CRS), a unified framework integrating calibration, robustness, and uncertainty quantification (a schematic sketch follows the abstract). Evaluates ten leading open-source LLMs across five QA datasets under baselines, perturbations, and calibration methods.
Result: CRS provides stable model rankings, uncovers hidden failure modes missed by single metrics, and reveals that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
Conclusion: CRS offers a comprehensive evaluation framework for LLM reliability that goes beyond fragmented single-metric approaches, enabling better assessment of models for decision-critical applications.
Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
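The exact CRS aggregation is not given in the abstract, so the sketch below only conveys the shape of such a composite: a calibration term (one minus ECE), a robustness term (accuracy retention under perturbation), and a discrimination term (how well confidence separates right from wrong answers), averaged with assumed equal weights.

```python
# Schematic composite reliability score; the paper's actual terms and
# weights are not published in the abstract, so these are assumptions.
def expected_calibration_error(confs, correct, n_bins=10):
    """Standard ECE over equal-width confidence bins."""
    ece, n = 0.0, len(confs)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confs) if lo < c <= hi]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            avg_conf = sum(confs[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece

def composite_reliability(confs, correct, acc_clean, acc_perturbed):
    calibration = 1.0 - expected_calibration_error(confs, correct)
    robustness = acc_perturbed / max(acc_clean, 1e-9)  # retention under shift
    right = [c for c, ok in zip(confs, correct) if ok]
    wrong = [c for c, ok in zip(confs, correct) if not ok]
    discrimination = 0.5 if not right or not wrong else \
        0.5 + 0.5 * (sum(right) / len(right) - sum(wrong) / len(wrong))
    return (calibration + robustness + discrimination) / 3.0

print(composite_reliability(confs=[0.9, 0.8, 0.95, 0.6, 0.7],
                            correct=[1, 1, 1, 0, 0],
                            acc_clean=0.80, acc_perturbed=0.72))
```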
[29] HY-MT1.5 Technical Report
Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, Di Wang
Main category: cs.CL
TL;DR: HY-MT1.5 introduces 1.8B and 7B parameter translation models that achieve state-of-the-art performance, rivaling much larger models and commercial APIs through a comprehensive multi-stage training framework.
Details
Motivation: To develop highly parameter-efficient machine translation models that can compete with and even surpass much larger open-source models and commercial translation APIs, while supporting advanced translation features.
Method: A holistic multi-stage training pipeline combining general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning.
Result: HY-MT1.5-1.8B outperforms significantly larger models (up to 72B parameters) and commercial APIs, achieving ~90% of Gemini-3.0-Pro’s performance. HY-MT1.5-7B achieves 95% of Gemini-3.0-Pro on Flores-200 and surpasses it on WMT25 and Mandarin-minority language benchmarks.
Conclusion: The HY-MT1.5 series provides highly competitive, robust translation solutions with exceptional parameter efficiency, supporting advanced features like terminology intervention and context-aware translation.
Abstract: In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model, demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro; while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro’s performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.
[30] Training a Huggingface Model on AWS Sagemaker (Without Tears)
Liling Tan
Main category: cs.CL
TL;DR: A demo paper providing centralized guidance for researchers to train Hugging Face models on AWS SageMaker, addressing cloud platform learning barriers.
Details
Motivation: LLM development is dominated by resource-rich groups, forcing researchers to use cloud services like AWS SageMaker, but the steep learning curve and fragmented documentation create barriers to adoption.
Method: The paper centralizes essential information into a comprehensive guide/demo that walks researchers through training their first Hugging Face model on AWS SageMaker from scratch (a minimal launch sketch follows the abstract).
Result: A practical resource that democratizes cloud adoption by providing the necessary knowledge in one place, eliminating the need to search for fragmented information across the web.
Conclusion: By creating a centralized guide, this demo paper lowers the barrier to cloud adoption for researchers, enabling more equitable access to LLM training resources and democratizing AI development.
Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.
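For flavor, a minimal launch of a Hugging Face training job on SageMaker looks roughly like the snippet below. The role ARN, bucket paths, training script, and framework versions are placeholders; supported version combinations should be checked against the current SageMaker Hugging Face Deep Learning Containers.

```python
# Minimal SageMaker Hugging Face training-job launch; all ids are placeholders.
from sagemaker.huggingface import HuggingFace

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder ARN

estimator = HuggingFace(
    entry_point="train.py",        # your local training script
    source_dir="./scripts",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.36",   # example version trio; verify support
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 3, "model_name_or_path": "distilbert-base-uncased"},
)
# Channel names become SM_CHANNEL_TRAIN / SM_CHANNEL_TEST inside the job.
estimator.fit({"train": "s3://my-bucket/train", "test": "s3://my-bucket/test"})
```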
[31] Activation Steering for Masked Diffusion Language Models
Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid
Main category: cs.CL
TL;DR: Activation-steering framework for masked diffusion language models enables efficient inference-time control using contrastive examples without simulating denoising trajectories.
Details
Motivation: Masked diffusion language models have shown competitive performance with autoregressive LLMs but lack effective mechanisms for inference-time control and steering.
Method: Compute layer-wise steering vectors from a single forward pass using contrastive examples, then apply these directions at every reverse-diffusion step for efficient control (sketched after the abstract).
Result: Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining steering effects across transformer sub-modules and token scope.
Conclusion: The framework provides an efficient inference-time control mechanism for MDLMs without requiring denoising trajectory simulation.
Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).
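A sketch of the recipe under mild assumptions about the model's interface: one forward pass over a contrastive pair yields a per-layer direction, and a forward hook adds that direction at the chosen layer; because the hook stays registered, every reverse-diffusion step is nudged the same way. The layer index and scale are illustrative.

```python
# Generic activation-steering sketch; model interface and scale are assumed.
import torch

@torch.no_grad()
def steering_vector(model, tok, positive: str, negative: str, layer: int):
    """Difference of mean hidden states for a contrastive prompt pair."""
    def mean_hidden(text):
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1)   # average over tokens
    return mean_hidden(positive) - mean_hidden(negative)  # (1, hidden_dim)

def add_steering_hook(block, vec, scale=4.0):
    """Add `scale * vec` to the block's output on every forward pass,
    i.e., on every reverse-diffusion step of the MDLM."""
    def hook(_module, _inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + scale * vec.to(h.dtype)
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return block.register_forward_hook(hook)
```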
[32] Large Emotional World Model
Changhao Song, Yazhou Zhang, Hui Gao, Chang Yang, Peng Zhang
Main category: cs.CL
TL;DR: The paper introduces LEWM, a Large Emotional World Model that incorporates emotional states alongside visual observations and actions to better predict emotion-driven social behaviors, addressing the gap in existing LLMs that lack systematic emotional modeling.
Details
Motivation: Existing Large Language Models primarily focus on physical-world regularities but lack systematic exploration of emotional factors, which are crucial for understanding human decision-making and social behaviors. The authors demonstrate that removing emotionally relevant information degrades reasoning performance.
Method: Proposed LEWM (Large Emotional World Model) with three key components: 1) construction of an Emotion-Why-How (EWH) dataset integrating emotion into causal relationships, 2) explicit modeling of emotional states alongside visual observations and actions, 3) enabling the model to predict both future states and emotional transitions.
Result: LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks. The model demonstrates improved capability in understanding emotional dynamics in world modeling.
Conclusion: Incorporating emotional factors into world models is essential for better understanding human social behaviors and decision-making. LEWM provides a framework for modeling emotional transitions alongside physical world dynamics, bridging an important gap in current world modeling approaches.
Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.
[33] Training Report of TeleChat3-MoE
Xinzhang Liu, Chao Wang, Zhihao Yang, Zhuo Jiang, Xuncheng Zhao, Haoran Wang, Lei Li, Dongdong He, Luobin Liu, Kaizhe Yuan, Han Gao, Zihan Wang, Yitong Yao, Sishi Xiong, Wenmin Deng, Haowei He, Kaidong Yu, Yu Zhao, Ruiyu Fang, Yuhao Jiang, Yingyan Li, Xiaohui Hu, Xi Yu, Jingqi Li, Yanwei Liu, Qingli Li, Xinyu Shi, Junhao Niu, Chengnuo Huang, Yao Xiao, Ruiwen Wang, Fengkai Li, Luwen Pu, Kaipeng Jia, Fubei Yao, Yuyao Huang, Xuewei He, Zhuoru Jiang, Ruiting Song, Rui Xue, Qiyi Xie, Jie Zhang, Zilu Huang, Zhaoxi Zhang, Zhilong Lu, Yanhan Zhang, Yin Zhang, Yanlei Xue, Zhu Yuan, Teng Su, Xin Jiang, Shuangyong Song, Yongxiang Li, Xuelong Li
Main category: cs.CL
TL;DR: TeleChat3-MoE is a new series of large language models with Mixture-of-Experts architecture (105B-1T+ parameters) trained on Ascend NPU clusters, with this report focusing on the training infrastructure that enables reliable and efficient scaling to frontier model sizes.
Details
Motivation: To develop reliable and efficient training infrastructure for scaling Mixture-of-Experts language models to frontier sizes (over 1 trillion parameters) on Ascend NPU hardware clusters, addressing challenges in numerical accuracy, performance optimization, and parallelization at large scale.
Method: Systematic methodologies for operator-level and end-to-end numerical accuracy verification; performance optimizations including interleaved pipeline scheduling, attention-aware data scheduling for long sequences, hierarchical/overlapped communication for expert parallelism, and DVM-based operator fusion; a systematic parallelization framework using analytical estimation and integer linear programming; cluster-level optimizations addressing host- and device-bound bottlenecks.
Result: The infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.
Conclusion: The presented training infrastructure enables reliable and efficient scaling of Mixture-of-Experts language models to frontier sizes on Ascend NPU clusters, with systematic approaches to numerical accuracy, performance optimization, and parallelization that achieve near-linear scaling on large-scale hardware deployments.
Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on Ascend NPU clusters. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.
[34] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring
Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun, Min Zhu
Main category: cs.CL
TL;DR: MedKGI is a clinical diagnostic framework that addresses LLM limitations by integrating medical knowledge graphs for grounded reasoning, information gain-based question selection for efficiency, and structured state tracking for coherence.
Details
Motivation: Current LLMs struggle with clinical diagnosis due to three critical issues: generating hallucinated medical content, asking redundant/inefficient questions, and losing coherence in multi-turn dialogues, preventing them from emulating real clinical diagnostic reasoning.
Method: MedKGI integrates a medical knowledge graph to constrain reasoning to validated ontologies, selects questions based on information gain to maximize diagnostic efficiency (a worked toy example follows the abstract), and uses an OSCE-format structured state to maintain consistent evidence tracking across dialogue turns.
Result: Experiments show MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy on clinical benchmarks.
Conclusion: MedKGI successfully addresses key LLM limitations in clinical diagnosis by grounding reasoning in verified knowledge, optimizing question selection, and maintaining dialogue coherence, making it a promising framework for clinical diagnostic applications.
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
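The information-gain criterion has a standard form: pick the question whose expected answer most reduces entropy over the diagnosis posterior. The worked toy example below uses fabricated probability tables for three candidate diagnoses; MedKGI's actual distributions come from its knowledge graph and dialogue state.

```python
# Expected information gain of a candidate question; probabilities are toy.
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, answer_likelihoods):
    """prior: P(disease); answer_likelihoods[a][d] = P(answer a | disease d)."""
    h0, gain = entropy(prior), 0.0
    for likes in answer_likelihoods:                 # each possible answer
        p_ans = sum(l * p for l, p in zip(likes, prior))
        if p_ans == 0:
            continue
        posterior = [l * p / p_ans for l, p in zip(likes, prior)]
        gain += p_ans * (h0 - entropy(posterior))    # weighted entropy drop
    return gain

prior = [0.5, 0.3, 0.2]                        # three candidate diagnoses
q_fever = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]]   # P(yes|d), P(no|d): discriminative
q_rash  = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]   # uninformative question
print(expected_information_gain(prior, q_fever))   # > 0: ask this one
print(expected_information_gain(prior, q_rash))    # ~ 0: skip
```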
[35] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring
May Bashendy, Walid Massoud, Sohaila Eltanbouly, Salam Albatarni, Marwan Sayed, Abrar Abir, Houda Bouamor, Tamer Elsayed
Main category: cs.CL
TL;DR: Introduces LAILA, the largest publicly available Arabic Automated Essay Scoring dataset with 7,859 essays annotated on seven dimensions, providing benchmark results for Arabic AES research.
Details
Motivation: Research on Arabic Automated Essay Scoring (AES) is limited due to the lack of publicly available datasets, creating a need for comprehensive Arabic AES resources.
Method: Created the LAILA dataset with 7,859 essays annotated with holistic and trait-specific scores across seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. Provided benchmark results using state-of-the-art Arabic and English models in both prompt-specific and cross-prompt settings.
Result: LAILA is now the largest publicly available Arabic AES dataset, filling a critical gap in Arabic AES research and supporting development of robust scoring systems.
Conclusion: LAILA addresses the critical need for publicly available Arabic AES datasets and will facilitate further research and development of Arabic essay scoring systems.
Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.
[36] Tracing the Flow of Knowledge From Science to Technology Using Deep Learning
Michael E. Rose, Mainak Ghosh, Sebastian Erhardt, Cheng Li, Erik Buunk, Dietmar Harhoff
Main category: cs.CL
TL;DR: Pat-SPECTER model outperforms other language similarity models in predicting credible patent-paper citations, with applications in separating and predicting patent-paper pairs, plus analysis of US citation patterns.
Details
Motivation: Need for a language similarity model that works effectively with both patents and scientific publications simultaneously, to better understand and predict connections between these two important knowledge domains.
Method: Developed Pat-SPECTER by fine-tuning the SPECTER2 model on patents, then conducted a horse race-style evaluation comparing eight language similarity models on their ability to predict credible Patent-Paper Citations.
Result: Pat-SPECTER performed best among all tested models. Demonstrated practical capabilities in two real-world scenarios: separating patent-paper-pairs and predicting patent-paper-pairs. Found evidence that US patents cite papers that are semantically less similar than in other jurisdictions, potentially due to duty of candor requirements.
Conclusion: Pat-SPECTER is an effective language similarity model for patent-paper analysis, with open availability for academic and practitioner use, and reveals interesting jurisdictional differences in citation patterns.
Abstract: We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we task eight language (similarity) models with predicting credible Patent-Paper Citations. We find that our Pat-SPECTER model, the SPECTER2 model fine-tuned on patents, performs best. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than those cited in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.
[37] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen
Main category: cs.CL
TL;DR: DATAMASK is an efficient joint learning framework for large-scale pre-training data selection that simultaneously optimizes quality and diversity metrics through policy gradient optimization, achieving significant performance gains with 10% data selection.
Details
Motivation: Current data selection methods for trillion-scale pre-training datasets either use quality metrics (which show diminishing returns) or diversity metrics (which remove valuable high-quality samples), limiting LLM capabilities. There's a need for an efficient joint optimization approach.
Method: DATAMASK treats data selection as a mask learning problem using policy gradient optimization: it iteratively samples data masks, computes policy gradients based on predefined objectives, and updates mask sampling logits (a toy REINFORCE sketch follows the abstract), with acceleration enhancements reducing selection time by 98.9% vs. greedy algorithms.
Result: Selected 10% subset (FineWeb-Mask) from 15 trillion-token FineWeb dataset; achieved 3.2% improvement on 1.5B dense model and 1.9% on 7B MoE model across 12 diverse tasks.
Conclusion: DATAMASK enables efficient joint optimization of quality and diversity metrics at trillion-scale, significantly improving model performance with minimal data selection while overcoming limitations of single-metric approaches.
Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs’ capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, it yields significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
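The mask-learning loop reduces to REINFORCE over Bernoulli selection variables. The toy sketch below scores sampled masks with a fabricated quality-plus-diversity objective and a budget term targeting 10% retention, then updates the sampling logits by policy gradient; none of the scores, scales, or the paper's acceleration tricks appear here.

```python
# Toy REINFORCE over Bernoulli data-selection masks; objective is fabricated.
import torch

n_docs = 1000
quality = torch.rand(n_docs)                                   # stand-in quality scores
emb = torch.nn.functional.normalize(torch.randn(n_docs, 8), dim=1)
logits = torch.zeros(n_docs, requires_grad=True)               # mask sampling logits
opt = torch.optim.Adam([logits], lr=0.05)
target = 0.1                                                   # keep ~10% of the data

def objective(mask):
    sel = mask.bool()
    if sel.sum() < 2:
        return torch.tensor(0.0)
    q = quality[sel].mean()                                    # quality term
    sims = emb[sel] @ emb[sel].T
    div = 1.0 - (sims.sum() - sel.sum()) / (sel.sum() * (sel.sum() - 1))  # diversity
    budget = -(mask.mean() - target).abs()                     # stay near the budget
    return q + div + budget

for step in range(200):
    dist = torch.distributions.Bernoulli(logits=logits)
    mask = dist.sample()
    reward = objective(mask)
    loss = -(reward.detach() * dist.log_prob(mask).sum())      # policy gradient
    opt.zero_grad()
    loss.backward()
    opt.step()
```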
[38] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs
Jonathan Schmoll, Adam Jatowt
Main category: cs.CL
TL;DR: First systematic evaluation of LLMs for EU Taxonomy compliance shows they can assist but not fully automate the process, with clear performance gaps between qualitative and quantitative tasks.
Details
Motivation: Manual EU Taxonomy compliance is resource-intensive, and while LLMs could automate it, research is hindered by a lack of public benchmark datasets.
Method: Created a novel structured dataset from 190 corporate reports with ground-truth economic activities and KPIs, then conducted a systematic evaluation of LLMs on the core compliance workflow.
Result: LLMs show moderate success in qualitative activity identification (improved with multi-step framework) but fail at quantitative KPI prediction. Paradoxically, concise metadata outperforms full reports, and confidence scores are poorly calibrated.
Conclusion: LLMs are not ready for full automation but can serve as powerful assistive tools for human experts. The dataset provides a public benchmark for future research.
Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.
[39] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking
Meiqi Chen, Fandong Meng, Jie Zhou
Main category: cs.CL
TL;DR: FIGR integrates visual thinking into reasoning via reinforcement learning, constructing visual representations to handle implicit spatial/structural relationships, outperforming text-only models on math reasoning benchmarks.
Details
Motivation: Complex reasoning problems involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. Purely text-based reasoning struggles to represent global structural constraints in complex settings.
Method: FIGR integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. It externalizes intermediate structural hypotheses by constructing visual representations during problem solving, and adaptively regulates when and how visual reasoning should be invoked.
Result: FIGR outperforms strong text-only chain-of-thought baselines, improving the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME benchmarks.
Conclusion: Figure-guided multimodal reasoning enhances the stability and reliability of complex reasoning by enabling more coherent reasoning over global structural properties that are difficult to capture from text alone.
Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.
[40] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs
Shupeng Li, Weipeng Lu, Linyun Liu, Chen Lin, Shaofei Li, Zhendong Tan, Hanjun Zhong, Yucheng Zeng, Chenghao Zhu, Mengyue Liu, Daxiang Dong, Jianmin Wu, Yunting Xiao, Annan Li, Danyu Liu, Jingnan Zhang, Licen Liu, Dawei Yin, Dou Shen
Main category: cs.CL
TL;DR: QianfanHuijin is a financial LLM with a multi-stage training paradigm combining knowledge enhancement, reasoning, and agentic capabilities, achieving superior performance on financial benchmarks.
Details
Motivation: Existing financial LLMs focus mainly on knowledge enhancement, but complex financial services require models with robust reasoning and agentic capabilities beyond just domain knowledge.
Method: Multi-stage training: 1) Continual Pre-training on financial corpora, 2) a fine-grained Post-training pipeline with Financial SFT → Finance Reasoning RL → Finance Agentic RL → General RL aligned with real-world business scenarios.
Result: QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Ablation studies confirm Reasoning RL and Agentic RL stages yield significant gains in respective capabilities.
Conclusion: The fine-grained, progressive post-training methodology is validated and poised to become a mainstream paradigm for various industrial-enhanced LLMs.
Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.
[41] World model inspired sarcasm reasoning with large language model agents
Keito Inoshita, Shinnosuke Mizuno
Main category: cs.CL
TL;DR: WM-SAR reformulates sarcasm understanding as world model reasoning, decomposing literal meaning, context, normative expectation, and intention into specialized LLM agents, then computes inconsistency and intention scores for interpretable sarcasm detection.
Details
Motivation: Existing sarcasm detection methods rely on black-box predictions without structural explanations of cognitive factors. Sarcasm involves a mismatch between semantic evaluation and normative expectations/intentions, but frameworks explicitly modeling these components are limited.
Method: Proposes WM-SAR (World Model inspired SArcasm Reasoning) with specialized LLM-based agents for literal meaning, context, normative expectation, and intention. Computes a deterministic inconsistency score between literal evaluation and normative expectation, plus an intention score, integrated via a lightweight Logistic Regression for the final sarcasm probability (a toy decision-layer sketch follows the abstract).
Result: WM-SAR consistently outperforms existing deep learning and LLM-based methods on sarcasm detection benchmarks. Ablation studies show integrating semantic inconsistency and intention reasoning is essential for effective detection, achieving both strong performance and high interpretability.
Conclusion: The world model inspired reasoning approach successfully decomposes sarcasm understanding into interpretable components, leveraging LLM reasoning while maintaining transparent decision structure, addressing both performance and explainability gaps in sarcasm detection.
Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker’s intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
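The decision layer is deliberately simple: two scalar features into a logistic regression. The sketch below fabricates (inconsistency, intention) feature pairs and fits that lightweight classifier, with a toy inconsistency function standing in for the upstream agents' outputs; the valence scales and training pairs are assumptions.

```python
# Toy WM-SAR decision layer; agent outputs and labels are fabricated.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inconsistency(literal_valence: float, expected_valence: float) -> float:
    """Gap between the utterance's literal evaluation and the normative
    expectation for the situation, both assumed to lie in [-1, 1]."""
    return abs(literal_valence - expected_valence) / 2.0

# (inconsistency, intention) pairs from the upstream agents, with labels
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1], [0.7, 0.9], [0.0, 0.3]])
y = np.array([1, 1, 0, 0, 1, 0])
clf = LogisticRegression().fit(X, y)

# "What a lovely day" (literal valence 0.9) amid a disaster (expected -0.8):
print(clf.predict_proba([[inconsistency(0.9, -0.8), 0.85]])[:, 1])  # likely sarcastic
```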
[42] Skim-Aware Contrastive Learning for Efficient Document Representation
Waheed Ahmed Abro, Zied Bouraoui
Main category: cs.CL
TL;DR: New self-supervised contrastive learning framework for long document representation that mimics human skimming by masking sections and using NLI-based contrastive objectives.
Details
Motivation: Existing transformer models struggle with long document representation in specialized fields like law and medicine. Sparse attention is resource-intensive, hierarchical transformers lack explainability, while humans effectively skim texts by focusing on important sections.
Method: Self-supervised contrastive learning framework that randomly masks document sections and uses natural language inference (NLI)-based contrastive objectives to align masked sections with relevant parts while distancing from unrelated ones, mimicking human information synthesis.
Result: Experiments on legal and biomedical texts show significant improvements in both accuracy and computational efficiency compared to existing approaches.
Conclusion: The human-inspired approach of selective attention through section masking and contrastive learning provides an effective and efficient solution for long document representation in specialized domains.
Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.
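As a rough illustration of the masking-and-contrast idea, the sketch below uses a standard InfoNCE-style loss to pull a masked section's embedding toward a relevant section and away from unrelated ones. The paper's objective is NLI-based; this generic contrastive form and the embedding shapes are assumptions.

```python
# InfoNCE-style stand-in for the paper's NLI-based contrastive objective.
import torch
import torch.nn.functional as F

def skim_contrastive_loss(masked_emb, relevant_emb, unrelated_embs, tau=0.07):
    """masked_emb, relevant_emb: (B, d); unrelated_embs: (B, K, d)."""
    pos = F.cosine_similarity(masked_emb, relevant_emb, dim=-1) / tau                 # (B,)
    neg = F.cosine_similarity(masked_emb.unsqueeze(1), unrelated_embs, dim=-1) / tau  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # positive sits at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = skim_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 8, 256))
```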
[43] Comparing Approaches to Automatic Summarization in Less-Resourced Languages
Chester Palen-Michel, Constantine Lignos
Main category: cs.CL
TL;DR: Comparison of text summarization approaches for less-resourced languages shows multilingual fine-tuned mT5 outperforms most methods including zero-shot LLMs, while LLM-as-judge evaluation may be unreliable for these languages.
Details
Motivation: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages, creating a research gap.
Method: Compares multiple approaches: zero-shot prompting of various sized LLMs, fine-tuning smaller models (mT5) with/without data augmentation and multilingual transfer, and an LLM translation pipeline (translate-summarize-translate back). Evaluates with five different metrics.
Result: Multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics. Variation exists across LLMs in performance across similar parameter sizes. LLM as judge may be less reliable on less-resourced languages.
Conclusion: For less-resourced language summarization, multilingual fine-tuning of models like mT5 is more effective than zero-shot LLM approaches, and evaluation methods need careful consideration for these languages.
Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of approaches to summarization, from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing, and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.
[44] Cleaning English Abstracts of Scientific Publications
Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt
Main category: cs.CL
TL;DR: A language model for cleaning scientific abstracts by removing extraneous information like copyright statements and metadata to improve downstream text analysis.
Details
Motivation: Scientific abstracts often contain extraneous information (copyright statements, section headings, author notes, registrations, bibliometric metadata) that distorts downstream analyses like document similarity and textual embeddings.
Method: Developed an open-source, easy-to-integrate language model designed to automatically identify and remove clutter from English-language scientific abstracts.
Result: The model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves information content of standard-length embeddings.
Conclusion: The introduced model effectively cleans scientific abstracts, enhancing the quality of text-based analyses by removing distorting extraneous information.
Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata, that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves information content of standard-length embeddings.
[45] IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback
Titas Ramancauskas, Kotryna Ramancauske
Main category: cs.CL
TL;DR: A revision platform for IELTS writing exam preparation using transformer-based AES with adaptive feedback, showing significant score improvements but works best as supplement to human instruction.
Details
Motivation: Traditional IELTS preparation lacks personalized feedback tailored to the IELTS writing rubric, creating a need for automated systems that provide targeted feedback to candidates.
Method: Design-Based Research (DBR) cycles progressing from rule-based to transformer-based (DistilBERT with regression head) Automated Essay Scoring, with platform architecture separating conversational guidance from writing interface to reduce cognitive load.
Result: Transformer model achieved MAE of 0.66 and positive R², enabling adaptive feedback that demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen’s d = 0.504), though effectiveness varied by revision strategy.
Conclusion: Automated feedback functions best as supplement to human instruction, with conservative surface-level corrections more reliable than aggressive structural interventions for IELTS preparation; challenges remain in assessing higher-band essays.
Abstract: This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback catered to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative Design-Based Research (DBR) cycles, the study progressed from rule-based scoring to a transformer-based model with a regression head, coupled with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5’s adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen’s d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback is best suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.
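For readers unfamiliar with the scoring setup, here is a minimal sketch of a DistilBERT encoder with a regression head mapping an essay to a continuous band score, as Cycle 4 describes. The pooling choice and head size are assumptions; the abstract specifies only the backbone and the regression head.

```python
# Sketch of a DistilBERT-based AES model: encoder + linear regression head.
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class BandScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.head = nn.Linear(self.encoder.config.dim, 1)  # regression head

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]                 # first-token pooling (an assumption)
        return self.head(pooled).squeeze(-1)  # predicted band score

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
batch = tokenizer(["Sample IELTS Task 2 essay ..."], return_tensors="pt", truncation=True)
score = BandScorer()(batch["input_ids"], batch["attention_mask"])
```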
[46] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech
Fabian Retkowski, Alexander Waibel
Main category: cs.CL
TL;DR: This paper introduces paragraph segmentation for speech transcripts, creates two new benchmarks (TEDPara and YTSegPara), proposes a constrained-decoding method for LLMs to insert paragraph breaks, and develops MiniSeg model for state-of-the-art paragraph and chapter segmentation.
Details
Motivation: Automatic speech transcripts are unstructured word streams that hinder readability and repurposing. Paragraph segmentation is missing as a structuring step in speech processing, and there's a lack of robust benchmarks for this task in the speech domain.
Method: 1) Created TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as first benchmarks for paragraph segmentation in speech. 2) Proposed constrained-decoding formulation for LLMs to insert paragraph breaks while preserving original transcripts. 3) Developed MiniSeg, a compact model that achieves state-of-the-art accuracy and can be extended hierarchically for joint chapter and paragraph prediction.
Result: Established first benchmarks for paragraph segmentation in speech domain. MiniSeg attains state-of-the-art accuracy for paragraph segmentation and can jointly predict chapters and paragraphs with minimal computational cost. The constrained-decoding method enables faithful, sentence-aligned evaluation.
Conclusion: The paper establishes paragraph segmentation as a standardized, practical task in speech processing through new benchmarks, methods, and models that address the gap between speech processing and text segmentation.
Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
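The constrained-decoding formulation can be pictured as a decoder whose only actions are to copy the next transcript sentence verbatim or to emit a paragraph break, so the transcript is never rewritten. The greedy threshold rule below is a simplified stand-in; `break_prob` is a hypothetical scorer, not the paper's model.

```python
# Toy segmenter: copy sentences verbatim, optionally insert a paragraph break.
def segment(sentences, break_prob, threshold=0.5):
    """break_prob(prefix) -> probability that a paragraph ends after `prefix`."""
    paragraphs, current = [], []
    for sent in sentences:
        current.append(sent)                  # the transcript itself is preserved
        if break_prob(current) >= threshold:  # the only other allowed action
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

# Break after every two sentences, just to exercise the function.
print(segment(["A.", "B.", "C."], lambda prefix: float(len(prefix) >= 2)))
```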
[47] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs
Muhammad Abdullahi Said, Muhammad Sammani Sani
Main category: cs.CL
TL;DR: LLM safety alignment doesn’t transfer zero-shot from English to other languages; HausaSafety dataset reveals Complex Interference where safety depends on language-temporal interactions, not simple degradation.
Details
Motivation: The dangerous assumption that safety alignment transfers zero-shot from English to other languages in LLMs integrated into critical global infrastructure, particularly exposing Global South users to localized harms.
Method: Systematic audit of GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus using HausaSafety, a novel adversarial dataset with West African threat scenarios. 2x4 factorial design across 1,440 evaluations testing non-linear interaction between language (English vs. Hausa) and temporal framing.
Result: Found Complex Interference mechanism instead of simple degradation: Claude 4.5 Opus safer in Hausa (45.0%) than English (36.7%) due to uncertainty-driven refusal, but catastrophic failures in temporal reasoning. Profound Temporal Asymmetry: past-tense framing bypassed defenses (15.6% safe) while future-tense triggered hyper-conservative refusals (57.2% safe). 9.2x disparity between safest and most vulnerable configurations.
Conclusion: Current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that expose Global South users. Propose Invariant Alignment as necessary paradigm shift for safety stability across linguistic and temporal shifts.
Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a reverse linguistic pattern, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
[48] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering
Chaodong Tong, Qi Zhang, Jiayang Gao, Lei Jiang, Yanbing Liu, Nannan Sun
Main category: cs.CL
TL;DR: HaluNet is a lightweight neural framework that integrates token-level probability uncertainty with semantic representation uncertainty for efficient one-pass hallucination detection in LLM-based QA systems.
Details
Motivation: LLMs often generate hallucinations (factual errors or fabricated content) in QA tasks. Existing hallucination detection methods focus on single uncertainty types and overlook the complementarity between token-level probability uncertainty and internal semantic representation uncertainty, which provide complementary views on model reliability.
Method: HaluNet uses a lightweight, trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. It features a multi-branch architecture that adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection.
Result: Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection.
Conclusion: HaluNet demonstrates that integrating complementary uncertainty sources (probability and semantic representations) enables effective and efficient hallucination detection in LLM-based QA systems, offering a scalable solution that doesn’t rely on external resources.
Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.
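The two uncertainty families HaluNet fuses can be made concrete in a few lines: per-token log-probability as a confidence signal and per-token entropy as a distributional signal, later combined with semantic embeddings. The exact feature set is an assumption based on the abstract's description; the fusion network itself is omitted.

```python
# Token-level uncertainty features: chosen-token log-prob and entropy.
import torch
import torch.nn.functional as F

def token_uncertainty_features(logits, token_ids):
    """logits: (T, V) decoder logits; token_ids: (T,) generated token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)  # confidence
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)             # distributional
    return torch.stack([chosen, entropy], dim=-1)                    # (T, 2)

feats = token_uncertainty_features(torch.randn(12, 32000), torch.randint(0, 32000, (12,)))
```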
[49] Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs’ Legal Reasoning Capabilities
Hongseok Oh, Wonseok Hwang, Kyoung-Woon On
Main category: cs.CL
TL;DR: KCL is a Korean legal reasoning benchmark that separates reasoning ability from domain knowledge by providing question-level supporting precedents, with MCQA and Essay components for comprehensive evaluation.
Details
Motivation: To create a benchmark that assesses language models' legal reasoning capabilities independently of domain-specific knowledge, enabling faithful disentanglement of reasoning ability from parameterized knowledge.
Method: Developed KCL with two components: (1) KCL-MCQA, 283 multiple-choice questions with 1,103 aligned precedents, and (2) KCL-Essay, 169 open-ended generation questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation.
Result: Evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform general-purpose counterparts.
Conclusion: KCL provides a valuable resource for assessing legal reasoning capabilities, revealing significant challenges in legal reasoning tasks and demonstrating the advantage of reasoning-specialized models over general-purpose ones.
Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models’ legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.
[50] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang, Pragaash Ponnusamy, Harikaran Subbaraj, Jue Wang, Shuaiwen Leon Song, Ben Athiwaratkun
Main category: cs.CL
TL;DR: CREST is a training-free method that steers LLM reasoning by identifying and suppressing inefficient cognitive behaviors in attention heads, improving accuracy while reducing token usage.
Details
Motivation: Current LLMs using chain-of-thought reasoning suffer from inefficiencies: high latency from excessive token generation, and unstable reasoning that alternates between underthinking (shallow steps) and overthinking (repetitive reasoning).
Method: CREST has two components: (1) offline calibration to identify cognitive heads (correlated with behaviors like verification/backtracking) and derive head-specific steering vectors, and (2) inference-time procedure that rotates hidden representations to suppress components along those vectors.
Result: Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%.
Conclusion: CREST offers a simple and effective training-free pathway to faster, more reliable LLM reasoning by adaptively suppressing unproductive reasoning behaviors.
Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
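The steering operation can be sketched simply: remove (or damp) the component of a hidden state along a head-specific steering vector. The paper describes a rotation of hidden representations; the plain projection removal below is a common simplification and an assumption.

```python
# Suppress the activation component along a steering vector.
import torch

def suppress_direction(hidden, v, alpha=1.0):
    """hidden: (..., d) activations; v: (d,) steering vector; alpha in [0, 1]."""
    v = v / v.norm()
    coeff = (hidden @ v).unsqueeze(-1)  # scalar component along v
    return hidden - alpha * coeff * v   # alpha=1 removes it entirely

h = torch.randn(2, 16, 768)             # (batch, seq, d_model)
h_steered = suppress_direction(h, torch.randn(768))
```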
[51] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan
Main category: cs.CL
TL;DR: Youtu-LLM is a 1.96B parameter lightweight language model pre-trained from scratch with native agentic intelligence, featuring long-context support (128k), specialized STEM vocabulary, and a progressive curriculum from commonsense to complex tasks.
Details
Motivation: To create a lightweight language model that doesn't rely on distillation but has intrinsic agentic capabilities, addressing the gap where small models typically lack strong reasoning and planning abilities needed for complex tasks.
Method: Three key technical advancements: 1) Compact MLA architecture with STEM-oriented vocabulary and 128k context window, 2) Multi-stage curriculum training on 11T tokens progressing from commonsense to STEM to agentic tasks, 3) Scalable agentic mid-training with diverse trajectory synthesis for math, coding, and tool-use domains.
Result: Youtu-LLM sets new SOTA for sub-2B LLMs, achieving competitive performance on general benchmarks against larger models and significantly surpassing existing baselines on agent-specific tasks.
Conclusion: Lightweight models can possess strong intrinsic agentic capabilities when systematically trained with appropriate architecture and curriculum, challenging the assumption that small models must rely on distillation from larger ones.
Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled “Commonsense-STEM-Agent” Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.
[52] Do Large Language Models Know What They Are Capable Of?
Casey O. Barkan, Sid Black, Oliver Sourbut
Main category: cs.CL
TL;DR: LLMs are overconfident in predicting their task success, with overconfidence worsening during multi-step tasks, but some can learn from failure experiences to improve decision-making.
Details
Motivation: To investigate whether LLMs can accurately predict their own task success, whether their predictions improve during multi-step tasks, and whether they can learn from in-context failure experiences to make better decisions in costly-failure scenarios.
Method: Tested multiple LLMs on their ability to predict task success, examined prediction changes during multi-step agentic tasks, and evaluated whether in-context failure experiences improve decision-making about pursuing costly tasks.
Result: All LLMs were overconfident but had better-than-random discriminatory power. Newer/larger models didn’t show greater discriminatory power (except Claude). Overconfidence worsened during multi-step tasks. Some LLMs reduced overconfidence with failure experiences, improving decision-making, while others didn’t. LLMs’ decisions were rational given their estimates, but overly-optimistic estimates led to poor decisions.
Conclusion: Current LLM agents lack awareness of their own capabilities, hindering their performance. This has implications for AI misuse and misalignment risks, suggesting that improving LLMs’ self-awareness could enhance their decision-making and safety.
Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs’ decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs’ awareness of their capabilities for AI misuse and misalignment risks.
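The paper's observation that decisions are approximately rational given the models' own success estimates reduces to a simple expected-value rule, sketched below with illustrative payoffs: an overconfident estimate flips the decision even though the rule itself is sound.

```python
# Expected-value decision rule with illustrative payoffs.
def should_attempt(p_success, reward, failure_cost):
    return p_success * reward - (1 - p_success) * failure_cost > 0

print(should_attempt(0.8, 10, 20))  # True:  overconfident estimate, EV = +4
print(should_attempt(0.4, 10, 20))  # False: the true odds give EV = -8
```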
[53] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory
Maoyuan Li, Zhongsheng Wang, Haoyuan Li, Jiamou Liu
Main category: cs.CL
TL;DR: R-Debater is an agentic framework for multi-turn debates that uses argumentative memory to recall and adapt prior arguments, maintaining stance consistency and supporting claims with evidence.
Details
Motivation: To create more coherent and consistent multi-turn debates by grounding them in rhetoric and memory studies, addressing the need for stance consistency, evidence-based arguments, and adaptive responses to opponents across debate turns.
Method: Integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. Uses retrieval grounding with structured planning.
Result: Achieves higher single-turn and multi-turn scores compared to strong LLM baselines on ORCHID debates. Human evaluation with 20 experienced debaters confirms better consistency and evidence use.
Conclusion: Combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns, demonstrating the value of argumentative memory in debate systems.
Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.
[54] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models
Wenzhe Li, Shujian Zhang, Wenxuan Zhou, John Lambert, Chi Jin, Andrew Hard, Rajiv Mathews, Lun Wang
Main category: cs.CL
TL;DR: MUSIC is an unsupervised data augmentation method that creates multi-turn contrastive conversation pairs to train better multi-turn reward models for LLM evaluation, outperforming baselines while maintaining single-turn performance.
Details
Motivation: Multi-turn conversation evaluation is crucial for LLM development but expensive with human evaluation. Current multi-turn reward models lack effective automated evaluation methods, and standard preference datasets focusing only on final turns fail to capture multi-turn nuances.
Method: Proposes MUSIC (Multi-Step Instruction Contrast), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs with differences across multiple turns. Applied to Skywork preference dataset to train a multi-turn RM based on Gemma-2-9B-Instruct.
Result: MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, without compromising performance on standard single-turn RM benchmarks.
Conclusion: Incorporating multi-turn contrasts is critical for building robust multi-turn reward models, and MUSIC provides an effective unsupervised approach to enhance multi-turn evaluation capabilities while maintaining single-turn performance.
Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.
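A rough sketch of the augmentation idea: starting from one conversation, build a rejected variant by degrading assistant responses at several turns, so the preference contrast spans multiple turns rather than only the last. `degrade` is a hypothetical perturbation (e.g., an off-instruction rewrite by an LLM); the paper's actual synthesis procedure may differ.

```python
# Build a multi-turn contrastive pair by perturbing several assistant turns.
import random

def music_style_pair(conversation, degrade, n_turns=2):
    """conversation: list of (user, assistant) turns."""
    chosen = list(conversation)
    rejected = list(conversation)
    for i in random.sample(range(len(conversation)), k=min(n_turns, len(conversation))):
        user, assistant = rejected[i]
        rejected[i] = (user, degrade(assistant))  # contrast introduced at turn i
    return chosen, rejected

pair = music_style_pair([("hi", "hello"), ("help?", "sure")], lambda s: s + " ...unrelated tangent")
```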
[55] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature
Sibo Wei, Peng Chen, Lifeng Dong, Yin Luo, Lei Wang, Peng Zhang, Wenpeng Lu, Jianbin Guo, Hongjun Yang, Dajun Zeng
Main category: cs.CL
TL;DR: BIOME-Bench is a new benchmark for evaluating LLMs on multi-omics pathway analysis, addressing limitations of current pathway enrichment methods and enabling standardized assessment of biomolecular interaction inference and pathway mechanism elucidation.
Details
Motivation: Current pathway enrichment methods have structural limitations including curation lag, functional redundancy, and limited sensitivity to molecular states. Existing LLM evaluations for multi-omics analysis lack standardized benchmarks, relying on small curated datasets or ad hoc case studies, hindering reproducible progress.
Method: Developed BIOME-Bench through a rigorous four-stage workflow to evaluate two core LLM capabilities: 1) Biomolecular Interaction Inference, and 2) end-to-end Multi-Omics Pathway Mechanism Elucidation. Created evaluation protocols for both tasks and conducted comprehensive experiments across multiple contemporary LLMs.
Result: Experimental results show existing models exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and generate faithful, robust pathway-level mechanistic explanations.
Conclusion: BIOME-Bench addresses the critical need for standardized evaluation in multi-omics pathway analysis, revealing significant gaps in current LLM capabilities and providing a foundation for reproducible progress in this important biomedical domain.
Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.
[56] Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection
Mohammad Zia Ur Rehman, Velpuru Navya, Sanskar, Shuja Uddin Qureshi, Nagendra Kumar
Main category: cs.CL
TL;DR: Semi-SMDNet: A semi-supervised multilingual depression detection network using teacher-student pseudo-labeling, ensemble learning, and data augmentation to overcome language style variations and limited annotated data across languages.
Details
Motivation: Depression detection from social media text is challenging due to different language styles, informal expressions, and lack of annotated data in many languages, creating a need for scalable cross-language solutions with limited labeled resources.
Method: Proposes Semi-SMDNet combining teacher-student pseudo-labeling, ensemble learning, and data augmentation. Uses multiple teacher models with soft voting, uncertainty-based threshold filtering to remove low-confidence pseudo-labels, and confidence-weighted training focusing on reliable samples.
Result: Outperforms strong baselines on Arabic, Bangla, English, and Spanish datasets, significantly reducing performance gap between resource-rich and resource-poor settings, demonstrating effectiveness across various situations.
Conclusion: The framework is effective for scalable cross-language mental health monitoring where labeled resources are limited, showing robustness across languages and suitability for real-world applications with limited annotation.
Abstract: Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.
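The teacher-side mechanics described above admit a compact sketch: average the teachers' class probabilities (soft voting), drop pseudo-labels whose ensemble confidence falls below a threshold, and keep the confidence for sample weighting. The threshold value is an assumption.

```python
# Soft voting + uncertainty-thresholded pseudo-labelling.
import numpy as np

def pseudo_label(teacher_probs, threshold=0.9):
    """teacher_probs: (n_teachers, n_samples, n_classes) softmax outputs."""
    avg = teacher_probs.mean(axis=0)  # soft voting across the teacher ensemble
    conf = avg.max(axis=1)            # ensemble confidence per sample
    keep = conf >= threshold          # filter low-confidence pseudo-labels
    return avg.argmax(axis=1)[keep], conf[keep], keep  # conf weights the loss

probs = np.random.dirichlet(np.ones(2), size=(3, 100))  # 3 teachers, 100 samples
labels, weights, mask = pseudo_label(probs)
```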
[57] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models
Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi
Main category: cs.CL
TL;DR: This paper evaluates LLMs with test-time compute awareness, finding MoE architectures balance performance/efficiency best, identifying compute saturation points where extended reasoning yields diminishing returns.
Details
Motivation: Current literature overlooks the computational burden of generating long reasoning sequences in LLMs. Industrial applications require balancing accuracy with resource constraints and inference costs, not just raw performance.
Method: Conducted test-time-compute aware evaluation of contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Analyzed Mixture of Experts (MoE) architecture and traced Pareto efficiency trajectory over time.
Result: MoE architecture emerges as strong candidate for balancing performance and efficiency. Identified compute saturation point where accuracy gains diminish beyond certain threshold. Derived emergent trend regarding accuracy gain per unit of compute.
Conclusion: While extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities. There’s a saturation point for inference-time compute beyond which accuracy gains diminish significantly.
Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
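Extracting a compute-accuracy Pareto frontier from benchmark measurements is straightforward; the sketch below keeps each model that improves accuracy over every cheaper alternative. Using generated-token count as the compute cost is an assumption for illustration.

```python
# Keep only models not dominated by a cheaper, more accurate alternative.
def pareto_frontier(points):
    """points: iterable of (compute_cost, accuracy) pairs."""
    frontier, best_acc = [], float("-inf")
    for cost, acc in sorted(points):  # ascending compute cost
        if acc > best_acc:            # strict accuracy improvement required
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

models = [(1.0, 0.62), (2.5, 0.71), (3.0, 0.69), (6.0, 0.74)]
print(pareto_frontier(models))  # [(1.0, 0.62), (2.5, 0.71), (6.0, 0.74)]
```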
[58] Practising responsibility: Ethics in NLP as a hands-on course
Malvina Nissim, Viviana Patti, Beatrice Savoldi
Main category: cs.CL
TL;DR: A course on Ethical Aspects in NLP that uses active learning methods to integrate ethics into NLP education, refined over four years across different institutions and producing reusable teaching materials.
Details
Motivation: As NLP systems become more pervasive, there's a growing need to integrate ethical considerations into NLP education. However, curriculum development faces challenges due to the field's rapid evolution and the need to foster critical thinking beyond traditional technical training.
Method: The course uses a pedagogical approach grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. It has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds over four years.
Result: The course has yielded many reusable products, both in the form of teaching materials and actual educational products aimed at diverse audiences, created by the students themselves. It has been successfully implemented across various educational contexts.
Conclusion: By sharing their approach and experience, the authors hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula, demonstrating a successful model for integrating ethics into NLP education.
Abstract: As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field’s rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.
[59] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
Yanan Long
Main category: cs.CL
TL;DR: The paper introduces “triangulation” - a causal standard for evaluating mechanistic explanations in multilingual language models that requires necessity, sufficiency, and invariance across language variants.
Details
Motivation: Multilingual language models show unpredictable behavior across languages, scripts, and cultures, and current mechanistic explanations lack causal rigor and cross-lingual validation.
Method: Formalizes “reference families” as predicate-preserving variants, introduces “triangulation” acceptance rule (necessity, sufficiency, invariance), uses automatic circuit discovery with triangulation testing, and grounds in causal abstraction theory.
Result: Triangulation provides a falsifiable standard that filters spurious circuits that pass single-environment tests but fail cross-lingual invariance, enabling more robust mechanistic explanations.
Conclusion: The triangulation framework establishes a causal standard for mechanistic interpretability that ensures explanations survive interventions and cross-reference across language variants, advancing pragmatic interpretability research.
Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
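The acceptance rule itself is easy to state operationally. The sketch below accepts a candidate circuit only if ablation degrades the behavior (necessity), activation patching transfers it (sufficiency), and both effects hold with sufficient magnitude across every variant in the reference family (invariance). The effect-size threshold is an assumption.

```python
# Triangulation as a boolean acceptance rule over a reference family.
def triangulate(effects, tau=0.1):
    """effects: per-variant dicts with 'ablation_drop' and 'patch_gain'."""
    return all(
        e["ablation_drop"] >= tau and e["patch_gain"] >= tau  # right direction, enough magnitude
        for e in effects
    )

family = [
    {"ablation_drop": 0.30, "patch_gain": 0.20},  # e.g., English prompt
    {"ablation_drop": 0.25, "patch_gain": 0.15},  # predicate-preserving variant
]
print(triangulate(family))  # True: the circuit survives the whole family
```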
[60] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI
Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, Jisun An, Ponnurangam Kumaraguru
Main category: cs.CL
TL;DR: PrivacyBench benchmark reveals RAG assistants leak user secrets in up to 26.56% of interactions, highlighting critical privacy risks in personalized AI systems.
Details
Motivation: Personalized AI agents access sensitive user data (emails, chats, purchase histories), creating fundamental privacy risks. Systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being and making current architectures unsafe for wide-scale deployment.
Method: Introduced PrivacyBench benchmark with socially grounded datasets containing embedded secrets and multi-turn conversational evaluation to measure secret preservation. Tested Retrieval-Augmented Generation (RAG) assistants and evaluated privacy-aware prompting as a mitigation strategy.
Result: RAG assistants leak secrets in up to 26.56% of interactions. Privacy-aware prompting reduces leakage to 5.12%, but retrieval mechanisms continue to access sensitive data indiscriminately, creating a single point of failure where privacy preservation burden falls entirely on the generator.
Conclusion: Current AI architectures are unsafe for wide-scale deployment due to fundamental privacy vulnerabilities. There’s an urgent need for structural, privacy-by-design safeguards rather than partial mitigations like prompting, to ensure ethical and inclusive web systems.
Abstract: Personalized AI agents rely on access to a user’s digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.
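The headline metric is simple to compute: the fraction of conversations in which any assistant turn reveals an embedded secret. The substring check below is a simplification; the benchmark's actual leakage judgment may be more nuanced.

```python
# Fraction of conversations in which a secret appears in any response.
def leakage_rate(conversations, secrets):
    """conversations: {conv_id: [assistant responses]}; secrets: {conv_id: [strings]}."""
    leaked = sum(
        1 for cid, responses in conversations.items()
        if any(s in r for r in responses for s in secrets[cid])
    )
    return leaked / len(conversations)

convs = {"u1": ["Your meeting is at 3pm.", "Alice is planning a surprise party!"]}
print(leakage_rate(convs, {"u1": ["surprise party"]}))  # 1.0 -> secret leaked
```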
[61] Big AI is accelerating the metacrisis: What can we do?
Steven Bird
Main category: cs.CL
TL;DR: The paper critiques how big AI and language engineers are accelerating ecological, meaning, and language crises, calling for alternatives focused on human flourishing and planetary wellbeing.
Details
Motivation: The world faces converging ecological, meaning, and language crises (metacrisis), exacerbated by big AI. Language engineers are complicit by supporting harmful scalability narratives, enabling plutocrats/kleptocrats, and treating technology as value-neutral, despite the urgent need for alternatives.
Method: The paper appears to be a critical analysis and call to action rather than presenting a specific technical method. It advocates for exploring alternatives through collective intelligence to redesign NLP’s future direction.
Result: The analysis reveals how current NLP/AI practices contribute to multiple crises and calls for a fundamental reorientation of the field toward life-affirming values and human flourishing on a living planet.
Conclusion: NLP must urgently shift from its current trajectory that accelerates crises toward designing alternatives centered on human flourishing and planetary wellbeing, requiring collective intelligence and value-conscious technological development.
Abstract: The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.
[62] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements
Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou, Yuchen Wu, Yinzhu Piao, Denghui Cao, Tong Sun, Ziniu Li, Li Du, Bo Lei, Jiaheng Liu, Chenghua Lin, Zhaoxiang Zhang, Wenhao Huang, Jiajun Zhang
Main category: cs.CL
TL;DR: Encyclo-K is a statement-based benchmark that uses knowledge statements from textbooks as the unit of curation, dynamically composing them into evaluation questions to address data contamination, enable multi-knowledge assessment, and reduce annotation costs.
Details
Motivation: Existing LLM benchmarks have three key limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. The authors aim to create a more robust, comprehensive, and scalable evaluation framework.
Method: Extract standalone knowledge statements from authoritative textbooks, then dynamically compose them into evaluation questions through random sampling at test time. Each question aggregates 8-10 statements for comprehensive assessment. Annotators only verify formatting compliance without requiring domain expertise.
Result: Experiments on 50+ LLMs show strong discriminative power: top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy. Reasoning models range from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. Model rankings remain stable across dynamically generated question sets.
Conclusion: Encyclo-K successfully addresses the three limitations of existing benchmarks and establishes a scalable framework for dynamic evaluation of LLMs’ comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs’ comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
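The test-time composition step can be pictured as sampling statements from the pool and assembling them into one multi-statement question. In the sketch below, some statements are corrupted so the question has verifiable answers; the corruption step and question template are assumptions, since the abstract only specifies random sampling and 8-10 statements per question.

```python
# Dynamically compose a multi-statement question from a statement pool.
import random

def compose_question(statements, corrupt, k=9, n_false=3):
    pool = random.sample(statements, k)  # fresh sample at test time
    answers = [True] * k
    for i in random.sample(range(k), n_false):
        pool[i] = corrupt(pool[i])       # hypothetical corruption step
        answers[i] = False
    prompt = "Which of the following statements are correct?\n" + "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(pool)
    )
    return prompt, answers
```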
[63] mHC: Manifold-Constrained Hyper-Connections
Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang
Main category: cs.CL
TL;DR: mHC restores identity mapping in Hyper-Connections by projecting residual space onto a manifold, solving training instability and memory overhead while improving scalability.
Details
Motivation: Hyper-Connections (HC) improve performance but lose the identity mapping property of residual connections, causing training instability, restricted scalability, and memory access overhead.
Method: Manifold-Constrained Hyper-Connections (mHC) projects the residual connection space of HC onto a specific manifold to restore identity mapping, with rigorous infrastructure optimization for efficiency.
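The summary does not say which manifold mHC projects onto. One construction that restores the identity-mapping property is to constrain the stream-mixing matrix to be doubly stochastic (rows and columns summing to 1), for example via Sinkhorn normalization; the sketch below illustrates that assumption and is not the paper's confirmed choice.

```python
import torch

def sinkhorn_project(W, n_iters=10, eps=1e-8):
    """Push a mixing matrix toward the doubly-stochastic manifold by
    alternately normalizing rows and columns (Sinkhorn iterations).
    Whether this is the exact manifold used by mHC is an assumption."""
    W = W.abs() + eps                        # ensure strictly positive entries
    for _ in range(n_iters):
        W = W / W.sum(dim=-1, keepdim=True)  # rows sum to 1
        W = W / W.sum(dim=-2, keepdim=True)  # columns sum to 1
    return W

# With n residual streams stacked as h of shape (n, d), mixing with
# P = sinkhorn_project(W) preserves identity in the sense that if all
# streams carry the same vector, P @ h returns them unchanged
# (each row of P sums to 1).
```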
Result: mHC enables effective training at scale with tangible performance improvements and superior scalability compared to HC.
Conclusion: mHC is a flexible, practical extension of HC that contributes to understanding topological architecture design and suggests promising directions for foundational model evolution.
Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
[64] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline
Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi
Main category: cs.CL
TL;DR: ADOPT is a framework for optimizing prompts in multi-step LLM pipelines by modeling dependencies between steps and final outcomes, enabling precise gradient estimation and adaptive resource allocation.
Details
Motivation: Multi-step LLM pipelines are effective for complex tasks but their performance heavily depends on prompts at each step. Joint optimization is difficult due to missing step-level supervision and inter-step dependencies, and existing end-to-end methods yield suboptimal or unstable results.
Method: ADOPT explicitly models dependencies between each LLM step and the final task outcome to enable precise text-gradient estimation (analogous to computing analytical derivatives). It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and uses a Shapley-based mechanism for adaptive resource allocation.
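The summary does not give ADOPT's exact allocation rule, but a Shapley-based budget split can be sketched with a standard Monte Carlo estimate over permutations. Here `score(subset)` is a placeholder for pipeline quality when only that subset of step prompts has been optimized.

```python
import random

def shapley_values(steps, score, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate of each step's contribution to the
    final-task score; score(frozenset_of_steps) is a placeholder."""
    rng = random.Random(seed)
    phi = {s: 0.0 for s in steps}
    for _ in range(n_samples):
        order = list(steps)
        rng.shuffle(order)
        prefix, prev = set(), score(frozenset())
        for s in order:
            prefix.add(s)
            cur = score(frozenset(prefix))
            phi[s] += (cur - prev) / n_samples   # marginal contribution
            prev = cur
    return phi

# One plausible allocation: give each step optimization budget proportional
# to max(phi[s], 0), so high-impact steps receive more refinement iterations.
```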
Result: Experiments on real-world datasets and diverse pipeline structures show ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
Conclusion: ADOPT provides a novel framework for optimizing multi-step LLM pipelines that addresses the challenges of missing step-level supervision and inter-step dependencies through dependency modeling and adaptive resource allocation.
Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
[65] Classifying long legal documents using short random chunks
Luis Adrián Cabrera-Diego
Main category: cs.CL
TL;DR: A legal document classifier using DeBERTa V3 + LSTM over 48 random chunks (max 128 tokens each), with a Temporal deployment pipeline, achieves a 0.898 weighted F-score.
Details
Motivation: Legal documents are challenging to classify due to their specialized vocabulary and long length, which makes full-document Transformer processing impossible, expensive, or slow.
Method: Proposes a classifier combining DeBERTa V3 with an LSTM that processes 48 randomly selected short chunks (max 128 tokens) per document, deployed using Temporal for a durable execution workflow.
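A minimal PyTorch sketch of the architecture as described: 48 random chunks per document, each encoded to a vector (the `encoder` stands in for DeBERTa V3 with pooling), then an LSTM aggregates the chunk sequence. Hidden sizes and the pooling choice are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class ChunkLSTMClassifier(nn.Module):
    def __init__(self, encoder, enc_dim=768, hidden=256, n_classes=10):
        super().__init__()
        self.encoder = encoder                 # maps (N, 128) ids -> (N, enc_dim)
        self.lstm = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, chunk_ids):              # (batch, 48, 128) token ids
        b, c, t = chunk_ids.shape
        emb = self.encoder(chunk_ids.view(b * c, t)).view(b, c, -1)
        _, (h, _) = self.lstm(emb)             # aggregate the 48 chunk vectors
        return self.head(h[-1])                # one logit vector per document

def sample_chunks(token_ids, n_chunks=48, chunk_len=128, seed=None):
    """Randomly select short chunks anywhere in a long document."""
    rng = random.Random(seed)
    hi = max(1, len(token_ids) - chunk_len)
    return [token_ids[s:s + chunk_len]
            for s in (rng.randrange(hi) for _ in range(n_chunks))]
```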
Result: The best model achieved a weighted F-score of 0.898, with a median pipeline processing time of 498 seconds per 100 files on CPU.
Conclusion: The chunk-based approach with DeBERTa V3 + LSTM effectively handles long legal documents, and Temporal deployment provides reliable processing workflow with reasonable performance.
Abstract: Classifying legal documents is challenging: besides their specialized vocabulary, they can be very long, which makes feeding full documents to Transformer-based models for classification impossible, expensive, or slow. We therefore present a legal document classifier based on DeBERTa V3 and an LSTM that takes as input a collection of 48 randomly selected short chunks (max 128 tokens). We also present its deployment pipeline using Temporal, a durable execution solution, which allows for a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a median processing time of 498 seconds per 100 files.
[66] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes
Siddhant Agarwal, Adya Dhuler, Polly Ruhnke, Melvin Speisman, Md Shad Akhtar, Shweta Yadav
Main category: cs.CL
TL;DR: RESTOREx introduces LLM-generated and human-annotated explanations for detecting depressive symptoms in memes, with MAMAMemeia framework achieving 7.55% improvement over SOTA.
Details
Motivation: Memes have evolved from humor to expressing various emotions, including depressive sentiments. With the growing use of memes for expressing depression on social media, there's a need to identify depressive symptoms in memes shared by users.
Method: Introduces the RESTOREx resource with LLM-generated and human-annotated explanations for depressive symptom detection. Proposes MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework based on Cognitive Analytic Therapy (CAT) Competencies.
Result: MAMAMemeia improves upon current state-of-the-art by 7.55% in macro-F1 score and establishes a new benchmark compared to over 30 existing methods.
Conclusion: The paper presents an effective framework for detecting depressive symptoms in memes using clinical psychology principles and multi-agent collaboration, significantly advancing the state-of-the-art in this domain.
Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1, establishing a new benchmark against over 30 existing methods.
[67] Modeling Language as a Sequence of Thoughts
Nasim Borazjanizadeh, James McClelland
Main category: cs.CL
TL;DR: Thought Gestalt (TG) is a recurrent Transformer that models language at both token and sentence-level “thought” states, improving efficiency and relational reasoning over standard GPT-2 models.
Details
Motivation: Standard Transformer language models rely too much on surface-level co-occurrence statistics, failing to form globally consistent latent representations of entities and events. This leads to brittleness in relational reasoning (like the reversal curse), contextualization errors, and data inefficiency. Human comprehension involves converting linguistic input into compact, event-like representations that persist in memory.
Method: TG is a recurrent Transformer that models language at two levels: tokens and sentence-level “thought” states. It generates tokens one sentence at a time while cross-attending to a memory of prior sentence representations. Both token and sentence representations use the same parameters and are trained with a single next-token cross-entropy objective, allowing gradients from future token losses to optimize earlier sentence vectors.
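A toy sketch of the control flow described above: generate each sentence while cross-attending to a memory of prior sentence vectors, and keep those vectors in the autograd graph so future token losses reach them. The components (a single decoder layer, mean-pooled sentence vectors) are simplifications; TG shares parameters between the two levels, which this sketch only approximates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceMemoryLM(nn.Module):
    def __init__(self, d=256, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.block = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, sentences):            # list of (1, T_i) token-id tensors
        memory, losses = [], []
        for sent in sentences:
            x = self.embed(sent)
            mem = (torch.cat(memory, dim=1) if memory
                   else torch.zeros(1, 1, x.size(-1)))
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            h = self.block(x, mem, tgt_mask=mask)  # cross-attend to "thoughts"
            logits = self.lm_head(h[:, :-1])
            losses.append(F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), sent[0, 1:]))
            # mean-pool a sentence vector; the graph is retained, so future
            # token losses backpropagate into earlier sentence representations
            memory.append(h.mean(dim=1, keepdim=True))
        return torch.stack(losses).mean()
```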
Result: TG consistently improves efficiency over matched GPT-2 runs, with scaling fits showing GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG’s loss. TG also reduces errors on relational direction generalization tasks, specifically on father-son reversal curse probes.
Conclusion: The Thought Gestalt model demonstrates that modeling language at both token and sentence-level abstractions, inspired by human cognitive processes, leads to more efficient and robust language models with better relational reasoning capabilities.
Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., the reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level “thought” states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG’s loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.
[68] AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG
Chao Peng, Bin Wang, Zhilei Long, Jinfang Sheng
Main category: cs.CL
TL;DR: AdaGReS is a redundancy-aware context selection framework for RAG that optimizes relevance while minimizing redundancy under token budget constraints, with adaptive parameter calibration and theoretical guarantees.
Details
Motivation: Standard top-k retrieval in RAG often returns redundant or near-duplicate chunks, wasting token budget and degrading generation quality. There's a need for better context selection that balances relevance and redundancy within token constraints.
Method: AdaGReS uses greedy selection under token-budget constraints with marginal gains from an objective combining query-chunk relevance and intra-set redundancy penalties. It introduces closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits.
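The greedy loop can be sketched directly from this description: pick the chunk with the best relevance-minus-redundancy marginal gain that still fits the budget. The exact set-level objective and the closed-form calibration of `lam` are the paper's contributions; here `lam` is a fixed constant and the redundancy penalty is the max similarity to already-selected chunks, both illustrative choices.

```python
def adagres_select(rel, sim, lengths, budget, lam=0.5):
    """Greedy token-budgeted chunk selection with a redundancy-penalized gain.

    rel[i]    : query-chunk relevance (e.g., cosine similarity to the query)
    sim[i][j] : chunk-chunk similarity, used as the redundancy penalty
    lengths[i]: token length of chunk i; budget is the total token budget
    """
    selected, used = [], 0
    remaining = set(range(len(rel)))

    def gain(i):
        red = max((sim[i][j] for j in selected), default=0.0)
        return rel[i] - lam * red              # marginal relevance - redundancy

    while remaining:
        feasible = [i for i in remaining if used + lengths[i] <= budget]
        if not feasible:
            break
        best = max(feasible, key=gain)
        if gain(best) <= 0:                    # no chunk still adds value
            break
        selected.append(best)
        used += lengths[best]
        remaining.remove(best)
    return selected
```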
Result: Experiments on open-domain QA (Natural Questions) and high-redundancy biomedical corpus show consistent improvements in redundancy control and context quality, leading to better end-to-end answer quality and robustness across settings.
Conclusion: AdaGReS provides an effective, theoretically-grounded solution for redundancy-aware context selection in RAG with adaptive parameter calibration, improving both efficiency and generation quality without manual tuning.
Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
[69] CascadeNS: Confidence-Cascaded Neurosymbolic Model for Sarcasm Detection
Swapnil Mane, Vaibhav Khatavkar
Main category: cs.CL
TL;DR: CascadeNS is a confidence-calibrated neurosymbolic architecture for sarcasm detection that selectively activates symbolic or neural reasoning based on confidence scores, outperforming baselines by 7.44%.
Details
Motivation: Current sarcasm detection methods struggle to balance interpretable symbolic pattern recognition with deep semantic understanding. Existing approaches either favor one paradigm or use suboptimal fusion/ensembling methods that degrade performance.
Method: CascadeNS uses a calibrated confidence measure from polarity-weighted semigraph scores to selectively activate reasoning: the symbolic semigraph handles pattern-rich cases with high confidence, while ambiguous cases go to a neural module with pre-trained LLM embeddings.
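The selective-activation control flow is simple to state in code; the threshold value and the interfaces of the two components are assumptions for illustration.

```python
def cascade_predict(text, symbolic, neural, threshold=0.8):
    """Confidence-cascaded routing: the symbolic semigraph answers when its
    calibrated confidence is high; otherwise the instance is delegated to
    the neural module. symbolic(text) -> (label, confidence in [0, 1]),
    derived from polarity-weighted semigraph scores; neural(text) -> label.
    """
    label, conf = symbolic(text)
    if conf >= threshold:
        return label, "symbolic"
    return neural(text), "neural"
```

The design avoids fusion entirely: exactly one module decides each instance, which is what the summary contrasts with feature fusion or ensembling.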
Result: Experiments on product reviews show CascadeNS outperforms strong baselines by 7.44%.
Conclusion: The confidence-calibrated selective activation approach effectively integrates symbolic and neural reasoning for sarcasm detection, achieving superior performance over fusion/ensembling methods.
Abstract: Sarcasm detection in product reviews requires balancing domain-specific symbolic pattern recognition with deep semantic understanding. Symbolic representations capture explicit linguistic phenomena that are often decisive for sarcasm detection. Existing work either favors interpretable symbolic representation or semantic neural modeling, but rarely achieves both effectively. Prior hybrid methods typically combine these paradigms through feature fusion or ensembling, which can degrade performance. We propose CascadeNS, a confidence-calibrated neurosymbolic architecture that integrates symbolic and neural reasoning through selective activation rather than fusion. A symbolic semigraph handles pattern-rich instances with high confidence, while semantically ambiguous cases are delegated to a neural module based on pre-trained LLM embeddings. At the core of CascadeNS is a calibrated confidence measure derived from polarity-weighted semigraph scores. This measure reliably determines when symbolic reasoning is sufficient and when neural analysis is needed. Experiments on product reviews show that CascadeNS outperforms the strong baselines by 7.44%.
[70] LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models
Weizhi Tang, Kwabena Nuamah, Vaishak Belle
Main category: cs.CL
TL;DR: The paper introduces LTLBench, a dataset of 2000 temporal reasoning challenges automatically synthesized using Linear Temporal Logic to benchmark LLMs’ temporal reasoning abilities across 12 models and 5 methods.
Details
Motivation: Existing temporal reasoning evaluation methods for LLMs have limitations, so the authors propose using Linear Temporal Logic (LTL) as an alternative perspective to systematically assess temporal reasoning abilities through automatically generated challenges.
Method: Developed a pipeline to automatically synthesize temporal reasoning challenges using Linear Temporal Logic (LTL), constructed a dataset of 2000 challenges, benchmarked 12 LLMs across 5 methods, and analyzed the impact of formula complexity and event count.
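A sketch of the formula-synthesis core, recursively sampling an LTL formula with a target operator count over a set of atomic events; the full pipeline also renders the formula and events into a natural-language problem, which is omitted here.

```python
import random

UNARY = ["G", "F", "X", "!"]          # globally, finally, next, not
BINARY = ["U", "&", "|", "->"]        # until, and, or, implies

def random_ltl(events, n_ops, rng=random):
    """Recursively sample an LTL formula with exactly n_ops operators
    over the given atomic events."""
    if n_ops == 0:
        return rng.choice(events)
    if rng.random() < 0.5 or n_ops == 1:
        return f"{rng.choice(UNARY)}({random_ltl(events, n_ops - 1, rng)})"
    left = rng.randrange(n_ops)                # split remaining operators
    return (f"({random_ltl(events, left, rng)} {rng.choice(BINARY)} "
            f"{random_ltl(events, n_ops - 1 - left, rng)})")

# random_ltl(["e1", "e2", "e3"], 3) -> e.g. "G((e1 U F(e2)))"
```

Varying `n_ops` and `len(events)` reproduces the complexity axes the paper studies.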
Result: Benchmarked 12 LLMs, identified 3 main issues in their temporal reasoning processes, and observed unexpected performance changes as problem complexity increases (more formula operators and events).
Conclusion: The LTL-based evaluation approach provides valuable insights into LLMs’ temporal reasoning abilities, revealing systematic issues and complexity effects that can guide future research and model development.
Abstract: Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. To study the TR ability in LLMs, prior works provide different ways for evaluating various aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, namely LTLBench, consisting of 2000 TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of varying the number of events and formula operators, which reveal 3 main issues in their temporal reasoning processes and the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.
[71] Semantic Parsing with Candidate Expressions for Knowledge Base Question Answering
Daehwan Nam, Gary Geunbae Lee
Main category: cs.CL
TL;DR: A grammar-based semantic parsing method enhanced with candidate expressions from knowledge bases to improve accuracy and decoding speed for knowledge base question answering.
Details
Motivation: Current semantic parsers use grammars for constrained decoding but lack the ability to incorporate KB information, even though logical forms contain KB elements like entities and relations.
Method: Proposes a grammar augmented with candidate expressions from KBs, with special rules for sub-type inference and union types, plus a mask caching algorithm for faster decoding.
Result: Improved accuracy on KQA Pro and Overnight benchmarks with both strong and weak supervision, plus significantly faster decoding speed.
Conclusion: Candidate expression constraints enhance semantic parsing accuracy and efficiency, making KB-aware grammar a valuable approach for semantic parsing on large knowledge bases.
Abstract: Semantic parsers convert natural language to logical forms, which can be evaluated on knowledge bases (KBs) to produce denotations. Recent semantic parsers have been developed with sequence-to-sequence (seq2seq) pre-trained language models (PLMs) or large language models, where the models treat logical forms as sequences of tokens. For syntactic and semantic validity, the semantic parsers use grammars that enable constrained decoding. However, the grammars lack the ability to utilize large information of KBs, although logical forms contain representations of KB elements, such as entities or relations. In this work, we propose a grammar augmented with candidate expressions for semantic parsing on a large KB with a seq2seq PLM. The grammar defines actions as production rules, and our semantic parser predicts actions during inference under the constraints by types and candidate expressions. We apply the grammar to knowledge base question answering, where the constraints by candidate expressions assist a semantic parser to generate valid KB elements. We also introduce two special rules, sub-type inference and union types, and a mask caching algorithm. In particular, sub-type inference and the mask caching algorithm greatly increase the decoding speed of our semantic parser. We experimented on two benchmarks, KQA Pro and Overnight, where the constraints by candidate expressions increased the accuracy of our semantic parser, whether it was trained with strong supervision or weak supervision. In addition, our semantic parser had a fast decoding speed in the experiments. Our source code is publicly available at https://github.com/daehwannam/candexpr-sp.git.
[72] Automatic identification of diagnosis from hospital discharge letters via weakly-supervised Natural Language Processing
Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva
Main category: cs.CL
TL;DR: Weakly-supervised NLP pipeline for classifying Italian discharge letters without manual labeling, achieving 77.68% AUC and 78.14% F1-score for bronchiolitis detection.
Details
Motivation: Traditional supervised approaches for patient diagnosis identification from discharge letters require extensive manual annotation, which is impractical for large textual datasets. There's a need for scalable solutions that reduce annotation burden while maintaining accuracy.
Method: 1. Extract diagnosis-related sentences from discharge letters. 2. Use a transformer-based model with additional Italian medical document pre-training to generate semantic embeddings. 3. Apply two-level clustering to the embeddings. 4. Map clusters to diseases of interest to derive weak labels. 5. Train a transformer-based classifier using the weakly labeled subset.
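A compressed sketch of steps 3-4 (two-level clustering and cluster-to-disease mapping), using k-means at both levels and a keyword-overlap rule in place of the actual cluster-to-disease mapping procedure; cluster counts, the 0.5 threshold, and the keyword rule are all illustrative assumptions.

```python
from sklearn.cluster import KMeans

def weak_labels(embeddings, sentences, disease_keywords,
                n_coarse=20, n_fine=5):
    """embeddings: np.ndarray (n, d) of sentence embeddings.
    disease_keywords: {disease: [keywords]} stands in for the manual
    cluster inspection step. Returns a weak label (or None) per sentence."""
    coarse = KMeans(n_clusters=n_coarse, n_init=10).fit_predict(embeddings)
    labels = [None] * len(sentences)
    for c in range(n_coarse):
        idx = [i for i, k in enumerate(coarse) if k == c]
        if len(idx) < n_fine:
            continue
        fine = KMeans(n_clusters=n_fine, n_init=10).fit_predict(embeddings[idx])
        for f in range(n_fine):
            members = [idx[j] for j, k in enumerate(fine) if k == f]
            for disease, words in disease_keywords.items():
                hits = sum(any(w in sentences[i].lower() for w in words)
                           for i in members)
                if members and hits / len(members) > 0.5:
                    for i in members:
                        labels[i] = disease      # weak label for this cluster
    return labels  # the labeled subset then trains the final classifier
```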
Result: Achieved 77.68% AUC (±4.30%) and 78.14% F1-score (±4.89%) for bronchiolitis detection on 33,176 Italian discharge letters. Performance surpasses unsupervised methods and approaches fully supervised models. Saves ~3 minutes per discharge letter (1,500+ hours total). Robust to cluster selection and generalizable across diseases.
Conclusion: The weakly-supervised strategy is feasible for identifying diagnoses from Italian discharge letters. The pipeline offers strong performance, adaptability to various diseases, and scalable clinical text classification while reducing manual annotation needs.
Abstract: Identifying patient diagnoses from discharge letters is essential to enable large-scale cohort selection and epidemiological research, but traditional supervised approaches rely on extensive manual annotation, which is often impractical for large textual datasets. In this study, we present a novel weakly-supervised Natural Language Processing pipeline designed to classify Italian discharge letters without requiring manual labelling. After extracting diagnosis-related sentences, the method leverages a transformer-based model with an additional pre-training on Italian medical documents to generate semantic embeddings. A two-level clustering procedure is applied to these embeddings, and the resulting clusters are mapped to the diseases of interest to derive weak labels for a subset of data, eventually used to train a transformer-based classifier. We evaluate the approach on a real-world case study on bronchiolitis in a corpus of 33,176 Italian discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region between 2017 and 2020. The pipeline achieves an area under the curve (AUC) of 77.68% (±4.30%) and an F1-score of 78.14% (±4.89%) against manual annotations. Its performance surpasses other unsupervised methods and approaches fully supervised models, maintaining robustness to cluster selection and promising generalizability across different disease types. It allows saving approximately 3 minutes of expert time per discharge letter, resulting in more than 1,500 hours for a dataset like ours. This study demonstrates the feasibility of a weakly-supervised strategy for identifying diagnoses from Italian discharge letters. The pipeline achieves strong performance, is adaptable to various diseases, and offers a scalable solution for clinical text classification, reducing the need for manual annotation while maintaining reliable accuracy.
[73] Bielik 7B v0.1: A Polish Language Model – Development, Insights, and Evaluation
Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas
Main category: cs.CL
TL;DR: Bielik 7B v0.1 is a 7B-parameter Polish language model with novel training techniques that achieves state-of-the-art performance on Polish NLP benchmarks.
Details
Motivation: To develop a high-performance generative text model specifically for Polish language processing, addressing the need for advanced AI tools in Polish NLP applications.
Method: Trained on curated Polish corpora using innovative techniques: Weighted Instruction Cross-Entropy Loss (balances different instruction types) and Adaptive Learning Rate (dynamically adjusts based on training progress).
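Weighted Instruction Cross-Entropy Loss can be sketched as per-example cross-entropy scaled by a weight looked up from the example's instruction type; how Bielik actually derives the weights is not stated in the summary, so the lookup table is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, targets, instr_types, type_weights):
    """Per-example CE scaled by instruction-type weight, so over- and
    under-represented instruction categories are balanced in training.

    logits:       (batch, seq, vocab)    targets: (batch, seq), -100 = pad
    instr_types:  (batch,) int type ids  type_weights: (n_types,) tensor
    The lookup-table weighting is an illustrative assumption."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets,
        reduction="none", ignore_index=-100)            # (batch, seq)
    n_valid = (targets != -100).sum(dim=1).clamp(min=1)
    per_example = per_token.sum(dim=1) / n_valid        # mean CE per example
    w = type_weights[instr_types]                       # weight per example
    return (w * per_example).sum() / w.sum()
```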
Result: Achieves 9 percentage point improvement over Mistral-7B-v0.1 on RAG Reader task; scores 6.15/10 in Reasoning and 7.83/10 in Role-playing on Polish MT-Bench; establishes new benchmarks through Open PL LLM Leaderboard.
Conclusion: Bielik 7B v0.1 represents a significant advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new performance benchmarks in the field.
Abstract: We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in language model development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.
[74] Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots
Maria Paola Priola
Main category: cs.CL
TL;DR: Combines RAG for hallucination mitigation with NMISS for detection, showing GPT-4 and Gemma2 perform best, while NMISS helps mid-tier models by better evaluating contextually accurate responses.
Details
Motivation: Address hallucinations in LLMs by combining detection and mitigation techniques, particularly important for healthcare applications where accuracy is critical.
Method: Uses a RAG framework for mitigation (grounding answers in external data) and introduces NMISS (Negative Missing Information Scoring System) for detection that accounts for contextual relevance.
Result: Gemma2 and GPT-4 outperform other models, with GPT-4 answers closely aligned with reference responses. NMISS benefits mid-tier models (Llama2, Llama3, Mistral) by revealing their ability to provide richer contextual information.
Conclusion: Combined RAG+NMISS approach offers new insights for reducing and more accurately assessing hallucinations in LLMs, with practical applications in healthcare and other domains.
Abstract: I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework, while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
[75] Quantifying Positional Biases in Text Embedding Models
Reagan J. Lee, Samarth Goel, Kannan Ramchandran
Main category: cs.CL
TL;DR: Embedding models show strong positional bias, prioritizing text at the beginning of inputs regardless of content relevance, with up to 12.3% greater impact from changes at the start versus end.
Details
Motivation: Embedding models are essential for IR and semantic similarity tasks, but their handling of longer texts and potential positional biases remains underexplored, particularly how content position and input size affect text embeddings.
Method: Conducted experiments with ablation studies (inserting irrelevant text or removing text at different positions), regression analysis to measure sentence importance by position, and examined models with different positional encoding mechanisms.
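The insertion ablation is easy to reproduce against any embedding model; `embed` is a placeholder for the model under test, and the noise sentence is arbitrary.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def positional_ablation(embed, doc_sentences, noise="The sky is blue."):
    """Insert an irrelevant sentence at the start vs. the end of a document
    and compare the embedding drift. embed(text) -> np.ndarray is any
    embedding model; a larger similarity drop for the start insertion
    reproduces the beginning-of-input bias the paper reports."""
    base = embed(" ".join(doc_sentences))
    at_start = embed(" ".join([noise] + doc_sentences))
    at_end = embed(" ".join(doc_sentences + [noise]))
    return {"start_sim": cosine(base, at_start),
            "end_sim": cosine(base, at_end)}
```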
Result: Embedding models disproportionately prioritize the beginning of inputs regardless of positional encoding. Changes at the start reduce cosine similarity by up to 12.3% more than changes at the end. Sentence importance declines as position moves further from the start, even with content-agnostic approaches.
Conclusion: The findings quantify retrieval system sensitivity to positional bias and suggest this provides a new lens for evaluating embedding model robustness, with biases potentially arising from pre-processing strategies and positional encoding techniques.
Abstract: Embedding models are crucial for tasks in Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies demonstrate that inserting irrelevant text at the start of a document, or removing text from the start, reduces cosine similarity between altered and original embeddings by up to 12.3% more than the same ablations at the end. Regression analysis further confirms this bias, with sentence importance declining as position moves further from the start, even under content-agnostic ablations. We hypothesize that this effect arises from pre-processing strategies and chosen positional encoding techniques. These findings quantify the sensitivity of retrieval systems and suggest a new lens towards embedding model robustness.
[76] Large Multimodal Models for Low-Resource Languages: A Survey
Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Main category: cs.CL
TL;DR: Survey paper analyzing techniques for adapting large multimodal models to low-resource languages, covering visual enhancement, data creation, cross-modal transfer, and fusion strategies across 117 studies and 96 languages.
Details
Motivation: To systematically understand how researchers adapt large multimodal models for low-resource languages, addressing challenges of limited data and computational resources, and to provide a comprehensive overview of current approaches and remaining gaps.
Method: Comprehensive analysis of 117 studies across 96 low-resource languages, categorizing works into resource-oriented and method-oriented contributions with relevant sub-categories, comparing performance and efficiency of different approaches.
Result: Visual information serves as a crucial bridge for improving model performance in low-resource settings, though significant challenges remain in hallucination mitigation and computational efficiency. The survey identifies key patterns in how researchers tackle limited data and resource constraints.
Conclusion: Provides researchers with clear understanding of current approaches and remaining challenges in making large multimodal models accessible to speakers of low-resource languages, complemented by an open-source repository for ongoing research.
Abstract: In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
[77] ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
Main category: cs.CL
TL;DR: Visual Instruction Rewriting transforms multimodal (vision+text) instructions into text-only commands using lightweight on-device VLMs, enabling privacy-preserving multimodal AI without sending sensitive visual data to the cloud.
Details
Motivation: As AR, VR, and camera-equipped devices become primary interfaces, there's a need for privacy-preserving multimodal interaction. Current cloud-based VLMs raise concerns about visual privacy (transmitting sensitive vision data) and lack real-time on-device usability.
Method: Developed a Visual Instruction Rewriting approach using a compact VLM (250M parameters). Created a dataset of 39,000+ examples across 14 domains, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Used quantization to reduce the storage footprint (<500MB).
Result: Quantized model (<500MB) achieves effective instruction rewriting as measured by NLG metrics (BLEU, METEOR, ROUGE) and semantic parsing analysis. Enables privacy-focused multimodal applications by keeping vision data on-device.
Conclusion: Visual Instruction Rewriting enables privacy-preserving multimodal AI by transforming vision-based instructions into text-only commands using lightweight on-device models, addressing both privacy concerns and real-time usability limitations of cloud-based VLMs.
Abstract: Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
[78] A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Xiaoye Qu, Yafu Li, Zhao-Chen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng
Main category: cs.CL
TL;DR: Survey on improving reasoning efficiency in Large Reasoning Models (LRMs) that addresses their tendency to produce excessively long, redundant reasoning traces, which creates challenges for training, inference, and deployment.
Details
Motivation: Recent LRMs like DeepSeek-R1 and OpenAI o1 show strong performance but generate overly long reasoning traces with redundant content, over-analysis of simple problems, and superficial exploration of multiple paths. This inefficiency creates significant challenges for training, inference, and real-world deployment where token economy is critical.
Method: Comprehensive survey approach examining: 1) Common patterns of inefficiency in LRMs, 2) Methods proposed across the entire LRM lifecycle (from pretraining to inference), 3) Analysis of unique challenges in this new paradigm, 4) Maintenance of a real-time GitHub repository tracking progress.
Result: The survey provides a structured overview of recent efforts to improve reasoning efficiency in LRMs, identifies key inefficiency patterns, examines lifecycle approaches, and establishes a foundation for future research with ongoing tracking via GitHub repository.
Conclusion: This survey serves as a foundation for further exploration and aims to inspire innovation in improving reasoning efficiency in rapidly evolving Large Reasoning Models, addressing critical token economy challenges for practical deployment.
Abstract: Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.
[79] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Ding Chen, Qingchen Yu, Pengyuan Wang, Mengting Hu, Wentao Zhang, Zhengren Wang, Bo Tang, Feiyu Xiong, Xinchi Li, Chao Wang, Minchuan Yang, Zhiyu Li
Main category: cs.CL
TL;DR: xVerify is an efficient answer verifier for evaluating reasoning models that addresses limitations of existing evaluation methods in judging answer equivalence and extracting final answers from complex reasoning outputs.
Details
Motivation: Existing evaluation methods and reward models are inadequate for reasoning models that produce complex outputs with intermediate steps and self-reflection, struggling with answer equivalence judgment and reliable answer extraction from long responses.
Method: Propose the xVerify answer verifier trained on the VAR dataset (question-answer pairs from multiple LLMs across diverse datasets), with multi-round annotation for quality. Train xVerify models at different scales (0.5B to 3B parameters).
Result: All xVerify variants achieve over 95% F1 score and accuracy. xVerify-0.5B-I outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o overall. As reward model in RL, yields 18.4% improvement for Qwen2.5-7B vs direct generation.
Conclusion: xVerify demonstrates strong equivalence judgment capabilities and generalizability for evaluating reasoning models, with open-source resources available on GitHub.
Abstract: With the release of OpenAI’s o1 model, reasoning models that adopt slow-thinking strategies have become increasingly common. Their outputs often contain complex reasoning, intermediate steps, and self-reflection, making existing evaluation methods and reward models inadequate. In particular, they struggle to judge answer equivalence and to reliably extract final answers from long, complex responses. To address this challenge, we propose xVerify, an efficient answer verifier for evaluating reasoning models. xVerify shows strong equivalence judgment capabilities, enabling accurate comparison between model outputs and reference answers across diverse question types. To train and evaluate xVerify, we construct the VAR dataset, which consists of question-answer pairs generated by multiple LLMs across various datasets. The dataset incorporates multiple reasoning models and challenging evaluation sets specifically designed for reasoning assessment, with a multi-round annotation process to ensure label quality. Based on VAR, we train xVerify models at different scales. Experimental results on both test and generalization sets show that all xVerify variants achieve over 95% F1 score and accuracy. Notably, the smallest model, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. In addition, reinforcement learning experiments using xVerify as the reward model yield an 18.4% improvement for Qwen2.5-7B compared with direct generation, exceeding the gains achieved with Math Verify as the reward. These results demonstrate the effectiveness and generalizability of xVerify. All xVerify resources are available on GitHub: https://github.com/IAAR-Shanghai/xVerify.
[80] Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
Main category: cs.CL
TL;DR: Pre-DPO improves DPO and SimPO by using a reference model as a data weight adjuster to guide training, avoiding performance ceilings and catastrophic forgetting.
Details
Motivation: Standard DPO initializes policy and reference models identically, leading to inefficient data utilization and performance ceilings. SimPO lacks a reference model entirely, reducing robustness and risking catastrophic forgetting.
Method: Pre-DPO introduces a guiding reference model that provides foresight into optimal policy states achievable through the training data. This model adaptively assigns higher weights to suitable samples and lower weights to unsuitable ones, acting as a data weight adjuster.
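Mechanically, Pre-DPO can be read as standard DPO with the reference log-probabilities supplied by a guiding model rather than a frozen copy of the initial policy; the sketch below shows only that substitution, with the guide's provenance (e.g., a model already trained on the same preference data) left as the paper's design choice.

```python
import torch.nn.functional as F

def pre_dpo_loss(policy_logps_w, policy_logps_l,
                 guide_logps_w, guide_logps_l, beta=0.1):
    """Standard DPO loss, but with reference log-probs from a *guiding*
    reference model instead of an identically initialized one. Through the
    log-ratio terms, the guide implicitly re-weights samples: pairs the
    guide already orders correctly contribute smaller gradients.

    All inputs are tensors of per-response sequence log-probabilities for
    chosen (w) and rejected (l) responses."""
    ratio_w = policy_logps_w - guide_logps_w     # chosen responses
    ratio_l = policy_logps_l - guide_logps_l     # rejected responses
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```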
Result: Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks show Pre-DPO consistently improves performance of both DPO and SimPO without needing external models or additional data.
Conclusion: Pre-DPO offers a simple yet effective enhancement to preference optimization by leveraging reference models as adaptive data weight adjusters, overcoming limitations of both DPO and SimPO approaches.
Abstract: Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
[81] An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation
Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy, Hadas Abraham, Nitzan Barzilay, Eyal Shnarch, Michael E. Factor, Shila Ofek-Koifman, Paula Ta-Shma, Assaf Toledo
Main category: cs.CL
TL;DR: RAG hyperparameter optimization can be done efficiently with greedy or random search, significantly boosting performance across diverse datasets.
Details
Motivation: Optimizing RAG configurations is complex and resource-intensive, but existing HPO frameworks lack rigorous benchmarking, creating a gap in understanding their effectiveness.
Method: Comprehensive study with five HPO algorithms over five diverse datasets, including a new real-world product documentation dataset. Used the largest RAG HPO search space to date with full grid-search evaluations and three evaluation metrics as optimization targets.
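A minimal greedy coordinate search consistent with the finding below (model selection first, one dimension fixed at a time); the dimension names and default values are illustrative.

```python
def greedy_rag_hpo(search_space, evaluate, order=None):
    """Greedy coordinate search over RAG hyper-parameters: optimize one
    dimension at a time, holding the others at their current values.

    search_space: dict like {"llm": [...], "chunk_size": [...], "top_k": [...]}
    evaluate(config) -> float score on the target metric (placeholder).
    Per the paper's finding, the default order puts model selection first."""
    order = order or ["llm"] + [d for d in search_space if d != "llm"]
    config = {d: vals[0] for d, vals in search_space.items()}  # defaults
    for dim in order:
        best_val, best_score = None, float("-inf")
        for v in search_space[dim]:
            score = evaluate({**config, dim: v})
            if score > best_score:
                best_val, best_score = v, score
        config[dim] = best_val                 # fix this dimension, move on
    return config
```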
Result: RAG HPO can be done efficiently with greedy or random search approaches, significantly boosting RAG performance across all datasets. For greedy HPO, optimizing model selection first outperforms following the RAG pipeline order.
Conclusion: RAG hyperparameter optimization is effective and efficient, with greedy approaches benefiting from prioritizing model selection over pipeline-order optimization.
Abstract: Optimizing Retrieval-Augmented Generation (RAG) configurations for specific tasks is a complex and resource-intensive challenge. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To fill this gap, we present a comprehensive study involving five HPO algorithms over five datasets from diverse domains, including a newly curated real-world product documentation dataset. Our study explores the largest RAG HPO search space to date that includes full grid-search evaluations, and uses three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing model selection first is preferable to the common practice of following the RAG pipeline order during optimization.
[82] MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
Main category: cs.CL
TL;DR: Researchers introduce two manga understanding benchmarks (MangaOCR for text recognition and MangaVQA for contextual Q&A) and develop MangaLMM, a specialized model that outperforms proprietary models like GPT-4o and Gemini 2.5 on manga comprehension tasks.
Details
Motivation: Manga is a complex multimodal narrative form blending images and text. Teaching large multimodal models to understand manga at human-like levels could help manga creators reflect on and refine their stories, but current LMMs lack specialized capabilities for this domain.
Method: 1) Create two benchmarks: MangaOCR for in-page text recognition and MangaVQA (526 manually constructed QA pairs) for contextual understanding. 2) Develop MangaLMM by finetuning Qwen2.5-VL to jointly handle both tasks. 3) Conduct extensive experiments comparing with proprietary models like GPT-4o and Gemini 2.5.
Result: MangaVQA provides a reliable evaluation benchmark across diverse narrative scenarios. MangaLMM demonstrates strong performance on manga understanding tasks, showing competitive results compared to proprietary models. The benchmarks enable comprehensive assessment of LMM capabilities in manga domain.
Conclusion: The introduced benchmarks and specialized model provide a comprehensive foundation for evaluating and advancing large multimodal models in the richly narrative domain of manga, addressing the unique challenges of multimodal narrative understanding.
Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
[83] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs
Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken
Main category: cs.CL
TL;DR: A novel atomic fact-checking framework improves LLM reliability in medical Q&A by decomposing responses into verifiable atomic facts and checking them against authoritative medical guidelines, achieving 40% answer improvement and 50% hallucination detection.
Details
Motivation: LLMs have extensive medical knowledge but suffer from hallucinations and inaccurate citations, posing challenges for clinical adoption and regulatory compliance. Current methods like RAG partially address these issues but still have problems with hallucinations and low fact-level explainability.
Method: Introduces an atomic fact-checking framework that decomposes LLM-generated responses into discrete, verifiable units called atomic facts. Each atomic fact is independently verified against an authoritative knowledge base of medical guidelines. This enables targeted error correction and direct tracing to source literature.
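The pipeline skeleton follows directly from this description; the three callables stand in for the LLM decomposer, the guideline retriever, and the LLM verifier, whose exact prompts and interfaces are assumptions here.

```python
def atomic_fact_check(answer, decompose, retrieve, verify):
    """Split an answer into atomic facts, retrieve guideline passages for
    each, and verify each fact independently.

    decompose(answer)    -> list[str]   atomic facts (LLM placeholder)
    retrieve(fact)       -> list[str]   guideline chunks (retriever placeholder)
    verify(fact, chunks) -> ("supported" | "unsupported", rationale str)
    """
    report = []
    for fact in decompose(answer):
        chunks = retrieve(fact)
        verdict, rationale = verify(fact, chunks)
        report.append({"fact": fact, "verdict": verdict,
                       "evidence": chunks, "rationale": rationale})
    hallucinated = [r for r in report if r["verdict"] == "unsupported"]
    return report, hallucinated  # unsupported facts drive targeted correction
```

Keeping the evidence chunks per fact is what gives the granular traceability described in the results.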
Result: Extensive evaluation using multi-reader assessments by medical experts and automated open Q&A benchmarks showed significant improvements in factual accuracy and explainability. The framework achieved up to 40% overall answer improvement and 50% hallucination detection rate. Each atomic fact can be traced back to relevant database chunks, providing granular transparency.
Conclusion: This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare by improving factual accuracy and explainability.
Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
[84] Toward Robust Legal Text Formalization into Defeasible Deontic Logic using LLMs
Elias Horner, Cristinel Mateis, Guido Governatori, Agata Ciabattoni
Main category: cs.CL
TL;DR: Automated formalization of legal texts into Defeasible Deontic Logic using LLMs with improved metrics and pipeline.
Details
Motivation: To develop scalable automated methods for transforming complex legal texts into formal logical representations, addressing the challenge of legal informatics and enabling computational reasoning about legal norms.
Method: A structured pipeline that segments legal texts into atomic snippets, extracts deontic rules, and evaluates syntactic/semantic coherence, with a novel two-stage approach including a refinement step for logical consistency. Uses multiple LLM configurations with various prompting and fine-tuning strategies.
Result: LLMs can produce formalizations closely aligned with expert-crafted representations when effectively guided, as demonstrated on Australian Telecommunications Consumer Protections Code norms. Comparative results across LLM configurations show improved performance with refined metrics.
Conclusion: LLMs show significant potential for scalable legal informatics through automated formalization of legal texts into Defeasible Deontic Logic, with the refined pipeline and metrics enabling more accurate and consistent logical representations.
Abstract: We present a comprehensive approach to the automated formalization of legal texts using large language models (LLMs), targeting their transformation into Defeasible Deontic Logic (DDL). Our method employs a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. We introduce a refined success metric that more precisely captures the completeness of formalizations, and a novel two-stage pipeline with a dedicated refinement step to improve logical consistency and coverage. The evaluation procedure has been strengthened with stricter error assessment, and we provide comparative results across multiple LLM configurations, including newly released models and various prompting and fine-tuning strategies. Experiments on legal norms from the Australian Telecommunications Consumer Protections Code demonstrate that, when guided effectively, LLMs can produce formalizations that align closely with expert-crafted representations, underscoring their potential for scalable legal informatics.
[85] A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff
Main category: cs.CL
TL;DR: This survey paper analyzes trial-patient matching in clinical trial recruitment, examining LLM-based approaches, benchmarks, and challenges in adopting LLM technologies for this critical domain.
Details
Motivation: LLMs have improved general NLP tasks but adoption in critical domains like clinical trial recruitment remains limited. Trial-patient matching benefits from LLMs' knowledge aggregation and reasoning abilities, but current approaches rely on proprietary models and weak evaluation benchmarks.
Method: The paper conducts a comprehensive survey analyzing trial-patient matching as a task, contextualizing emerging LLM-based approaches, critically examining existing benchmarks, approaches, evaluation frameworks, and adoption challenges.
Result: This is the first survey to systematically analyze trial-patient matching and LLM-based approaches in clinical trial recruitment, identifying gaps in current methods and evaluation practices.
Conclusion: The survey highlights the potential of LLMs for clinical trial recruitment while identifying key challenges and exciting future directions for research in this critical domain.
Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from the knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific, and LLMs, with their ability to consolidate distributed knowledge, hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches, and evaluation frameworks, as well as the challenges to adopting LLM technologies in clinical research and exciting future directions.
[86] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang
Main category: cs.CL
TL;DR: MuRating is a framework that transfers English data-quality signals to create a multilingual rater for 17 languages, enabling better selection of training data for LLMs and improving performance on both English and multilingual tasks.
Details
Motivation: Existing data-quality selection methods for large language models focus almost exclusively on English, creating a gap for multilingual applications where high-quality non-English training data is needed.
Method: Aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs.
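The paper does not spell out its aggregation estimator here; a Bradley-Terry-style fit is one standard way to turn such pairwise preferences into unified scores, sketched below for illustration only (the authors' exact estimator may differ).

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = how often document i was preferred over document j.
    Returns normalized quality scores via the standard MM updates."""
    n = wins.shape[0]
    scores = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of document i
            den = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                      for j in range(n) if j != i)
            if den > 0:
                scores[i] = num / den
        scores /= scores.sum()
    return scores  # higher = higher estimated quality

wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry(wins))  # doc 0 gets the highest score
```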
Result: Applied to web data, MuRating selects balanced subsets for pretraining a 1.2B-parameter LLaMA model, boosting average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.
Conclusion: MuRating successfully transfers English data-quality signals to multilingual contexts, improving LLM performance across languages, though challenges remain with translation fidelity, selection biases, and underrepresentation of narrative material.
Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
[87] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut
Main category: cs.CL
TL;DR: PERK is a parameter-efficient method that uses test-time gradient updates to encode long contexts into lightweight LoRA adapters, enabling more accurate reasoning over noisy information with better scaling than prompt-based approaches.
Details
Motivation: Long-context reasoning requires identifying relevant information in extensive noisy contexts. While test-time learning can encode context into model parameters effectively, existing meta-learning methods are too memory-intensive for long contexts.
Method: PERK uses nested optimization loops during meta-training: inner loop rapidly encodes contexts into low-rank adapters (LoRA) as parameter-efficient memory modules; outer loop learns to use updated adapters to recall and reason over relevant information from encoded long contexts.
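As a deliberately toy but runnable caricature of the nested loops, the sketch below swaps the LLM for a small linear map with a low-rank adapter; the losses, step sizes, and dimensions are invented for illustration and are not taken from the paper.

```python
import torch

d, r = 16, 2
base = torch.randn(d, d) * 0.1              # frozen "pretrained" weights
A0 = torch.zeros(d, r, requires_grad=True)  # meta-learned adapter init
B0 = torch.randn(r, d, requires_grad=True)
meta_opt = torch.optim.Adam([A0, B0], lr=1e-2)

def forward(x, A, B):
    return x @ (base + A @ B)               # base model plus low-rank adapter

for _ in range(100):                        # outer loop (meta-training)
    W_ctx = torch.randn(d, d) * 0.1         # a fresh "context" each episode
    ctx_x, qry_x = torch.randn(32, d), torch.randn(32, d)
    A, B = A0, B0
    for _ in range(3):                      # inner loop: encode the context
        inner_loss = ((forward(ctx_x, A, B) - ctx_x @ W_ctx) ** 2).mean()
        gA, gB = torch.autograd.grad(inner_loss, (A, B), create_graph=True)
        A, B = A - 0.5 * gA, B - 0.5 * gB   # adapter now "stores" the context
    # Outer objective: answer "queries" about the encoded context.
    meta_loss = ((forward(qry_x, A, B) - qry_x @ W_ctx) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()                    # backprop through the inner updates
    meta_opt.step()
```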
Result: PERK significantly outperforms standard prompt-based baselines: up to 90% absolute performance gains for smaller models (GPT-2) and up to 27% for larger models (Qwen-2.5-0.5B). More robust to reasoning complexity, length extrapolation, and location of relevant information.
Conclusion: PERK provides a scalable approach for long-context reasoning that is memory-intensive during training but scales efficiently at inference time compared to prompt-based methods, enabling effective test-time learning for noisy long contexts.
Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
[88] Natural Language Processing for Tigrinya: Current State and Future Directions
Fitsum Gaim, Jong C. Park
Main category: cs.CL
TL;DR: A comprehensive survey of NLP research for Tigrinya (2011-2025) analyzing over 50 studies, covering resources, models, and applications across 15 downstream tasks, revealing progress from rule-based to neural systems and identifying key challenges and future directions.
Details
Motivation: Tigrinya is spoken by millions but severely underrepresented in NLP research. This survey aims to systematically review the current state of Tigrinya NLP, document progress, and provide a roadmap for future research to address this gap.
Method: Systematic review and analysis of over 50 studies from 2011 to 2025, examining computational resources, models, and applications across fifteen downstream NLP tasks including morphological processing, POS tagging, NER, machine translation, question-answering, speech recognition, and synthesis.
Result: The analysis reveals a clear trajectory from foundational rule-based systems to modern neural architectures, with progress driven by milestones in resource creation. Key challenges include Tigrinya’s morphological complexity and resource scarcity. The survey identifies promising research directions including morphology-aware modeling, cross-lingual transfer, and community-centered resource development.
Conclusion: This comprehensive survey serves as both a reference for researchers and a roadmap for advancing Tigrinya NLP. An anthology of surveyed studies and resources is publicly available to support future research in this underrepresented language.
Abstract: Despite being spoken by millions of people, Tigrinya remains severely underrepresented in Natural Language Processing (NLP) research. This work presents a comprehensive survey of NLP research for Tigrinya, analyzing over 50 studies from 2011 to 2025. We systematically review the current state of computational resources, models, and applications across fifteen downstream tasks, including morphological processing, part-of-speech tagging, named entity recognition, machine translation, question-answering, speech recognition, and synthesis. Our analysis reveals a clear trajectory from foundational, rule-based systems to modern neural architectures, with progress consistently driven by milestones in resource creation. We identify key challenges rooted in Tigrinya’s morphological properties and resource scarcity, and highlight promising research directions, including morphology-aware modeling, cross-lingual transfer, and community-centered resource development. This work serves both as a reference for researchers and as a roadmap for advancing Tigrinya NLP. An anthology of surveyed studies and resources is publicly available.
[89] Multi-step retrieval and reasoning improves radiology question answering with large language models
Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Main category: cs.CL
TL;DR: RaR is a multi-step retrieval and reasoning framework that improves LLM performance in radiology QA by enhancing diagnostic accuracy, reducing hallucinations, and providing clinically relevant context.
Details
Motivation: Traditional single-step RAG systems in radiology QA are limited in handling complex clinical reasoning tasks, necessitating a more sophisticated framework to improve diagnostic accuracy and factual consistency.
Method: Proposed radiology Retrieval and Reasoning (RaR) framework with multi-step retrieval and reasoning; evaluated 25 LLMs across diverse architectures, parameter scales (0.5B to >670B), and training paradigms on 104 expert-curated radiology questions from RSNA-RadioQA and ExtendedQA datasets, plus 65 unseen real-world radiology board questions.
Result: RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional RAG, with greatest gains in small-scale models; reduced hallucinations by mean 9.4%; retrieved clinically relevant context in 46% of cases; even clinically fine-tuned models showed improvements.
Conclusion: RaR enhances factuality and diagnostic accuracy in radiology QA, demonstrating retrieval remains beneficial despite embedded domain knowledge; warrants future clinical validation studies; all materials publicly available for open research and clinical translation.
Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose radiology Retrieval and Reasoning (RaR), a multi-step retrieval and reasoning framework designed to improve diagnostic accuracy, factual consistency, and clinical reliability of LLMs in radiology question answering. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. RaR significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, RaR retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from RaR (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of RaR to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate its clinical utility. All datasets, code, and the full RaR framework are publicly available to support open research and clinical translation.
[90] MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian
Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu
Main category: cs.CL
TL;DR: Introduces MedQARo, the first large-scale medical QA benchmark in Romanian with 105,880 QA pairs from cancer patient case summaries, enabling evaluation of LLM generalization across domains and languages.
Details
Motivation: Addresses the lack of QA datasets in specific domains and languages (particularly Romanian medical domain), which hinders development of robust AI models that can generalize across various domains and languages for achieving AGI.
Method: Constructed high-quality dataset through manual annotation by 7 specialized physicians (3,000 work hours) from 1,242 cancer patient case summaries. Evaluated 4 open-source LLMs in zero-shot and fine-tuned scenarios, plus 2 API-based LLMs (GPT-5.2 and Gemini 3 Flash) on in-domain and cross-domain test collections.
Result: Fine-tuned models significantly outperform zero-shot models, showing pretrained models fail to generalize on MedQARo. Results demonstrate importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian.
Conclusion: MedQARo fills critical gap in Romanian medical QA resources and shows that current LLMs require domain- and language-specific fine-tuning for reliable clinical applications, highlighting need for specialized benchmarks to advance AGI in healthcare.
Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality and large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions regard medical case summaries of 1,242 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours to generate the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
[91] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents
Seungkyu Lee, Nalim Kim, Yohan Jo
Main category: cs.CL
TL;DR: In-N-Out is an expert-annotated dataset of API graphs that improves tool agent performance for complex multi-tool queries by capturing API dependencies from documentation.
Details
Motivation: Tool agents struggle with complex tasks requiring multiple API calls in proper order, as they can't effectively identify dependencies between APIs from documentation alone.
Method: Convert API documentation into structured API graphs capturing dependencies, create In-N-Out dataset with expert annotations from real-world API benchmarks, and use it for tool retrieval and multi-tool query generation.
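A minimal sketch of the core idea, assuming a toy catalog where an edge u to v exists whenever an output parameter of API u is an input parameter of API v; a topological order of that graph then yields a valid call sequence. The catalog and schema here are invented, not the dataset's real format.

```python
from graphlib import TopologicalSorter

# Hypothetical mini-catalog: each API lists the parameters it
# consumes ("in") and produces ("out").
apis = {
    "search_user": {"in": {"name"},     "out": {"user_id"}},
    "list_orders": {"in": {"user_id"},  "out": {"order_id"}},
    "track_order": {"in": {"order_id"}, "out": {"status"}},
}

def build_api_graph(apis: dict) -> dict[str, set[str]]:
    """Map each API to its prerequisites: v depends on u when some
    output parameter of u is an input parameter of v."""
    deps = {name: set() for name in apis}
    for u, spec_u in apis.items():
        for v, spec_v in apis.items():
            if u != v and spec_u["out"] & spec_v["in"]:
                deps[v].add(u)
    return deps

deps = build_api_graph(apis)
print(list(TopologicalSorter(deps).static_order()))
# ['search_user', 'list_orders', 'track_order']  -- a valid call order
```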
Result: In-N-Out nearly doubles performance compared to LLMs using documentation alone, and models fine-tuned on it close 90% of the gap to expert performance.
Conclusion: Explicit API graphs are promising for tool agents, and In-N-Out is a valuable resource for helping models learn API documentation comprehension and parameter relationships.
Abstract: Tool agents–LLM-based systems that interact with external APIs–offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We release our dataset and code at https://github.com/holi-lab/In-N-Out-API-Graph.
[92] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Jianghao Chen, Wei Sun, Qixiang Yin, Zhixing Tan, Jiajun Zhang
Main category: cs.CL
TL;DR: ACE-RL framework uses adaptive constraint criteria and reinforcement learning to improve long-form text generation, outperforming existing methods and even GPT-4o on WritingBench.
Details
Motivation: Existing long-form generation approaches are limited by scarce high-quality training data and rely on coarse-grained metrics that don't capture nuanced, scenario-specific requirements of real-world tasks.
Method: ACE-RL decomposes instructions into fine-grained adaptive constraint criteria, designs a reward mechanism based on constraint satisfaction, and uses reinforcement learning to optimize LLMs with these fine-grained signals.
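Schematically, the reward reduces to the fraction of per-instruction constraints a response satisfies. The three checkers below are invented examples for one hypothetical writing instruction, since ACE-RL derives its criteria adaptively from each instruction.

```python
import re
from typing import Callable

# Hypothetical fine-grained constraints for one instruction.
constraints: list[Callable[[str], bool]] = [
    lambda resp: len(resp.split()) >= 300,               # minimum length
    lambda resp: resp.lower().count("renewable") >= 2,   # topical coverage
    lambda resp: bool(re.search(r"(?m)^## ", resp)),     # has section headings
]

def constraint_reward(response: str) -> float:
    """Reward = fraction of satisfied constraints (the RL training signal)."""
    return sum(c(response) for c in constraints) / len(constraints)

draft = "## Overview\n" + "renewable energy " * 160
print(constraint_reward(draft))  # 1.0 -> all three constraints satisfied
```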
Result: ACE-RL outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, with the top-performing model surpassing GPT-4o by 8.76%.
Conclusion: ACE-RL provides a more effective training paradigm for long-form generation by converting subjective quality evaluation into constraint verification and leveraging fine-grained reinforcement learning signals.
Abstract: Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), overlooking the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. Subsequently, we design a reward mechanism to quantify the response quality based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs using these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 8.76%, providing a more effective training paradigm in long-form generation scenarios.
[93] SiDiaC: Sinhala Diachronic Corpus
Nevidu Jayatilleke, Nisansa de Silva
Main category: cs.CL
TL;DR: SiDiaC is the first comprehensive Sinhala diachronic corpus spanning 5th-20th century CE, with 58k words from 46 literary works, annotated and categorized by genre, enabling historical linguistic research for Sinhala NLP.
Details
Motivation: Sinhala lacks comprehensive diachronic resources for historical linguistic studies. There's a need for a structured corpus to enable research in lexical change, neologism tracking, historical syntax, and corpus-based lexicography for this low-resourced language.
Method: Digitized texts from National Library of Sri Lanka using Google Document AI OCR, followed by post-processing for formatting and orthography modernization. Applied filtering based on availability, authorship, copyright compliance, and data attribution. Used practices from other corpora like FarPaHC for syntactic annotation and text normalization. Implemented two-layer genre categorization system.
Result: Created SiDiaC corpus with 58k words across 46 literary works spanning 15 centuries. Corpus is annotated by written date and categorized into primary (Non-Fiction/Fiction) and secondary (Religious, History, Poetry, Language, Medical) genres. Successfully addressed challenges of limited access to rare texts and reliance on secondary date sources.
Conclusion: SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending available resources and enabling diachronic studies. It represents a crucial step forward for historical linguistic research on Sinhala, despite challenges in text accessibility and dating.
Abstract: SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, annotated carefully based on the written date, after filtering based on availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, due to the shared characteristics of low-resourced language status. This corpus is categorised based on genres into two layers: primary and secondary. Primary categorisation is binary, classifying each book into Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala, enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.
[94] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation
Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen
Main category: cs.CL
TL;DR: LiRA is a multi-agent AI system that automates literature review writing by simulating human review processes, outperforming existing baselines in writing quality and citation accuracy.
Details
Motivation: The rapid growth of scientific publications makes comprehensive literature reviews difficult to maintain. While prior work automated retrieval and screening, the writing phase remains under-explored, particularly regarding readability and factual accuracy.
Method: LiRA uses a multi-agent collaborative workflow that emulates human literature review processes, with specialized agents for content outlining, subsection writing, editing, and reviewing to produce cohesive review articles.
Result: LiRA outperforms current baselines (AutoSurvey and MASS-Survey) in writing and citation quality on SciReviewGen and ScienceDirect datasets, while maintaining competitive similarity to human-written reviews. It shows robustness to reviewer model variation and performs well in real-world retrieval scenarios.
Conclusion: Agentic LLM workflows, even without domain-specific tuning, have significant potential to improve the reliability and usability of automated scientific writing for literature reviews.
Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.
[95] Large Language Model Sourcing: A Survey
Liang Pang, Jia Gu, Sunhao Dai, Zihao Wei, Zenghao Duan, Kangxi Wu, Zhiyi Yin, Jun Xu, Huawei Shen, Xueqi Cheng
Main category: cs.CL
TL;DR: Survey paper on LLM transparency through multi-dimensional sourcing approaches for addressing hallucinations, bias, and copyright issues.
Details
Motivation: LLMs are black-box systems with serious issues including hallucinations, bias, unfairness, and copyright infringement. Their realistic outputs make these problems significant, requiring multi-perspective information sourcing to enhance transparency and trustworthiness.
Method: Proposes a systematic investigation organized around four dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Introduces a dual-paradigm taxonomy classifying methods as prior-based (proactive traceability embedding) and posterior-based (retrospective inference).
Result: The survey provides a comprehensive framework for understanding sourcing approaches that enhance traceability across multiple dimensions of LLM development and deployment.
Conclusion: Multi-dimensional sourcing approaches improve transparency, accountability, and trustworthiness of LLMs in real-world applications by enabling traceability across model development and deployment processes.
Abstract: Due to the black-box nature of large language models (LLMs) and the realism of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement have become significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation organized around four interrelated dimensions: Model Sourcing, Model Structure Sourcing, Training Data Sourcing, and External Data Sourcing. Moreover, a unified dual-paradigm taxonomy is proposed that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.
[96] Invisible Languages of the LLM Universe
Saurabh Khanna, Xinxu Li
Main category: cs.CL
TL;DR: The paper analyzes linguistic inequality in AI systems, revealing that 27% of languages with millions of speakers are “Invisible Giants” - high vitality but near-zero digital presence, reflecting colonial-era power structures in contemporary AI development.
Details
Motivation: Despite massive multilingual training data, LLMs exclude approximately 2,000 languages with millions of speakers, creating a crisis of linguistic invisibility in digital ecosystems that requires structural analysis beyond technical explanations.
Method: Proposes a critical framework connecting empirical measurements of language vitality (demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice, analyzing data across all documented human languages to categorize them into four groups.
Result: Identifies four language categories: Strongholds (33%), Digital Echoes (6%), Fading Voices (36%), and critically, Invisible Giants (27% - languages spoken by millions but with near-zero digital presence). Shows these patterns reflect colonial-era linguistic hierarchies in contemporary AI.
Conclusion: English dominance in AI is not technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge, requiring decolonization of language technology and democratization of AI access.
Abstract: Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world’s 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and critically, Invisible Giants (27%, high vitality but near-zero digitality) - languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.
[97] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen
Main category: cs.CL
TL;DR: MTI is a training-free framework that improves LLM reasoning by selectively applying interventions only to high-uncertainty tokens, achieving significant accuracy gains with minimal computational overhead.
Details
Motivation: Current LLM scaling approaches for reasoning improvement are inefficient due to high inference computation costs. The authors discovered that reasoning uncertainty is highly localized - only a small subset of high-entropy tokens dominantly affects output correctness, suggesting opportunities for more targeted interventions.
Method: Minimal Test-Time Intervention (MTI) includes two components: (1) Selective CFG intervention that applies classifier-free guidance only at uncertain positions, and (2) Lightweight negative-prompt guidance that reuses the main model’s KV cache to approximate unconditional decoding efficiently.
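In spirit, the selective intervention looks like the sketch below: a generic entropy gate plus the standard classifier-free guidance combination. The threshold, guidance weight, and the KV-cache-reusing negative-prompt trick are our simplifications, not the paper's exact settings.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p + 1e-12)).sum())

def mti_logits(cond_logits, uncond_logits, w=1.5, tau=2.0):
    """Apply CFG only where the conditional next-token distribution
    is uncertain; confident tokens are decoded unchanged."""
    p = np.exp(cond_logits - cond_logits.max())
    p /= p.sum()
    if entropy(p) < tau:        # low-entropy token: skip the intervention
        return cond_logits
    # Classifier-free guidance: push away from the unconditional prediction.
    return uncond_logits + w * (cond_logits - uncond_logits)

rng = np.random.default_rng(0)
cond, uncond = rng.normal(size=50), rng.normal(size=50)
print(mti_logits(cond, uncond)[:3])
```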
Result: MTI achieves consistent gains across general, coding, and STEM tasks: +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while maintaining high efficiency.
Conclusion: The paper demonstrates that targeted interventions on localized uncertainty tokens can significantly improve reasoning accuracy with minimal computational overhead, offering an efficient alternative to brute-force test-time scaling approaches.
Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized; only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0) while remaining highly efficient.
[98] OpenSIR: Open-Ended Self-Improving Reasoner
Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
Main category: cs.CL
TL;DR: OpenSIR is a self-play framework where LLMs alternate teacher/student roles to generate and solve novel math problems without external supervision, enabling open-ended learning and performance improvements.
Details
Motivation: Current LLM reasoning methods rely on annotated datasets which limit surpassing human performance. Self-play approaches need external verifiers or can't learn open-endedly. Need a framework for autonomous, open-ended mathematical discovery.
Method: OpenSIR uses self-play where an LLM alternates between teacher (generating problems) and student (solving problems) roles. It optimizes for both difficulty (challenging appropriately) and diversity (exploring distinct concepts) when generating novel problems, starting from a single trivial seed problem.
Result: OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct improved from 73.9 to 78.3 on GSM8K and from 28.8 to 34.4 on College Math; Gemma-2-2B-Instruct rose from 38.5 to 58.7 on GSM8K.
Conclusion: OpenSIR enables open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, allowing autonomous progression from basic to advanced mathematics without external supervision.
Abstract: Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models’ ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
[99] Can ensembles improve evidence recall? A case study
Katharina Beckh, Sven Heuser, Stefan Rüping
Main category: cs.CL
TL;DR: Ensemble approach improves recall of complete evidence identification in medical NLP tasks over individual models.
Details
Motivation: Many applications (compliance, cataloging) require identifying the full set of contributing features (complete evidence), not just minimal sufficient evidence provided by typical feature attribution methods.
Method: Case study using existing language models on medical dataset with human-annotated complete evidence; ensemble approach aggregating evidence from several models; examined ensemble sizes and effect of evidence-guided training.
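The aggregation itself can be as simple as a union over each model's attributed tokens, as in this toy example with made-up index sets: each model recovers only part of the gold evidence, while the union recovers more of it.

```python
# Evidence as sets of token indices (values are invented for illustration).
model_evidence = [
    {2, 3, 7},       # model A's attributed tokens
    {3, 8},          # model B
    {7, 11, 12},     # model C
]
gold = {2, 3, 7, 8, 11, 12, 15}  # human-annotated complete evidence

ensemble = set().union(*model_evidence)
recall = len(ensemble & gold) / len(gold)
print(sorted(ensemble), f"recall={recall:.2f}")  # recall=0.86 vs 0.43 for A alone
```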
Result: Ensemble approach improves evidence recall over individual models; insights provided on ensemble sizes and evidence-guided training effects.
Conclusion: Ensemble methods are effective for complete evidence identification in medical NLP applications where full feature attribution is required.
Abstract: Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications, such as compliance and cataloging, the full set of contributing features must be identified: complete evidence. We present a case study using existing language models and a medical dataset which contains human-annotated complete evidence. Our findings show that an ensemble approach, aggregating evidence from several models, improves evidence recall over individual models. We examine different ensemble sizes and the effect of evidence-guided training, and provide qualitative insights.
[100] Training Language Models to Explain Their Own Computations
Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
Main category: cs.CL
TL;DR: LMs can be fine-tuned to generate natural language explanations of their own internal computations, showing better performance when explaining themselves than when other models explain them.
Details
Motivation: To investigate whether language models can leverage their privileged access to internal computations to produce faithful explanations of their own behavior, potentially offering a scalable complement to existing interpretability methods.
Method: Fine-tune LMs using existing interpretability techniques as ground truth to generate natural language descriptions of: (1) information encoded by LM features, (2) causal structure of internal activations, and (3) influence of specific input tokens on outputs. Train with tens of thousands of example explanations.
Result: Explainer models exhibit non-trivial generalization to new queries. Self-explanation works better than cross-model explanation (even with more capable models). LMs can learn to reliably explain their internal computations.
Conclusion: LM-generated explanations of internal computations offer a scalable complement to existing interpretability methods, with self-explanation showing advantages over external explanation.
Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs’ privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs’ internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models’ privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods. Code and data at https://github.com/TransluceAI/introspective-interp
[101] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu
Main category: cs.CL
TL;DR: This paper investigates how Multimodal Large Language Models (MLLMs) interpret sound symbolism using a new dataset called LEX-ICON, finding that MLLMs show phonetic intuitions aligned with linguistic research and exhibit attention patterns focused on iconic phonemes.
Details
Motivation: To use sound symbolism (non-arbitrary associations between phonetic forms and meanings) as a probe to understand how MLLMs interpret auditory information in human languages, bridging artificial intelligence and cognitive linguistics.
Method: The researchers created LEX-ICON, an extensive mimetic word dataset with 8,052 words from four languages (English, French, Japanese, Korean) and 2,930 pseudo-words, annotated with semantic features across text and audio modalities. They investigated MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory inputs with up to 25 semantic dimensions, measuring phoneme-level attention fraction scores across model layers.
Result: Key findings show: (1) MLLMs demonstrate phonetic intuitions that align with existing linguistic research across multiple semantic dimensions, and (2) phonosemantic attention patterns highlight models’ focus on iconic phonemes.
Conclusion: This work provides the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability, bridging domains of artificial intelligence and cognitive linguistics through systematic investigation of sound symbolism in multimodal models.
Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.
[102] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal
Main category: cs.CL
TL;DR: SpiderGen is an LLM-based workflow that generates Product Category Rules Process Flow Graphs for Life Cycle Assessments, reducing costs from $25k+ to under $1 and time from 21 person-days to under 10 minutes.
Details
Motivation: Climate change concerns require estimating environmental impact of consumer products through Life Cycle Assessments (LCAs), which are expensive and time-consuming to create manually.
Method: SpiderGen integrates traditional LCA taxonomy/methodology with LLM reasoning capabilities to generate PCR PFGs (graphical representations of LCA procedural information).
Result: Achieves 65% F1-Score vs 53% for one-shot prompting; produces accurate LCA process information with minor errors; outperforms chain-of-thought and one-shot prompting baselines.
Conclusion: SpiderGen significantly reduces LCA costs and time while maintaining accuracy, demonstrating practical potential for environmental impact assessment at scale.
Abstract: Investigating the effects of climate change and global warming caused by GHG emissions has been a key concern worldwide. These emissions are driven largely by the production, use, and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes as compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.
[103] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Grid-based Discontinuous NER Models
Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin
Main category: cs.CL
TL;DR: Grid-tagging NER models enhanced with image data augmentation techniques (cropping, scaling, padding) improve recognition of discontinuous entities, achieving 1-2.5% overall F1 gains and 3.7-8.4% gains specifically for discontinuous entities.
Details
Motivation: Traditional NER methods struggle with discontinuous entities, especially cross-sentence ones, due to segmentation issues that lead to missegmentation or omission, significantly impacting recognition accuracy.
Method: Integrate image data augmentation techniques (cropping, scaling, padding) into grid-tagging models to enhance their ability to handle segmentation challenges and recognize discontinuous entities.
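Translated to a 2D word-pair tag grid, the augmentations might look like the following numpy sketch; the paper's exact operators and label handling may differ, and the grid here is a stand-in.

```python
import numpy as np

def pad_grid(grid: np.ndarray, pad: int) -> np.ndarray:
    """Pad the word-pair tag grid with a 'null' label (0) on all sides."""
    return np.pad(grid, pad, constant_values=0)

def crop_grid(grid: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Keep the sub-grid for token span [lo, hi) along both axes."""
    return grid[lo:hi, lo:hi]

def scale_grid(grid: np.ndarray, k: int = 2) -> np.ndarray:
    """Nearest-neighbor upscaling: each cell becomes a k x k block."""
    return np.kron(grid, np.ones((k, k), dtype=grid.dtype))

grid = np.arange(25).reshape(5, 5)   # stand-in for a 5-token tag grid
print(crop_grid(grid, 1, 4).shape)   # (3, 3)
print(pad_grid(grid, 2).shape)       # (9, 9)
print(scale_grid(grid).shape)        # (10, 10)
```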
Result: Traditional segmentation methods fail on cross-sentence discontinuous entities. Augmented grid models achieve F1 score improvements of 1-2.5% overall and 3.7-8.4% for discontinuous entities on CADEC, ShARe13, and ShARe14 datasets.
Conclusion: Image data augmentation techniques effectively enhance grid-based models for discontinuous NER, addressing segmentation and omission issues that plague traditional methods.
Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
[104] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li, Ashish Sirasao, Emad Barsoum
Main category: cs.CL
TL;DR: Dual LoRA improves LoRA performance by separating low-rank matrices into magnitude and direction groups with ReLU and sign functions to better simulate full fine-tuning parameter updates.
Details
Motivation: Standard LoRA often has unsatisfactory performance due to its low-rank assumption, which doesn't adequately simulate the parameter updating process of full fine-tuning based on gradient-based optimization algorithms.
Method: Separates low-rank matrices into two groups: magnitude group (controls whether/how far to update parameters) with ReLU function, and direction group (decides forward/backward movement) with sign function. This better simulates full fine-tuning parameter updates.
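One plausible reading of this parameterization (ours, not necessarily the authors' exact formulation) composes the weight update elementwise from a non-negative magnitude term and a plus-or-minus-one direction term:

```python
import torch

d_out, d_in, r = 32, 64, 4
A_m, B_m = torch.randn(r, d_in) * 0.02, torch.randn(d_out, r) * 0.02
A_d, B_d = torch.randn(r, d_in) * 0.02, torch.randn(d_out, r) * 0.02

def dual_lora_delta() -> torch.Tensor:
    magnitude = torch.relu(B_m @ A_m)   # >= 0: whether and how far to move
    direction = torch.sign(B_d @ A_d)   # in {-1, 0, +1}: which way to move
    # Elementwise product, resembling a signed gradient-style step.
    # (Training sign() would need a straight-through estimator or similar;
    # omitted in this forward-only sketch.)
    return magnitude * direction

W0 = torch.randn(d_out, d_in)           # frozen pretrained weight
x = torch.randn(8, d_in)
y = x @ (W0 + dual_lora_delta()).T
print(y.shape)  # torch.Size([8, 32])
```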
Result: Consistently outperforms LoRA and state-of-the-art variants with same number of trainable parameters across various NLP tasks (natural language understanding, commonsense reasoning) on RoBERTa, DeBERTa, and LLaMA-1/2/3 models.
Conclusion: Dual LoRA effectively incorporates inductive bias into LoRA to improve performance by better simulating full fine-tuning parameter updates while maintaining parameter efficiency.
Abstract: Low-rank adaptation (LoRA) is one of the most popular parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, models trained with LoRA often show unsatisfactory performance due to the low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language understanding (NLU) and commonsense reasoning datasets on RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
[105] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation
Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, Manabu Okumura
Main category: cs.CL
TL;DR: MBR decoding outperforms MAP for error span detection in MT evaluation by using similarity-based hypothesis selection, with distillation to reduce computational cost.
Details
Motivation: Current generative ESD methods using MAP decoding assume model probabilities perfectly correlate with human annotations, but often assign higher likelihood to incorrect annotations than human ones.
Method: Apply Minimum Bayes Risk (MBR) decoding to generative ESD using sentence- or span-level similarity functions to select candidate hypotheses based on approximate similarity to human annotations.
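The MBR rule itself is standard: among sampled annotation hypotheses, pick the one with the highest average similarity to the others. A runnable sketch with a stand-in token-level F1 similarity (the paper uses sentence- or span-level functions):

```python
def mbr_select(hypotheses: list[str], sim) -> str:
    """Pick the hypothesis with the highest average similarity to the
    rest of the pool, an estimate of similarity to the unseen human
    annotation."""
    def expected_sim(h: str) -> float:
        others = [o for o in hypotheses if o is not h]
        return sum(sim(h, o) for o in others) / len(others)
    return max(hypotheses, key=expected_sim)

def token_f1(a: str, b: str) -> float:
    # Crude stand-in for the sentence/span similarity function.
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r) if p + r else 0.0

samples = ["major error: word order", "major error: word choice",
           "minor error: punctuation"]
print(mbr_select(samples, token_f1))  # 'major error: word order'
```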
Result: MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at system and sentence levels on WMT24 Metrics Shared Task.
Conclusion: MBR decoding is superior to MAP for ESD, and distillation can reduce computational cost while maintaining performance benefits.
Abstract: Error Span Detection (ESD) extends automatic machine translation (MT) evaluation by localizing translation errors and labeling their severity. Current generative ESD methods typically use Maximum a Posteriori (MAP) decoding, assuming that the model-estimated probabilities are perfectly correlated with similarity to the human annotation, but we often observe higher likelihood assigned to an incorrect annotation than to the human one. We instead apply Minimum Bayes Risk (MBR) decoding to generative ESD. We use a sentence- or span-level similarity function for MBR decoding, which selects candidate hypotheses based on their approximate similarity to the human annotation. Experimental results on the WMT24 Metrics Shared Task show that MBR decoding significantly improves span-level performance and generally matches or outperforms MAP at the system and sentence levels. To reduce the computational cost of MBR decoding, we further distill its decisions into a model decoded via greedy search, removing the inference-time latency bottleneck.
[106] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
Michael H. Coen
Main category: cs.CL
TL;DR: This paper introduces a new evaluation framework for dialogue topic segmentation that separates boundary scoring from boundary selection, using window-tolerant F1 alongside boundary density and segment alignment diagnostics to better assess segmentation quality across different granularity regimes.
Details
Motivation: Current evaluation practice for dialogue topic segmentation relies on strict boundary matching and F1-based metrics, which don't account for varying annotation granularity. Modern LLM-based conversational systems need segmentation to manage conversation history beyond fixed context windows, but current metrics fail to properly evaluate segmentation quality across different density regimes.
Method: The paper introduces an evaluation framework with: 1) window-tolerant F1 (W-F1) that allows for boundary proximity, 2) boundary density analysis, and 3) segment alignment diagnostics (purity and coverage). This framework separates boundary scoring from boundary selection, enabling evaluation across different density regimes rather than at a single operating point.
Result: Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. Boundary-based metrics are strongly coupled to boundary density - threshold sweeps produce larger W-F1 changes than switching between methods. The framework was tested on eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions.
Conclusion: Topic segmentation should be viewed as a granularity selection problem rather than prediction of a single correct boundary set. The findings motivate separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities, providing a more nuanced evaluation framework for modern conversational systems.
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large language model (LLM) based conversational systems increasingly rely on segmentation to manage conversation history beyond fixed context windows. In such systems, unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.
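Window-tolerant F1 admits a simple realization: a predicted boundary counts as a hit when an unmatched reference boundary lies within a tolerance window. The greedy one-to-one matching below is one plausible choice; the paper's exact rule may differ.

```python
def window_tolerant_f1(pred, ref, window=2):
    """F1 over boundary indices, allowing +/- `window` positions of slack."""
    unmatched = sorted(ref)
    matched = 0
    for p in sorted(pred):
        for r in unmatched:
            if abs(p - r) <= window:
                matched += 1
                unmatched.remove(r)  # enforce one-to-one matching
                break
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Boundaries given as turn indices: two of three predictions fall within the window.
print(window_tolerant_f1(pred=[3, 10, 18], ref=[4, 11, 25]))  # ~0.667
```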
[107] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds
Main category: cs.CL
TL;DR: The paper analyzes Heaps’ law in conversations, finding vocabulary scaling differs by parts of speech in both real stranger video chats and fictional movie dialogues.
Details
Motivation: Conversations are fundamental to social connection and well-being, but little research has applied Heaps' law (vocabulary scaling with document length) to conversations or examined how different language features affect this scaling.
Method: The study measures Heaps’ law in two types of conversations: 1) real conversations between strangers on video chat, and 2) fictional conversations between characters in movies. The analysis examines how vocabulary scaling differs across various parts of speech.
Result: The research finds that vocabulary size scaling follows Heaps’ law differently depending on parts of speech. Both real stranger conversations and fictional movie dialogues show this pattern, indicating systematic differences in how different word categories accumulate during conversations.
Conclusion: The findings reveal that vocabulary growth in conversations is not uniform across word types, suggesting that different parts of speech follow distinct scaling patterns. These results can be interpreted through both behavioral and linguistic frameworks to better understand conversation dynamics.
Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps’ law, which holds that vocabulary size scales with document length. Little work on Heaps’ law has looked at conversation and considered how language features impact scaling. We measure Heaps’ law for conversations recorded in two distinct mediums: 1. Strangers brought together on video chat and 2. Fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.
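Heaps' law posits V(n) ≈ K · n^β for the vocabulary size V after n tokens, so the exponent β can be estimated by log-log regression over a token stream (run separately per part of speech to compare word categories). A minimal NumPy sketch; the study's actual estimation procedure is not given in the abstract.

```python
import numpy as np

def heaps_exponent(tokens):
    """Fit V(n) ~ K * n**beta in log-log space and return beta."""
    seen, vocab_sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        vocab_sizes.append(len(seen))
    n = np.arange(1, len(tokens) + 1)
    beta, _log_k = np.polyfit(np.log(n), np.log(vocab_sizes), 1)
    return beta

words = "the cat sat on the mat and the dog sat on the log".split()
print(f"beta = {heaps_exponent(words):.3f}")
```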
[108] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
Main category: cs.CL
TL;DR: MobileWorld is a more challenging mobile-use benchmark than AndroidWorld, featuring 201 tasks across 20 apps with longer workflows, cross-app tasks, and novel evaluation categories like agent-user interaction and MCP-augmented tasks.
Details
Motivation: AndroidWorld has become saturated (agents achieving >90% success) and lacks key application categories (e-commerce, enterprise communication) and realistic scenarios with vague instructions and hybrid tool usage.
Method: Created MobileWorld with 201 tasks across 20 apps using open-source alternatives to industry standards (e.g., Mattermost for Slack) for reproducible evaluation. Features long-horizon, cross-application workflows, agent-user interaction tasks, and MCP-augmented tasks. Developed planner-executor agentic framework with extended action spaces.
Result: Significant performance drop compared to AndroidWorld: best agentic framework achieved 51.7% success rate, best end-to-end model achieved 20.9% success rate, showing ample room for improvement.
Conclusion: MobileWorld provides a more challenging and realistic benchmark for mobile-use agents, highlighting current limitations and creating substantial headroom for future research in mobile AI assistance.
Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows, requiring nearly twice as many completion steps on average (27.8 vs. 14.3) and featuring a significantly higher proportion of multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. To overcome the limitations of existing environments, MobileWorld achieves a balance between production-grade utility and reproducible evaluation by utilizing open-source alternatives to industry standards (e.g., Mattermost for Slack). This approach enables a fully observable and controlled environment through source code modification and direct backend database access for precise verification. MobileWorld also introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. To facilitate evaluation, we develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting ample headroom for future research.
[109] Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations
Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, Chuang Hu
Main category: cs.CL
TL;DR: HIC-Bench is a novel evaluation framework that categorizes LLM hallucinations into Intelligent Hallucinations (creative/valuable) and Defective Hallucinations (erroneous), enabling systematic study of their interplay in scientific innovation.
Details
Motivation: Current hallucination detection methods focus too narrowly on factual consistency, failing to handle heterogeneous scientific tasks and balance creativity with accuracy. There's a need to quantify the creative and epistemically valuable aspects of hallucinations that are overlooked in existing literature.
Method: Proposes HIC-Bench framework with: (1) Structured IH/DH assessment using multi-dimensional metrics combining TTCT creativity metrics (Originality, Feasibility, Value) with hallucination-specific dimensions; (2) Cross-domain applicability across ten scientific domains; (3) Dynamic Prompt Optimization using DHP to guide models toward creative and reliable outputs. Uses multiple LLM judges with human verification.
Result: Reveals a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. Shows that Intelligent Hallucinations can serve as catalysts for creativity and drive scientific innovation.
Conclusion: HIC-Bench provides a valuable platform for advancing research into the creative intelligence of LLM hallucinations, positioning IH as a creative catalyst and revealing LLM hallucinations’ potential to drive scientific innovation beyond being mere errors.
Abstract: Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.
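The multi-judge scoring step is easy to sketch: each LLM judge rates every dimension, and per-dimension averages mitigate single-judge bias. Dimension names follow the summary above; the rating scale is an assumption.

```python
import statistics

DIMENSIONS = ["originality", "feasibility", "value",
              "scientific_plausibility", "factual_deviation"]

def aggregate(judge_scores):
    """judge_scores: one {dimension: score} dict per LLM judge (1-5 scale assumed)."""
    return {d: statistics.mean(j[d] for j in judge_scores) for d in DIMENSIONS}

judges = [
    {"originality": 4, "feasibility": 3, "value": 4,
     "scientific_plausibility": 3, "factual_deviation": 2},
    {"originality": 5, "feasibility": 2, "value": 4,
     "scientific_plausibility": 4, "factual_deviation": 1},
]
print(aggregate(judges))
```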
[110] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao
Main category: cs.CL
TL;DR: UniHetero shows that semantic-level generation (not pixel-level) enhances vision understanding at scale, reveals superior data scaling trends, and demonstrates effective visual detail capture through autoregression on input embeddings.
Details
Motivation: To explore whether visual generation can enhance visual understanding in unified vision-language models at large data scale (>200M samples), challenging the common assumption that generation naturally strengthens understanding.
Method: Proposes UniHetero, a concise unified model structure tested with large-scale pretraining. Key approaches: semantic-level autoregression of high-level visual representations inside LLM (not pixel-level generation), analysis of data scaling trends, and using autoregression on input embeddings instead of vision encoders.
Result: Three key findings: (1) Generation improves understanding only at semantic level, not pixel level (which degrades performance); (2) Unified generation-understanding shows superior data scaling and higher data utilization than understanding alone; (3) Autoregression on input embeddings effectively captures visual details with less cumulative error and is modality-independent.
Conclusion: Semantic-level visual generation enhances understanding in unified vision-language models at scale, revealing better data efficiency and scaling properties, while providing a modality-independent approach for capturing visual details through autoregression on embeddings.
Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate semantics, not pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with the LLM, understanding performance often degrades. (2) Generation reveals a superior data scaling trend and higher data utilization. Unified generation-understanding demonstrates a superior scaling trend compared to understanding alone, revealing a more effective way to learn vision-only knowledge directly from the vision modality rather than through captioning to text. (3) Autoregression on input embeddings is effective for capturing visual details. Compared to the commonly used vision encoder, visual autoregression on input embeddings shows less cumulative error and is modality-independent, so it can be extended to all modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors, and further enable pixel-level image generation.
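A minimal sketch of what "autoregression on input embeddings" could look like: predict the next high-level visual embedding from the previous one and score with a cosine objective. The single linear layer stands in for the LLM backbone, and the loss choice is an assumption, not the paper's stated objective.

```python
import torch
import torch.nn as nn

visual_tokens = torch.randn(2, 16, 256)   # (batch, visual tokens, dim) from input embeddings
backbone = nn.Linear(256, 256)            # stand-in for the LLM layers

pred = backbone(visual_tokens[:, :-1])    # predict embedding t+1 from embedding t
target = visual_tokens[:, 1:].detach()    # semantic targets; no pixel decoding involved
loss = 1 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()
loss.backward()
print(loss.item())
```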
cs.CV
[111] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments
Ankan Aich, Yangming Lee
Main category: cs.CV
TL;DR: Monocular depth estimation for robotic surgery using synthetic priors from Depth Anything V2 with DV-LORA adaptation, achieving SOTA results with 98.1% accuracy and 17% error reduction on SCARED dataset.
Details
Motivation: Current self-supervised monocular depth estimation methods fail in endoscopic environments with specular reflections and fluids, suffering from boundary collapse on surgical tools and transparent surfaces.
Method: Leverage synthetic priors from Depth Anything V2 architecture, adapt to medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA) with minimal parameters, and introduce physically-stratified evaluation protocol on SCARED dataset.
Result: Achieves new state-of-the-art with 98.1% accuracy (< 1.25 threshold) and reduces Squared Relative Error by over 17% compared to baselines, demonstrating superior robustness in adverse surgical lighting.
Conclusion: The approach effectively bridges synthetic-to-real gap for surgical depth estimation, handling specular and fluid-filled environments while preserving geometric details of thin surgical structures.
Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
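The two headline numbers are standard monocular-depth metrics: the δ < 1.25 accuracy (fraction of pixels whose prediction/ground-truth ratio stays below 1.25) and the Squared Relative Error. A reference implementation:

```python
import numpy as np

def depth_metrics(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)
    delta_125 = float(np.mean(ratio < 1.25))        # accuracy at delta < 1.25
    sq_rel = float(np.mean((pred - gt) ** 2 / gt))  # squared relative error
    return delta_125, sq_rel

acc, sq_rel = depth_metrics(pred=[1.0, 2.1, 2.9], gt=[1.1, 2.0, 3.0])
print(f"delta<1.25: {acc:.3f}, SqRel: {sq_rel:.4f}")
```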
[112] Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments
Surya Rayala, Marcos Quinones-Grueiro, Naveeduddin Mohammed, Ashwin T S, Benjamin Goldberg, Randall Spain, Paige Lawton, Gautam Biswas
Main category: cs.CV
TL;DR: Video-based assessment pipeline for urban warfare training using computer vision to extract performance metrics from training videos without additional hardware.
Details
Motivation: Current urban warfare training lacks scalable, objective performance assessment methods. Traditional approaches rely on costly sensors or subjective human observation, limiting accuracy and scalability in Synthetic Training Environments.
Method: Computer vision models extract 2D skeletons, gaze vectors, and movement trajectories from training videos. Task-specific metrics measure psychomotor fluency, situational awareness, and team coordination. These feed into an extended Cognitive Task Analysis hierarchy with weighted combination for overall scores.
Result: Demonstrated approach with real-world Enter and Clear the Room drills, providing actionable domain-specific metrics for individual and team performance. Supports After Action Reviews with interactive dashboards in Gamemaster and GIFT frameworks.
Conclusion: Video-based assessment enables scalable evaluation in Synthetic Training Environments. Limitations include tracking difficulties and ground-truth validation. Future work includes 3D video analysis and broader application of video analysis for scalable STE evaluation.
Abstract: Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain-specific metrics that capture individual and team performance. We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.
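The roll-up from task-specific metrics to an overall score is a weighted combination; in the sketch below, metric names and weights are hypothetical placeholders for the CTA hierarchy's actual values.

```python
def overall_score(metrics, weights):
    """Weighted combination of normalized metric values (weights sum to 1)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * metrics[k] for k in weights)

metrics = {"psychomotor_fluency": 0.8, "situational_awareness": 0.6,
           "team_coordination": 0.7}
weights = {"psychomotor_fluency": 0.4, "situational_awareness": 0.3,
           "team_coordination": 0.3}
print(f"overall: {overall_score(metrics, weights):.2f}")  # 0.71
```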
[113] Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval
Yizhi Liu, Ruitao Pu, Shilin Xu, Yingke Chen, Quan-Hui Liu, Yuan Sun
Main category: cs.CV
TL;DR: NIRNL is a robust cross-modal retrieval framework that handles noisy labels through neighbor-aware instance refining and cross-modal margin preserving to enhance discrimination and data utilization.
Details
Motivation: Cross-modal retrieval suffers from noisy labels in multi-modal data collection, which degrades model performance. Existing robust CMR methods fail to simultaneously achieve high performance ceilings, reliable calibration, and good data utilization rates.
Method: Proposes NIRNL framework with two key components: 1) Cross-modal Margin Preserving (CMP) to adjust relative distances between positive/negative pairs for better discrimination, and 2) Neighbor-aware Instance Refining (NIR) to identify pure, hard, and noisy subsets through cross-modal neighborhood consensus, then applies tailored optimization strategies for each subset.
Result: Extensive experiments on three benchmark datasets show NIRNL achieves state-of-the-art performance with remarkable robustness, especially under high noise rates.
Conclusion: NIRNL effectively addresses noisy label problems in cross-modal retrieval by combining margin preserving and neighbor-aware instance refining, maximizing data utilization while mitigating error propagation.
Abstract: In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify a pure subset, a hard subset, and a noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.
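The margin idea builds on the textbook objective of pushing positive cross-modal pairs closer than negatives by at least a margin. The sketch below is that generic triplet form, not NIRNL's exact CMP loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_margin_loss(img, txt_pos, txt_neg, margin=0.2):
    """Positive image-text pairs should be closer than negatives by `margin`."""
    d_pos = 1 - F.cosine_similarity(img, txt_pos)  # distance to matching text
    d_neg = 1 - F.cosine_similarity(img, txt_neg)  # distance to non-matching text
    return F.relu(d_pos - d_neg + margin).mean()

img = torch.randn(4, 128)
loss = cross_modal_margin_loss(img, torch.randn(4, 128), torch.randn(4, 128))
print(loss.item())
```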
[114] Pretraining Frame Preservation in Autoregressive Video Memory Compression
Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
Main category: cs.CV
TL;DR: PFP is a neural network that compresses long videos into short contexts while preserving high-frequency details of individual frames, enabling efficient long-term memory for autoregressive video models.
Details
Motivation: The paper aims to address the challenge of handling long video sequences in autoregressive models, where maintaining long history memory typically requires high context costs. There's a need to compress video content efficiently while preserving visual details for accurate frame retrieval.
Method: PFP uses a neural network structure with explicit pretraining objectives to preserve high-frequency details of single frames at arbitrary temporal positions. The model compresses 20-second videos into contexts of about 5k length, allowing random frame retrieval with perceptually preserved appearances. The pretrained models can be fine-tuned as memory encoders for autoregressive video models.
Result: The framework enables long history memory with low context cost and relatively low fidelity loss. The authors evaluate the approach with ablative settings and discuss trade-offs of different neural architecture designs.
Conclusion: PFP provides an effective solution for compressing long videos into compact representations while maintaining frame-level details, making it suitable as a memory encoder for autoregressive video models that need to handle extended temporal sequences efficiently.
Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
[115] Factorized Learning for Temporally Grounded Video-Language Models
Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng
Main category: cs.CV
TL;DR: D²VLM is a video-language model framework that factorizes temporal grounding and textual response learning using evidence tokens and factorized preference optimization, achieving better event-level video understanding.
Details
Motivation: Existing video-language models struggle with accurate temporal grounding for event-level perception. Current approaches handle temporal grounding and textual response in a coupled manner without clear logical hierarchy, leading to suboptimal performance.
Method: Proposes D²VLM framework with: 1) “Grounding then answering with evidence referencing” paradigm using evidence tokens for event-level visual semantic capture; 2) Factorized Preference Optimization (FPO) algorithm that explicitly incorporates probabilistic temporal grounding modeling into optimization; 3) Synthetic dataset construction for factorized preference learning.
Result: Experiments on various tasks demonstrate clear advantages over existing approaches, showing improved temporal grounding and textual response capabilities.
Conclusion: Factorizing temporal grounding and textual response learning with explicit dependency modeling and specialized optimization leads to better video-language understanding performance.
Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a “grounding then answering with evidence referencing” paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.
[116] Lifelong Domain Adaptive 3D Human Pose Estimation
Qucheng Peng, Hongfei Xue, Pu Wang, Chen Chen
Main category: cs.CV
TL;DR: Proposes lifelong domain adaptation for 3D human pose estimation to handle non-stationary target datasets, using a novel GAN framework with 3D pose generators, 2D discriminator, and 3D estimator to mitigate domain shifts and combat catastrophic forgetting.
Details
Motivation: 3D HPE struggles with generalization to diverse real-world scenarios due to reliance on controlled environment data. Existing domain adaptation methods overlook non-stationary target pose datasets, creating a need for lifelong adaptation that handles sequential target domains while preserving previous knowledge.
Method: Introduces lifelong domain adaptation for 3D HPE with a GAN framework containing 3D pose generators, 2D pose discriminator, and 3D pose estimator. Uses novel 3D pose generator paradigm integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance current domain adaptation and alleviate catastrophic forgetting.
Result: Demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets, showing effective mitigation of domain shifts and alignment of original and augmented poses.
Conclusion: First to introduce lifelong domain adaptation to 3D HPE, successfully addressing challenges of adapting to current domains while preserving knowledge from previous domains, with a framework that effectively combats catastrophic forgetting in non-stationary target datasets.
Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain’s adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.
[117] LiftProj: Space Lifting and Projection-Based Panorama Stitching
Yuan Jia, Ruimin Wu, Rui Song, Jiaojiao Li, Bin Song
Main category: cs.CV
TL;DR: A 3D panoramic stitching framework that lifts images to 3D point clouds for global fusion, then projects to a panoramic manifold, reducing distortions in complex scenes with parallax and occlusions.
Details
Motivation: Traditional 2D homography and mesh warping methods fail in real 3D scenes with multiple depth layers and occlusions, causing ghosting, bending, and stretching distortions, especially in multi-view and 360° closed-loop stitching scenarios.
Method: 1) Lift input images to dense 3D point representations in unified coordinate system with confidence metrics; 2) Establish unified 3D projection center and use equidistant cylindrical projection to map fused data to panoramic manifold; 3) Perform hole filling in canvas domain to address unknown regions from viewpoint transitions.
Result: Experimental evaluations show the method substantially mitigates geometric distortions and ghosting artifacts in scenarios with significant parallax and complex occlusions, producing more natural and consistent panoramic results.
Conclusion: The framework successfully shifts stitching from 2D warping to 3D consistency paradigm, offering flexibility to incorporate various 3D lifting and completion modules for improved panoramic quality in complex real-world scenes.
Abstract: Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
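The equidistant cylindrical (equirectangular) projection step is standard: each fused 3D point, expressed relative to the unified projection center, maps to panorama pixels via its longitude and latitude. A sketch assuming a y-up coordinate convention; variable names are illustrative.

```python
import numpy as np

def project_to_panorama(points, center, width=4096, height=2048):
    """points: (N, 3) array; returns (N, 2) pixel coordinates (u, v)."""
    p = np.asarray(points, float) - np.asarray(center, float)
    r = np.linalg.norm(p, axis=1)
    lon = np.arctan2(p[:, 0], p[:, 2])                 # in [-pi, pi]
    lat = np.arcsin(np.clip(p[:, 1] / r, -1.0, 1.0))   # in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return np.stack([u, v], axis=1)

pts = np.array([[1.0, 0.2, 2.0], [-0.5, -0.1, 1.0]])
print(project_to_panorama(pts, center=[0.0, 0.0, 0.0]))
```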
[118] MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework
Krithika Iyer, Austin Tapp, Athelia Paulli, Gabrielle Dickerson, Syed Muhammad Anwar, Natasha Lepore, Marius George Linguraru
Main category: cs.CV
TL;DR: Deep learning pipeline transforms pediatric T1-weighted MRIs into synthetic CTs for cranial bone segmentation and suture analysis, achieving high accuracy without radiation exposure.
Details
Motivation: CT scans provide detailed cranial bone and suture information but involve harmful ionizing radiation for children. MRI is radiation-free but cannot visualize cranial sutures or assess bone density. There's a need for non-invasive methods to evaluate pediatric cranial development and suture ossification.
Method: Proposes a deep learning pipeline that transforms T1-weighted MRIs of children (0.2-2 years) into synthetic CTs (sCTs). The method predicts detailed cranial bone segmentation, generates suture probability heatmaps, and derives direct suture segmentation from these heatmaps using domain-specific variational autoencoders.
Result: sCTs achieved 99% structural similarity and Frechet inception distance of 1.01 compared to real CTs. Skull segmentation attained average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Statistical equivalence between sCTs and real CTs was confirmed using two one-sided tests (TOST p < 0.05).
Conclusion: This is the first pediatric cranial CT synthesis framework enabling suture segmentation from MRI-derived sCTs, despite MRI’s limited bone depiction. The method bridges critical gaps in non-invasive cranial evaluation by generating perceptually indistinguishable cranial sCTs from routine pediatric MRIs.
Abstract: Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep learning-driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Frechet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI’s limited depiction of bone and sutures. By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non-invasive cranial evaluation.
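The reported Dice coefficient is the standard overlap measure for binary masks, Dice = 2|A ∩ B| / (|A| + |B|):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(f"Dice = {dice(a, b):.3f}")  # 0.667
```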
[119] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films
Rongji Xun, Junjie Yuan, Zhongjie Wang
Main category: cs.CV
TL;DR: HaineiFRDM is a diffusion-based film restoration framework that enables high-resolution restoration on limited GPU memory using patch-wise training, global-local fusion modules, and a novel dataset with real-degraded films.
Details
Motivation: Existing open-source film restoration methods have limited performance due to training with low-quality synthetic data, noisy optical flows, and inability to handle high-resolution films. There's a need for better open-source alternatives to commercial methods.
Method: Proposes HaineiFRDM with: 1) Patch-wise training/testing for high-resolution films on 24GB GPU, 2) Position-aware Global Prompt and Frame Fusion Modules, 3) Global-local frequency module for consistent textures, 4) Low-resolution restoration as global residual to reduce blocky artifacts, 5) New dataset with real-degraded films and realistic synthetic data.
Result: Comprehensive experiments demonstrate superior defect restoration ability over existing open-source methods. The model effectively handles high-resolution film restoration with better quality than current open-source approaches.
Conclusion: HaineiFRDM successfully addresses limitations of existing open-source film restoration methods by leveraging diffusion models’ content understanding, enabling high-resolution restoration on accessible hardware, and providing a better dataset. The framework and dataset will be released to advance open-source film restoration.
Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by open-source methods. We propose HaineiFRDM (Film Restoration Diffusion Model), a film restoration framework, to explore the diffusion model’s powerful content-understanding ability to help human experts better restore indistinguishable film defects. Specifically, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAM GPU possible and design position-aware Global Prompt and Frame Fusion Modules. Also, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we first restore a low-resolution result and use it as a global residual to mitigate blocky artifacts caused by the patching process. Furthermore, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic data. Comprehensive experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.
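A schematic of the patch-wise inference with a low-resolution global residual described in the abstract: a coarse full-frame pass provides a shared base, and each patch adds detail on top of it, suppressing blocky seams. `predict_residual` and `restore_lowres` are hypothetical stand-ins for the diffusion-model calls.

```python
import numpy as np

def patchwise_restore(frame, predict_residual, restore_lowres, patch=512):
    """Restore `frame` patch by patch on top of a coarse full-frame base."""
    h, w = frame.shape[:2]
    out = restore_lowres(frame).astype(float)  # shared global base, same H x W
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = frame[y:y + patch, x:x + patch]
            ph, pw = tile.shape[:2]
            # Each patch contributes detail relative to the shared base,
            # so neighboring patches stay globally consistent.
            out[y:y + ph, x:x + pw] += predict_residual(tile)
    return out

frame = np.random.rand(1024, 1024, 3)
restored = patchwise_restore(frame,
                             predict_residual=lambda t: 0.1 * (0.5 - t),
                             restore_lowres=lambda f: f)
print(restored.shape)
```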
[120] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale
Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall, Matthew O’Neill, Shivangi Sarkar, Lowell Weissman, Eric Hughes, Guido Zarrella
Main category: cs.CV
TL;DR: The paper explores scaling laws for training foundation models on massive-scale electro-optical satellite data, finding that performance remains data-limited even at petascale, with implications for remote sensing AI development.
Details
Motivation: Current scaling laws for AI models are well-established for natural images with abundant internet data, but poorly understood for high-value domains like remote sensing where data is more limited and specialized.
Method: Used over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox to train progressively larger vision transformer (ViT) backbones, analyzing scaling behaviors at petascale.
Result: Even at massive scale (petascale), performance remains consistent with a data-limited regime rather than model parameter-limited one, with observed success and failure modes at extreme scales.
Conclusion: The findings provide practical insights to inform data-collection strategies, compute budgets, and optimization schedules for developing frontier-scale remote sensing foundation models, helping bridge domain gaps across RS modalities.
Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data-limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.
[121] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model
Devendra K. Jangid, Ripon K. Saha, Dilshan Godaliyadda, Jing Li, Seok-Jun Lee, Hamid R. Sheikh
Main category: cs.CV
TL;DR: The paper introduces a Feature-to-Image Diffusion (F2IDiff) Foundation Model for Single Image Super-Resolution that uses lower-level DINOv2 features instead of text features to provide stricter, hallucination-free conditioning suitable for consumer photography.
Details
Motivation: Current Text-to-Image Diffusion models for SISR cause undesirable hallucinations in consumer photography where LR images have high fidelity. Text features are too high-level to describe subtle textures, and smartphone LR images (12MP+) are much larger than what T2IDiff models handle (<1MP), requiring patch-based inference where text features fail.
Method: Proposes a Feature-to-Image Diffusion (F2IDiff) Foundation Model that uses lower-level DINOv2 features for conditioning instead of text features. These lower-level features provide stricter conditioning while being rich descriptors of even small image patches.
Result: Not specified in the abstract, but the proposed method addresses the shortcomings of text-conditioned diffusion models for SISR in consumer photography applications.
Conclusion: Lower-level feature conditioning (DINOv2 features) is more suitable for SISR in consumer photography than text conditioning, as it provides stricter, hallucination-free generation while maintaining rich descriptive capabilities for small image patches.
Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM’s conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least $12MP$, whereas SISR networks built on T2IDiff FMs are designed to perform inference on much smaller images ($<1MP$). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on an FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model. Lower-level features provide stricter conditioning while being rich descriptors of even small patches.
[122] Learning to learn skill assessment for fetal ultrasound scanning
Yipei Wang, Qianye Yang, Lior Drukker, Aris T. Papageorghiou, Yipeng Hu, J. Alison Noble
Main category: cs.CV
TL;DR: A bi-level optimization framework for automated fetal ultrasound skill assessment that predicts skills based on task performance without predefined ratings.
Details
Motivation: Traditional ultrasound skill assessment is subjective and time-intensive, while existing automated methods rely on supervised learning with predetermined factors, limiting their flexibility and objectivity.
Method: Proposes a novel bi-level optimization framework with two jointly optimized networks: a clinical task predictor and a skill predictor. The framework assesses skills by how well tasks are performed on acquired ultrasound images, without using manually predefined skill ratings.
Result: Validated on real-world clinical ultrasound videos of fetal head scanning. Results demonstrate feasibility of predicting ultrasound skills by quantifying optimized task performance as a skill indicator.
Conclusion: The proposed framework offers an objective, automated approach to ultrasound skill assessment that avoids the subjectivity of expert supervision and limitations of predetermined skill factors in traditional methods.
Abstract: Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.
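A schematic of the joint optimization idea: the skill predictor is supervised by how well the clinical task predictor performs on the acquired frames, so no manual skill ratings are needed. Network sizes, losses, and the skill target below are illustrative assumptions, not the paper's exact bi-level scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

task_net = nn.Linear(64, 10)   # clinical task predictor (placeholder)
skill_net = nn.Linear(64, 1)   # skill predictor (placeholder)
opt = torch.optim.Adam(list(task_net.parameters()) + list(skill_net.parameters()),
                       lr=1e-3)

frames = torch.randn(8, 64)                # stand-in for per-clip video features
task_labels = torch.randint(0, 10, (8,))   # e.g., hypothetical anatomical-plane labels

for step in range(100):
    task_loss = F.cross_entropy(task_net(frames), task_labels, reduction="none")
    # Skill target: clips on which the task is solved well get a high score.
    skill_target = (-task_loss).detach().sigmoid()
    skill_loss = F.mse_loss(skill_net(frames).squeeze(-1), skill_target)
    opt.zero_grad()
    (task_loss.mean() + skill_loss).backward()
    opt.step()
```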
[123] Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Main category: cs.CV
TL;DR: EASI is a comprehensive evaluation framework for assessing multimodal LLMs’ spatial intelligence, revealing that while GPT-5 shows unprecedented strength, all models still significantly lag behind human performance on spatial reasoning tasks.
Details
Motivation: Despite remarkable progress in multimodal models, they still exhibit notable limitations in spatial understanding and reasoning - a crucial capability for artificial general intelligence in the physical world. With the release of GPT-5, it's timely to systematically evaluate leading models' spatial intelligence capabilities.
Method: EASI proposes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and newly curated ones. The study evaluates eight key benchmarks across leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) at a cost exceeding ten billion total tokens, plus qualitative evaluation on human-intuitive scenarios.
Result: (1) GPT-5 demonstrates unprecedented strength in spatial intelligence but (2) still falls significantly short of human performance across broad SI-tasks. (3) SI-tasks expose greater model capability deficiency than non-SI tasks, and (4) proprietary models don’t show decisive advantage on the most difficult tasks.
Conclusion: Spatial intelligence remains a significant challenge for current multimodal models. EASI provides an open-source framework and leaderboard to accelerate collective progress toward robust spatial intelligence, highlighting the need for continued research in this fundamental capability.
Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.
[124] MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation
Yulong Zou, Bo Liu, Cun-Jing Zheng, Yuan-ming Geng, Siyue Li, Qiankun Zuo, Shuihua Wang, Yudong Zhang, Jin Hong
Main category: cs.CV
TL;DR: A novel meta-guided multimodal learning framework for brain tumor segmentation that handles incomplete MRI data through adaptive modality fusion and consistency regularization.
Details
Motivation: Multimodal MRI data is crucial for brain tumor segmentation but often incomplete in clinical practice, creating a need for methods that can effectively utilize available information despite missing modalities.
Method: Proposes MGML framework with two components: 1) Meta-parameterized adaptive modality fusion (Meta-AMF) that integrates information from available modalities and generates adaptive soft-label supervision, and 2) Consistency regularization module that enhances segmentation performance and framework robustness.
Result: Achieved superior performance on BraTS2020 and BraTS2023 datasets compared to state-of-the-art methods. On BraTS2020, obtained average Dice scores of 87.55 (WT), 79.36 (TC), and 62.67 (ET) across fifteen missing modality combinations.
Conclusion: The proposed MGML framework effectively handles incomplete multimodal MRI data without altering original model architecture, can be integrated into training pipelines, and demonstrates strong performance on brain tumor segmentation tasks.
Abstract: Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at https://github.com/worldlikerr/MGML.
[125] Scaling Spatial Intelligence with Multimodal Foundation Models
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Main category: cs.CV
TL;DR: SenseNova-SI family scales multimodal foundation models with 8M diverse spatial data samples to achieve state-of-the-art spatial intelligence while maintaining strong general multimodal understanding.
Details
Motivation: Despite progress in multimodal foundation models, they still show surprising deficiencies in spatial intelligence. The work aims to address this gap by scaling up models specifically for spatial reasoning capabilities.
Method: Built upon established multimodal foundations (Qwen3-VL, InternVL3, Bagel), systematically curated SenseNova-SI-8M dataset with 8 million diverse samples under rigorous spatial capability taxonomy. Used principled approach to construct high-performing and robust spatial intelligence.
Result: Achieved unprecedented performance across spatial benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, 50.1% on SITE, while maintaining 84.9% on MMBench-En for general understanding. Analyzed data scaling effects, emergent generalization, overfitting risks, and spatial chain-of-thought reasoning.
Conclusion: SenseNova-SI demonstrates successful cultivation of spatial intelligence through systematic data scaling, shows early signs of emergent generalization, and validates practical applications. The project is ongoing with all models publicly released to advance spatial intelligence research.
Abstract: Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
[126] Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation
Hualin Ye, Bingxi Liu, Jixiang Du, Yu Qin, Ziyi Chen, Hong Zhang
Main category: cs.CV
TL;DR: Proposes a novel cross-view geo-localization system with three key improvements: DINOv2 backbone with convolution adapter, multi-scale channel reallocation module, and improved aggregation with MoE routing for adaptive feature processing.
Details
Motivation: Cross-view geo-localization faces significant challenges due to viewpoint discrepancies between query and database images, making feature aggregation and alignment difficult. Existing methods struggle with effective feature representation across different viewpoints.
Method: Three main components: 1) DINOv2 backbone with convolution adapter for better adaptation to cross-view variations, 2) Multi-scale channel reallocation module to enhance spatial representation diversity and stability, 3) Improved aggregation module with Mixture-of-Experts routing that dynamically selects expert subspaces for keys and values in cross-attention framework.
Result: Achieves competitive performance on University-1652 and SUES-200 datasets with fewer trained parameters compared to existing methods.
Conclusion: The proposed system effectively addresses cross-view geo-localization challenges through adaptive feature processing and efficient parameter usage, demonstrating strong performance on benchmark datasets.
Abstract: Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant view-point discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with a convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.
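To make the aggregation idea concrete, here is a minimal PyTorch sketch of cross-attention whose key/value projections are mixed by a learned router, in the spirit of the paper's MoE aggregation module; the dimensions, soft routing, and learnable queries are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERoutedCrossAttention(nn.Module):
    """Sketch: keys/values are produced by router-weighted expert projections,
    then aggregated by learnable queries via cross-attention."""
    def __init__(self, dim, num_experts=4, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable queries
        self.router = nn.Linear(dim, num_experts)                   # per-token expert scores
        self.k_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.v_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, feats):                      # feats: (B, N, dim) backbone tokens
        gate = F.softmax(self.router(feats), dim=-1)           # (B, N, E)
        k = torch.stack([e(feats) for e in self.k_experts], 2)  # (B, N, E, dim)
        v = torch.stack([e(feats) for e in self.v_experts], 2)
        k = (gate.unsqueeze(-1) * k).sum(2)        # expert-weighted keys (B, N, dim)
        v = (gate.unsqueeze(-1) * v).sum(2)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).flatten(1)               # global descriptor for retrieval

out = MoERoutedCrossAttention(64)(torch.randn(2, 100, 64))  # -> (2, 8*64)
```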
[127] Kinematic-Based Assessment of Surgical Actions in Microanastomosis
Yan Meng, Daniel Donoho, Marcelle Altshuler, Omar Arnaout
Main category: cs.CV
TL;DR: AI framework for automated action segmentation and performance assessment in microanastomosis surgery using computer vision and machine learning for objective surgical skill evaluation.
Details
Motivation: Microanastomosis requires precise surgical skills, but current assessment methods rely on subjective expert evaluation which suffers from inter-rater variability, inconsistency, and time consumption, necessitating automated, objective solutions.
Method: Three-component AI framework: (1) YOLO and DeepSORT for object tip tracking, (2) self-similarity matrix for action boundary detection and unsupervised clustering for segmentation, (3) supervised classification for surgical gesture proficiency evaluation.
Result: Achieved 92.4% frame-level action segmentation accuracy and 85.5% overall skill classification accuracy on 58 expert-rated microanastomosis videos, effectively replicating expert evaluations.
Conclusion: The AI framework enables objective, real-time feedback for microsurgical training, supporting standardized, data-driven protocols and advancing competency assessment in high-stakes surgical environments.
Abstract: Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
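A minimal sketch of the boundary-detection step described above: build a self-similarity matrix from tracked tip kinematics and score candidate boundaries with a checkerboard novelty kernel (Foote's classic recipe). The paper's exact features, kernel, and clustering may differ.

```python
import numpy as np
from scipy.signal import find_peaks

def action_boundaries(traj, kernel_size=16, prominence=0.1):
    """traj: (T, D) per-frame kinematic features, e.g., tracked instrument tips."""
    X = (traj - traj.mean(0)) / (traj.std(0) + 1e-8)   # standardize each feature
    ssm = X @ X.T / X.shape[1]                          # (T, T) self-similarity matrix
    half = kernel_size // 2
    sign = np.outer(np.r_[np.ones(half), -np.ones(half)],
                    np.r_[np.ones(half), -np.ones(half)])  # checkerboard kernel
    novelty = np.zeros(len(X))
    for t in range(half, len(X) - half):                # correlate kernel along diagonal
        novelty[t] = np.sum(sign * ssm[t - half:t + half, t - half:t + half])
    peaks, _ = find_peaks(novelty, prominence=prominence * novelty.max())
    return peaks                                        # candidate action boundaries

bounds = action_boundaries(np.cumsum(np.random.randn(300, 4), axis=0))
```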
[128] U-Net-Like Spiking Neural Networks for Single Image Dehazing
Huibin Li, Haoran Liu, Mingzhe Liu, Yulong Xiao, Peng Li, Guibin Zan
Main category: cs.CV
TL;DR: DehazeSNN: A novel image dehazing architecture combining U-Net design with Spiking Neural Networks to address limitations of CNNs and Transformers, achieving competitive performance with reduced computational cost.
Details
Motivation: Traditional dehazing methods rely on atmospheric scattering models, while modern deep learning approaches (CNNs and Transformers) have limitations: CNNs struggle with long-range dependencies, and Transformers require heavy computational resources.
Method: Proposes DehazeSNN, a U-Net-like architecture integrated with Spiking Neural Networks (SNNs) that captures multi-scale features while managing local and long-range dependencies efficiently. Introduces Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) to enhance cross-channel communication.
Result: Extensive experiments show DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with smaller model size and fewer multiply-accumulate operations.
Conclusion: DehazeSNN provides an effective solution for image dehazing that overcomes limitations of existing approaches, offering superior performance with reduced computational burden, making it practical for real-world applications.
Abstract: Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive with state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and fewer multiply-accumulate operations. The proposed dehazing method is publicly available at https://github.com/HaoranLiu507/DehazeSNN.
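For readers unfamiliar with SNNs, the basic building block is the leaky-integrate-and-fire neuron; a minimal sketch follows. The paper's OLIFBlock adds orthogonal cross-channel mixing on top of such dynamics, which is not reproduced here.

```python
import torch

def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Minimal leaky-integrate-and-fire dynamics over a spike-time dimension.
    inputs: (T, B, C) input currents per timestep."""
    mem = torch.zeros_like(inputs[0])
    spikes = []
    for t in range(inputs.shape[0]):
        mem = beta * mem + inputs[t]         # leaky integration of input current
        spk = (mem >= threshold).float()     # fire when the threshold is crossed
        mem = mem - spk * threshold          # soft reset by subtraction
        spikes.append(spk)
    return torch.stack(spikes)               # binary spike trains, shape (T, B, C)

out = lif_forward(torch.rand(8, 2, 16))
```

Training such spiking units end-to-end typically relies on surrogate gradients for the non-differentiable threshold, which is why SNN denoisers can be optimized like ordinary networks.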
[129] Effective Online Exam Proctoring by Combining Lightweight Face Detection and Deep Recognition
Xu Yang, Juantao Zhong, Daoyuan Wu, Xiao Yi, Jimmy H. M. Lee, Tan Lee, Peng Han
Main category: cs.CV
TL;DR: iExam is an online exam proctoring system that uses real-time face detection and deep face recognition to monitor student presence and detect cheating behaviors during Zoom-based exams with high accuracy and low overhead.
Details
Motivation: Online exams via platforms like Zoom have become widespread, but ensuring exam integrity is challenging due to the difficulty of monitoring multiple video feeds in real time. Current solutions lack efficient monitoring and analysis capabilities.
Method: iExam combines lightweight real-time face detection with deep face recognition for post-exam analysis. It monitors student presence, detects abnormal behaviors (face disappearance, rotation, identity substitution), uses enhanced OCR for dynamic Zoom name tags, and is designed for resource-efficient training/inference on standard teacher devices.
Result: Extensive experiments show iExam achieves 90.4% accuracy in real-time face detection and 98.4% accuracy in post-exam face recognition with low overhead, demonstrating practical effectiveness for online exam proctoring.
Conclusion: iExam provides a practical and effective solution for online exam proctoring by addressing key challenges in real-time monitoring and post-exam analysis, with high accuracy and low computational requirements suitable for standard teacher devices.
Abstract: Online exams conducted via video conferencing platforms such as Zoom have become widespread, yet ensuring exam integrity remains challenging due to the difficulty of monitoring multiple video feeds in real time. We present iExam, an online exam proctoring and analysis system that combines lightweight real-time face detection with deep face recognition for post-exam analysis. iExam assists invigilators by monitoring student presence during exams and identifies abnormal behaviors, such as face disappearance, face rotation, and identity substitution, from recorded videos. The system addresses three key challenges: (i) efficient real-time video capture and analysis, (ii) automated student identity labeling using enhanced OCR on dynamic Zoom name tags, and (iii) resource-efficient training and inference on standard teacher devices. Extensive experiments show that iExam achieves 90.4% accuracy in real-time face detection and 98.4% accuracy in post-exam recognition with low overhead, demonstrating its practicality and effectiveness for online exam proctoring.
[130] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models
Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: T2VAttack: First comprehensive study of adversarial attacks on Text-to-Video diffusion models, revealing vulnerabilities through semantic and temporal attacks via prompt manipulation.
Details
Motivation: Despite rapid advancements in Text-to-Video (T2V) diffusion models generating high-quality videos, their vulnerability to adversarial attacks remains largely unexplored, creating a critical security gap.
Method: Proposes T2VAttack with two attack objectives (semantic for video-text alignment, temporal for temporal dynamics) and two attack methods: T2VAttack-S (replaces critical words with synonyms via greedy search) and T2VAttack-I (iteratively inserts optimized words with minimal perturbation).
Result: Evaluation on state-of-the-art T2V models (ModelScope, CogVideoX, Open-Sora, HunyuanVideo) shows even minor prompt modifications (single word substitution/insertion) cause substantial degradation in semantic fidelity and temporal dynamics.
Conclusion: Current T2V diffusion models have critical vulnerabilities to adversarial attacks through prompt manipulation, highlighting the need for improved robustness in video generation systems.
Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.
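A hedged skeleton of the T2VAttack-S-style greedy search: `attack_score` and `synonyms` are hypothetical stand-ins for the paper's semantic/temporal objectives and candidate sets, not its actual interfaces.

```python
def greedy_synonym_attack(prompt, synonyms, attack_score, max_swaps=1):
    """Greedily replace single words with synonyms to maximize an attack
    objective (higher score = more degraded video output)."""
    words = prompt.split()
    best = attack_score(" ".join(words))
    for _ in range(max_swaps):
        candidate = None
        for i, w in enumerate(words):
            for syn in synonyms.get(w, []):
                trial = words[:i] + [syn] + words[i + 1:]
                s = attack_score(" ".join(trial))
                if s > best:                  # keep the single most damaging swap
                    best, candidate = s, (i, syn)
        if candidate is None:                 # no swap improves the objective
            break
        words[candidate[0]] = candidate[1]
    return " ".join(words)
```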
[131] HIDFlowNet: A Flow-Based Deep Network for Hyperspectral Image Denoising
Qizhou Wang, Li Pang, Xiangyong Cao, Zhiqiang Tian, Deyu Meng
Main category: cs.CV
TL;DR: HIDFlowNet: A flow-based hyperspectral image denoising network that learns conditional distributions to address ill-posed nature of HSI denoising and decouples low/high-frequency learning.
Details
Motivation: Existing deep learning methods for HSI denoising treat it as deterministic mapping, ignoring the ill-posed nature (multiple clean HSIs can produce same noisy image) and resulting in over-smoothing. They also fail to properly decouple low-frequency and high-frequency component learning.
Method: Proposes HIDFlowNet based on generative flow model with invertible decoder and conditional encoder. Invertible decoder uses stacked invertible conditional blocks (ICBs) to capture local high-frequency details. Conditional encoder uses down-sampling and transformers to extract global low-frequency information.
Result: Extensive experiments on simulated and real HSI datasets show HIDFlowNet achieves better or comparable results compared with state-of-the-art methods.
Conclusion: HIDFlowNet effectively addresses ill-posed nature of HSI denoising by learning conditional distributions and properly decoupling low/high-frequency learning, overcoming limitations of deterministic mapping approaches.
Abstract: Hyperspectral image (HSI) denoising is essentially ill-posed since a noisy HSI can be degraded from multiple clean HSIs. However, existing deep learning (DL)-based approaches only restore one clean HSI from the given noisy HSI with a deterministic mapping, thus ignoring the ill-posed issue and always resulting in an over-smoothing problem. Additionally, these DL-based methods often neglect that noise is part of the high-frequency component and their network architectures fail to decouple the learning of low-frequency and high-frequency. To alleviate these issues, this paper proposes a flow-based HSI denoising network (HIDFlowNet) to directly learn the conditional distribution of the clean HSI given the noisy HSI and thus diverse clean HSIs can be sampled from the conditional distribution. Overall, our HIDFlowNet is induced from the generative flow model and is comprised of an invertible decoder and a conditional encoder, which can explicitly decouple the learning of low-frequency and high-frequency information of HSI. Specifically, the invertible decoder is built by stacking a succession of invertible conditional blocks (ICBs) to capture the local high-frequency details. The conditional encoder utilizes down-sampling operations to obtain low-resolution images and uses transformers to capture correlations over a long distance so that global low-frequency information can be effectively extracted. Extensive experiments on simulated and real HSI datasets verify that our proposed HIDFlowNet can obtain better or comparable results compared with other state-of-the-art methods.
[132] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation
Yuang Jia, Jinlong Wang, Jiayi Zhao, Chunlam Li, Shunzhou Wang, Wei Gao
Main category: cs.CV
TL;DR: A novel view extrapolation method for autonomous driving that uses deformable 4D Gaussians and video diffusion models without expensive sensors or annotations.
Details
Motivation: Existing view extrapolation methods rely on expensive sensors (LiDAR) or labor-intensive annotations (3D bounding boxes, lane markings), limiting real-world deployment. The paper aims to achieve high-quality view extrapolation using only images and optional camera poses.
Method: 1) Estimate global static point cloud and per-frame dynamic point clouds from images/camera poses; 2) Fuse into unified representation; 3) Use deformable 4D Gaussian framework for scene reconstruction; 4) Train video diffusion model on degraded renders from initial 4D Gaussian model; 5) Iteratively refine progressively shifted Gaussian renderings using diffusion model; 6) Incorporate enhanced results back into 4DGS training; 7) Repeat until reaching target extrapolated viewpoints.
Result: The method produces higher-quality images at novel extrapolated viewpoints compared to baselines, demonstrating effective view extrapolation without expensive sensors or annotations.
Conclusion: The proposed approach enables effective view extrapolation in autonomous driving scenarios using only images and optional camera poses, overcoming limitations of prior methods that require expensive sensors or labor-intensive annotations.
Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model, and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.
[133] Anomaly detection in satellite imagery through temporal inpainting
Bertrand Rouet-Leduc, Claudia Hulbert
Main category: cs.CV
TL;DR: Deep learning inpainting model detects surface changes from satellite imagery by predicting expected appearance and identifying anomalies, outperforming traditional methods with 3x lower detection thresholds.
Details
Motivation: Surface change detection from satellite imagery is critical for disaster response and environmental monitoring but remains challenging due to atmospheric noise, seasonal variations, and sensor artifacts. Current methods lack sensitivity to detect subtle changes.
Method: Train an inpainting model based on SATLAS foundation model to reconstruct the last frame of Sentinel-2 time series from preceding acquisitions. Use globally distributed training data across diverse climate zones and land cover types. Detect changes by comparing prediction vs. observation discrepancies.
Result: Successfully detected earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, identifying a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Achieved detection thresholds approximately three times lower than baseline approaches.
Conclusion: Deep learning-based inpainting of satellite time series enables unprecedented sensitivity in surface change detection, providing a path toward automated, global-scale monitoring using freely available multi-spectral satellite data.
Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.
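The detection step reduces to scoring the prediction-observation discrepancy; a minimal sketch, assuming a simple z-score normalization (the paper's scoring may differ):

```python
import numpy as np

def anomaly_map(predicted, observed, band_axis=0):
    """Per-pixel anomaly score: discrepancy between the inpainted prediction
    and the actual acquisition. predicted/observed: (bands, H, W) stacks."""
    residual = np.abs(predicted - observed).mean(axis=band_axis)  # spectral mean error
    z = (residual - residual.mean()) / (residual.std() + 1e-8)    # standardize scores
    return z                                                       # threshold, e.g. z > 3

score = anomaly_map(np.random.rand(4, 64, 64), np.random.rand(4, 64, 64))
mask = score > 3.0   # flagged change pixels
```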
[134] GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention
Jun Ding, Shang Gao
Main category: cs.CV
TL;DR: GCA-ResUNet: A lightweight CNN with Grouped Coordinate Attention for efficient medical image segmentation, outperforming Transformers on benchmarks while maintaining computational efficiency.
Details
Motivation: U-Net CNNs struggle with long-range dependencies in multi-organ segmentation and low-contrast regions, while Transformers require heavy computation and large datasets, making them impractical for clinical deployment.
Method: Proposes GCA-ResUNet with plug-and-play Grouped Coordinate Attention module that decouples channel context into groups for semantic heterogeneity and uses direction-aware coordinate encoding to capture spatial dependencies along horizontal/vertical axes.
Result: Achieves Dice scores of 86.11% on Synapse and 92.64% on ACDC benchmarks, outperforming CNN and Transformer methods including Swin-UNet and TransUNet, with particular improvements on small anatomical structures.
Conclusion: GCA-ResUNet provides optimal balance between accuracy and efficiency, offering a practical, scalable solution for clinical deployment in resource-constrained environments.
Abstract: Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical environments. In this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet. In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.
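One plausible reading of the GCA module in PyTorch: coordinate attention (pooling along H and W separately) computed with grouped 1x1 convolutions so each channel group is gated independently. Group count and reduction ratio below are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class GroupedCoordinateAttention(nn.Module):
    """Sketch of grouped, direction-aware channel recalibration."""
    def __init__(self, channels, groups=4, reduction=8):
        super().__init__()
        mid = max(channels // reduction, groups)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, 1, groups=groups), nn.BatchNorm2d(mid), nn.ReLU())
        self.to_h = nn.Conv2d(mid, channels, 1, groups=groups)
        self.to_w = nn.Conv2d(mid, channels, 1, groups=groups)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        ph = x.mean(dim=3, keepdim=True)         # (B, C, H, 1) pooled along width
        pw = x.mean(dim=2, keepdim=True)         # (B, C, 1, W) pooled along height
        y = self.shared(torch.cat([ph, pw.transpose(2, 3)], dim=2))  # (B, mid, H+W, 1)
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.to_h(yh))                      # (B, C, H, 1) gate
        aw = torch.sigmoid(self.to_w(yw.transpose(2, 3)))      # (B, C, 1, W) gate
        return x * ah * aw                       # direction-aware recalibration

out = GroupedCoordinateAttention(32)(torch.randn(2, 32, 16, 16))
```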
[135] Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu
Main category: cs.CV
TL;DR: Zoomer is a visual prompting framework that improves multimodal LLMs’ accuracy on fine details by using region-adaptive attention and flexible token allocation, reducing hallucinations while cutting token usage.
Details
Motivation: Current MLLMs like GPT-4o, Gemini Pro, and Claude 3.5 often hallucinate in real-world scenarios, especially with small objects or fine spatial details, due to uniform downsampling and lack of region-adaptive attention.
Method: Zoomer integrates three components: (1) prompt-aware emphasis module to highlight relevant regions, (2) spatial-preserving orchestration schema to maintain object relationships, and (3) budget-aware strategy for adaptive token allocation between global context and local details.
Result: Extensive experiments on nine benchmarks with three commercial MLLMs show Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%.
Conclusion: Zoomer establishes a principled methodology for robust, resource-aware multimodal understanding in black-box settings where model internals are inaccessible.
Abstract: Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to adaptively allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.
[136] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation
Haotang Li, Zhenyu Qi, Hao Qin, Huanrui Yang, Sen He, Kebin Peng
Main category: cs.CV
TL;DR: GASeg is a self-supervised semantic segmentation framework that uses topological information to bridge appearance and geometry, addressing appearance ambiguities through differentiable box-counting and topological augmentation.
Details
Motivation: Self-supervised semantic segmentation methods often fail due to over-reliance on unstable appearance-based features like shadows, glare, and local textures. The paper aims to address appearance ambiguities by incorporating stable topological information.
Method: Proposes GASeg with three key components: 1) Differentiable Box-Counting (DBC) module that quantifies multi-scale topological statistics from geometric and appearance features, 2) Topological Augmentation (TopoAug) that applies morphological operators to simulate real-world ambiguities, and 3) GALoss that enforces cross-modal alignment between geometric and appearance features.
Result: Achieves state-of-the-art performance on four benchmarks including COCO-Stuff, Cityscapes, and PASCAL, validating the approach of bridging geometry and appearance via topological information.
Conclusion: The paper demonstrates that leveraging stable topological information can effectively address appearance ambiguities in self-supervised semantic segmentation, leading to improved performance across multiple benchmarks.
Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose GASeg, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is the Differentiable Box-Counting (DBC) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (TopoAug), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, GALoss, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.
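To illustrate how box counting can be made differentiable, here is one standard relaxation: replace the hard per-box occupancy test with a sigmoid and regress log-count against log-scale. This is a sketch of the idea under our own assumptions, not the paper's exact DBC module.

```python
import torch
import torch.nn.functional as F

def soft_box_dimension(act, scales=(2, 4, 8, 16), tau=0.5, temp=0.05):
    """Differentiable box-counting statistic. act: (B, 1, H, W) in [0, 1]."""
    log_counts = []
    for s in scales:
        occ = F.avg_pool2d(act, kernel_size=s, stride=s)   # mean activation per box
        soft = torch.sigmoid((occ - tau) / temp)           # soft "box occupied" in (0,1)
        log_counts.append(torch.log(soft.flatten(1).sum(1) + 1e-6))
    logN = torch.stack(log_counts, dim=1)                  # (B, S) log box counts
    logS = torch.log(1.0 / torch.tensor(scales, dtype=act.dtype))
    x = logS - logS.mean()                                 # regress log N on log(1/s)
    return (x * (logN - logN.mean(1, keepdim=True))).sum(1) / (x * x).sum()

dim = soft_box_dimension(torch.rand(2, 1, 64, 64))         # fractal-dimension-like score
```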
[137] Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge
Tae Ha Park, Simone D’Amico
Main category: cs.CV
TL;DR: Novel pipeline for 3D reconstruction of unknown spacecraft using 3D Gaussian Splatting with Sun position priors to handle dynamic space lighting conditions.
Details
Motivation: Traditional 3DGS requires static scenes, but spaceborne imagery during RPO has dynamic lighting conditions. Accurate photometric rendering is crucial for downstream pose estimation tasks in spacecraft rendezvous operations.
Method: Uses 3D Gaussian Splatting model with Sun position priors (estimated by servicer spacecraft) incorporated into training pipeline to improve photometric quality and handle changing illumination.
Result: 3DGS models learn to adapt to rapidly changing space illumination, capturing global shadowing and self-occlusion effects, improving photometric accuracy for pose estimation.
Conclusion: Incorporating Sun position knowledge enables effective 3D reconstruction of spacecraft in dynamic lighting conditions, supporting accurate pose estimation for RPO missions.
Abstract: This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target’s geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun’s position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.
[138] Bridging the Perception-Cognition Gap: Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis
Hao Wu, Hui Li, Yiyun Su
Main category: cs.CV
TL;DR: Hilbert-VLM: A two-stage fusion framework for 3D medical image analysis that combines Hilbert-enhanced SAM2 for lesion segmentation with VLM for disease classification, achieving state-of-the-art performance on BraTS2021.
Details
Motivation: VLMs show promise for medical diagnosis but struggle with 3D multimodal medical images due to ineffective information integration and oversight of subtle pathological features.
Method: Two-stage framework: 1) HilbertMed-SAM module with Hilbert space-filling curves in SAM2’s Mamba SSM to preserve 3D spatial locality, plus HMCA and scale-aware decoder for segmentation; 2) Prompt enhancement module unifies masks and textual attributes to guide VLM classification.
Result: Achieves 82.35% Dice score on BraTS2021 segmentation benchmark and 78.85% diagnostic classification accuracy, demonstrating superior performance for medical VLM-based analysis.
Conclusion: Hilbert-VLM significantly improves accuracy and reliability of medical VLM analysis by effectively integrating 3D multimodal information and capturing subtle pathological features through innovative Hilbert-space and prompt enhancement techniques.
Abstract: Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.
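The key ingredient is the Hilbert scan order itself. A small sketch using the third-party hilbertcurve package (an assumption; the paper's implementation is not specified here) shows how a 3D volume can be serialized so that voxels adjacent in space stay close in the token sequence.

```python
# pip install hilbertcurve
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

def hilbert_flatten(volume, order=4):
    """Flatten a cubic 3D volume into a 1D sequence along a Hilbert curve,
    preserving spatial locality for sequence models like Mamba.
    Assumes side length 2**order."""
    side = 2 ** order
    assert volume.shape == (side, side, side)
    hc = HilbertCurve(p=order, n=3)                  # order bits per axis, 3 dimensions
    coords = np.stack(np.meshgrid(*[np.arange(side)] * 3, indexing="ij"), -1)
    dist = np.array(hc.distances_from_points(coords.reshape(-1, 3).tolist()))
    return volume.reshape(-1)[np.argsort(dist)]      # voxels in Hilbert order

seq = hilbert_flatten(np.random.rand(16, 16, 16))    # (4096,) locality-preserving scan
```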
[139] On Exact Editing of Flow-Based Diffusion Models
Zixiang Li, Yue Song, Jianing Peng, Ting Liu, Jun Huang, Xiaochao Qu, Luoqi Liu, Wei Wang, Yao Zhao, Yunchao Wei
Main category: cs.CV
TL;DR: CVC is a flow-based diffusion editing framework that corrects velocity errors in latent trajectories using a dual-perspective velocity conversion mechanism and posterior-consistent updates for stable, faithful image editing.
Details
Motivation: Current flow-based diffusion editing methods suffer from accumulated velocity errors in latent trajectories, leading to semantic inconsistency and loss of structural fidelity during image transformations.
Method: Proposes Conditioned Velocity Correction (CVC) with dual-perspective velocity conversion: structure-preserving branch maintains source trajectory consistency, and semantically-guided branch drives controlled deviation toward target distribution. Uses posterior-consistent updates derived from Empirical Bayes Inference and Tweedie correction to compensate for velocity errors.
Result: CVC achieves stable and interpretable latent dynamics with faithful reconstruction and smooth local semantic conversion. Demonstrates superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks compared to existing methods.
Conclusion: CVC provides a principled framework for flow-based editing that addresses velocity error accumulation through mathematically grounded correction mechanisms, enabling high-quality image transformations with preserved structural integrity.
Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.
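For reference, the Tweedie correction the method builds on is the standard posterior-mean identity for Gaussian corruption; how CVC folds it into its conditional velocity field is specific to the paper.

```latex
% Tweedie's formula: posterior mean of a clean signal under Gaussian corruption.
% For an observation y = x + n with n ~ N(0, \sigma^2 I):
\hat{x} = \mathbb{E}[x \mid y] = y + \sigma^{2}\,\nabla_{y}\log p(y)
% In diffusion notation, with x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon:
\hat{x}_0 = \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log p_t(x_t)}{\sqrt{\bar\alpha_t}}
```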
[140] FitControler: Toward Fit-Aware Virtual Try-On
Lu Yang, Yicheng Liu, Yanan Li, Xiang Bai, Hao Lu
Main category: cs.CV
TL;DR: FitControler is a learnable plug-in for virtual try-on models that enables customized garment fit control by generating fit-aware layouts and injecting them into existing VTON systems.
Details
Motivation: Current VTON methods focus on garment detail rendering but neglect garment fit - how garments align with the body, which is crucial for holistic style coordination. There's a need for fit-aware VTON that can control garment fit according to user preferences.
Method: FitControler has two main components: 1) a fit-aware layout generator that redraws body-garment layouts using garment-agnostic representations, and 2) a multi-scale fit injector that delivers layout cues for layout-driven VTON. The system is trained on Fit4Men dataset containing 13,000 body-garment pairs with different fits.
Result: Extensive experiments show FitControler works with various VTON models and achieves accurate fit control. The method introduces two fit consistency metrics for evaluation and demonstrates effective integration as a plug-in module.
Conclusion: FitControler successfully addresses the gap in fit-aware virtual try-on by providing a learnable plug-in that enables customized garment fit control while maintaining compatibility with existing VTON architectures.
Abstract: Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style – garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.
[141] Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression
Huanxiong Liang, Yunuo Chen, Yicheng Pan, Sixian Wang, Jincheng Dai, Guo Lu, Wenjun Zhang
Main category: cs.CV
TL;DR: Proposes structure-guided allocation for 2D Gaussian Splatting to improve rate-distortion efficiency while maintaining fast decoding speed.
Details
Motivation: Existing 2DGS pipelines allocate representation capacity and parameter precision without considering image structure, limiting rate-distortion efficiency at low bitrates.
Method: 1) Structure-guided initialization assigning 2D Gaussians based on spatial structural priors; 2) Adaptive bitwidth quantization of covariance parameters (higher precision for complex regions); 3) Geometry-consistent regularization aligning Gaussian orientations with local gradient directions.
Result: Substantially improves representational power and RD performance while maintaining over 1000 FPS decoding. Reduces BD-rate by 43.44% on Kodak and 29.91% on DIV2K compared to baseline GSImage.
Conclusion: Structure-guided allocation principle effectively couples image structure with representation capacity and quantization precision, enabling improved rate-distortion efficiency in 2D Gaussian Splatting while preserving fast decoding speed.
Abstract: Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.
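A minimal sketch of the adaptive-bitwidth idea: uniform quantization whose level count varies per Gaussian. The complexity-to-bitwidth rule below is purely illustrative; the paper derives its allocation from scale and region complexity during RD-aware fine-tuning.

```python
import numpy as np

def quantize_adaptive(params, bits):
    """Uniform quantization with a per-element bitwidth: covariance parameters
    of small Gaussians in complex regions receive more bits."""
    lo, hi = params.min(), params.max()
    levels = 2.0 ** bits - 1                                 # per-element level count
    q = np.round((params - lo) / (hi - lo + 1e-12) * levels) / levels
    return q * (hi - lo) + lo                                # de-quantized values

scales = np.random.rand(1000)                                # per-Gaussian size proxy
bits = np.where(scales < 0.2, 8, 4)                          # finer Gaussians -> 8 bits
cov = np.random.randn(1000)                                  # covariance parameters
cov_q = quantize_adaptive(cov, bits)
```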
[142] FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing
Yunkai Dang, Donghao Wang, Jiacheng Yang, Yifan Jiang, Meiyi Zhu, Yuekun Yang, Cong Wang, Qi Fan, Wenbin Li, Yang Gao
Main category: cs.CV
TL;DR: MF-RSVLM is a multi-feature fusion vision-language model designed for remote sensing that addresses visual forgetting and fine-grained feature extraction challenges through multi-scale representations and recurrent visual injection.
Details
Motivation: Existing vision-language models struggle with remote sensing images due to differences from natural images, failing to extract fine-grained visual features and suffering from visual forgetting during language processing.
Method: Proposes MF-RSVLM with multi-scale visual representation learning, combining global context with local details, and a recurrent visual feature injection scheme to maintain visual grounding during language generation.
Result: Achieves state-of-the-art or highly competitive performance on diverse remote sensing benchmarks including classification, image captioning, and VQA tasks.
Conclusion: MF-RSVLM effectively addresses remote sensing vision-language challenges through multi-feature fusion and recurrent visual injection, demonstrating superior performance across multiple RS tasks.
Abstract: Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision–Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.
[143] RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations
Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, Wenqiang Zhang
Main category: cs.CV
TL;DR: RSAgent is an agentic MLLM that performs text-guided segmentation through multi-turn reasoning and tool interactions, achieving state-of-the-art performance on segmentation benchmarks.
Details
Motivation: Current text-guided segmentation methods treat it as one-shot grounding, which lacks verification, refocusing, and refinement capabilities when initial localization is wrong. There's a need for iterative reasoning and refinement in segmentation tasks.
Method: Proposes RSAgent, an agentic Multimodal Large Language Model that interleaves reasoning and action via multi-turn tool invocations. It queries a segmentation toolbox, observes visual feedback, and revises spatial hypotheses using historical observations. Uses a two-stage training framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with task-specific rewards.
Result: Achieves 66.5% gIoU on ReasonSeg test (9% improvement over Seg-Zero-7B) and 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
Conclusion: RSAgent’s agentic approach with multi-turn reasoning and tool interactions enables more accurate and robust text-guided segmentation compared to one-shot methods, showing the value of iterative refinement in segmentation tasks.
Abstract: Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.
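Schematically, the multi-turn loop looks like the following skeleton; `mllm`, `toolbox`, and the action schema are hypothetical interfaces sketched for illustration, not RSAgent's actual API.

```python
def rsagent_episode(mllm, toolbox, image, query, max_turns=4):
    """Multi-turn reason-then-act loop: the model emits either a tool call
    (e.g., a point/box prompt for the segmentor) or finalizes after observing
    the rendered mask."""
    history = [("user", query)]
    mask = None
    for _ in range(max_turns):
        step = mllm.generate(image, history)             # reasoning + proposed action
        if step["action"] == "finalize":                 # model accepts current mask
            break
        mask = toolbox[step["tool"]](image, **step["args"])  # e.g., segment(point=...)
        history.append(("tool", {"mask_overlay": mask}))     # visual feedback to revise
    return mask
```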
[144] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing
Mustafa Munir, Md Mostafijur Rahman, Kartikeya Bhardwaj, Paul Whatmough, Radu Marculescu
Main category: cs.CV
TL;DR: PipeFlow is a scalable video editing method that uses motion analysis to skip low-motion frames, pipelined scheduling for parallel processing, and neural interpolation to enable linear-time editing of long videos.
Details
Motivation: Long-form video editing faces computational challenges due to high costs from joint editing and DDIM inversion across extended sequences, making current methods inefficient for long videos.
Method: Three key innovations: 1) Motion analysis using SSIM and Optical Flow to skip editing of low-motion frames, 2) Pipelined task scheduling that splits videos into segments for parallel DDIM inversion and joint editing, 3) Neural network-based interpolation to smooth border frames and interpolate skipped frames.
Result: PipeFlow achieves linear scaling with video length, enabling editing of infinitely long videos without growing per-frame overhead. It achieves up to 9.6X speedup over TokenFlow and 31.7X speedup over Diffusion Motion Transfer.
Conclusion: PipeFlow provides a scalable solution for long-form video editing by addressing computational bottlenecks through motion-aware frame skipping, parallel processing, and intelligent interpolation, making long video editing practical and efficient.
Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow’s editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
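The frame-selection gate can be sketched with off-the-shelf SSIM and Farneback optical flow; the thresholds here are illustrative, and the paper's exact criteria may differ.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def is_low_motion(prev_gray, curr_gray, ssim_thresh=0.98, flow_thresh=0.5):
    """Motion gate in the spirit of PipeFlow's frame selection: a frame is a
    candidate to skip (edit reused/interpolated later) when it is nearly
    identical to its predecessor."""
    ssim = structural_similarity(prev_gray, curr_gray)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_mag = np.linalg.norm(flow, axis=2).mean()   # average pixel displacement
    return ssim > ssim_thresh and mean_mag < flow_thresh

a = np.random.randint(0, 255, (64, 64), np.uint8)
skip = is_low_motion(a, a.copy())                    # True: frame barely moved
```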
[145] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising
Xinran Qin, Yuhui Quan, Ruotao Xu, Hui Ji
Main category: cs.CV
TL;DR: A reinforcement learning-based anisotropic diffusion framework for image denoising that uses deep Q-learning to learn optimal diffusion action sequences, outperforming traditional diffusion methods and competing with deep CNN approaches.
Details
Motivation: Traditional anisotropic diffusion methods use explicit diffusion operators that are not well-adapted to complex image structures, limiting their performance compared to learning-based approaches. The authors aim to create a more adaptive diffusion-based denoiser.
Method: Proposes a trainable anisotropic diffusion framework using reinforcement learning. Models denoising as a series of naive diffusion actions with order learned by deep Q-learning, creating a stochastic anisotropic diffusion process adaptive to different image structures.
Result: The method outperforms existing diffusion-based methods and competes with representative deep CNN-based methods when removing three common types of noise.
Conclusion: The reinforcement learning approach enables adaptive anisotropic diffusion that effectively handles complex image structures, bridging the gap between traditional diffusion methods and modern learning-based approaches.
Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a broad family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed compose a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three common types of noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.
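To make the "diffusion actions ordered by deep Q-learning" idea concrete, here is a toy PyTorch sketch: the action set is a family of Perona-Malik-style diffusion steps with different edge-sensitivity parameters, and a Q-network (left abstract) greedily picks one per iteration. The action set, step size, and greedy rollout are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

def diffusion_step(img, kappa):
    """One explicit anisotropic (Perona-Malik style) diffusion step.
    img: (1, 1, H, W) tensor; kappa controls edge sensitivity."""
    pad = F.pad(img, (1, 1, 1, 1), mode='replicate')
    grads = [pad[:, :, 2:, 1:-1] - img,   # down
             pad[:, :, :-2, 1:-1] - img,  # up
             pad[:, :, 1:-1, 2:] - img,   # right
             pad[:, :, 1:-1, :-2] - img]  # left
    out = img.clone()
    for g in grads:
        c = torch.exp(-(g / kappa) ** 2)  # edge-stopping conductance
        out = out + 0.2 * c * g           # step size <= 0.25 for stability
    return out

# Hypothetical action set: the same primitive with different kappas.
ACTIONS = [0.05, 0.1, 0.2, 0.4]

def denoise(img, q_net, steps=10):
    """Greedy rollout: a Q-network scores the actions at each step and
    the best-scoring diffusion primitive is applied (epsilon-greedy
    exploration would be added during training)."""
    for _ in range(steps):
        q_values = q_net(img)              # (1, len(ACTIONS))
        kappa = ACTIONS[q_values.argmax().item()]
        img = diffusion_step(img, kappa)
    return img
```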
[146] Pathology Context Recalibration Network for Ocular Disease Recognition
Zunjie Xiao, Xiaoqing Zhang, Risa Higashita, Jiang Liu
Main category: cs.CV
TL;DR: PCRNet: A deep learning framework that incorporates pathology context and expert experience priors for improved ocular disease recognition with better interpretability.
Details
Motivation: Current DNNs for ocular disease diagnosis ignore clinical pathology context and expert experience priors, limiting both performance and interpretability of decision-making.
Method: Proposes PCRNet with two novel components: Pathology Recalibration Module (PRM) for pathology context prior via pixel-wise context compression and distribution concentration, and Expert Prior Guidance Adapter (EPGA) for highlighting significant regions using expert experience. Also introduces Integrated Loss (IL) considering sample-wise loss distributions and label frequencies.
Result: Extensive experiments on three ocular disease datasets show PCRNet with IL outperforms state-of-the-art attention-based networks and advanced loss methods.
Conclusion: PCRNet effectively incorporates clinical pathology context and expert experience priors to enhance ocular disease recognition performance and decision-making interpretability, with visualization analysis explaining the inherent behavior of the proposed modules.
Abstract: Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) achieve good ocular disease recognition results, they often fail to exploit the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of the pathology context prior via the combination of a well-designed pixel-wise context compression operator and a pathology distribution concentration operator; then this paper applies a novel Expert Prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into a modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.
[147] Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao
Main category: cs.CV
TL;DR: Proposes balanced hierarchical contrastive loss with decoupled learning in DETR to address hierarchical label imbalance and localization interference in fine-grained remote sensing detection.
Details
Motivation: Fine-grained remote sensing uses hierarchical labels, but embedding semantic hierarchy into representation learning is challenging. Previous contrastive learning approaches overlook two issues: (1) imbalanced data distribution across label hierarchy causing high-frequency classes to dominate learning, and (2) semantic relationship learning interfering with class-agnostic localization.
Method: Proposes balanced hierarchical contrastive loss with learnable class prototypes that equilibrates gradients from different classes at each hierarchical level. Uses decoupled learning strategy within DETR framework, separating object queries into classification and localization sets for task-specific feature extraction and optimization.
Result: Experiments on three fine-grained datasets with hierarchical annotations demonstrate that the method outperforms state-of-the-art approaches.
Conclusion: The proposed balanced hierarchical contrastive loss with decoupled learning effectively addresses hierarchical label imbalance and localization interference, improving fine-grained detection performance in remote sensing applications.
Abstract: Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR’s object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.
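A minimal sketch of the balanced, prototype-based contrastive idea at a single hierarchy level, assuming learnable class prototypes and inverse in-batch-frequency weighting; the paper applies such a loss at every hierarchical level and its exact gradient-equilibration scheme may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedPrototypeLoss(nn.Module):
    """Contrastive loss against learnable class prototypes, with each
    class's contribution reweighted by inverse in-batch frequency so
    that frequent classes do not dominate the gradient. Illustrative
    single-level version only."""
    def __init__(self, num_classes, dim, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.t = temperature

    def forward(self, feats, labels):
        feats = F.normalize(feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        logits = feats @ protos.T / self.t                    # (B, C)
        per_sample = F.cross_entropy(logits, labels, reduction='none')
        # Give every class present in the mini-batch equal total weight.
        counts = torch.bincount(labels, minlength=logits.size(1)).float()
        weights = 1.0 / counts.clamp(min=1)[labels]
        num_present = counts.clamp(max=1).sum()
        return (weights * per_sample).sum() / num_present
```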
[148] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention
Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao, Wangli Lan, Jing Lin, Zhixin Ma, Tingting Zhou, Harry Yang
Main category: cs.CV
TL;DR: RainFusion2.0 is a hardware-efficient sparse attention mechanism that accelerates Diffusion Transformer models for video/image generation by achieving 80% sparsity with 1.5-1.8x speedup while maintaining quality across diverse hardware platforms.
Details
Motivation: DiT models have extremely high computational costs due to attention mechanisms, limiting practical applications. Existing sparse attention methods have overhead issues and lack hardware generality, being mostly GPU-focused despite increasing adoption of diverse hardware like ASICs.
Method: Proposes RainFusion2.0 with three key techniques: (1) using block-wise mean values as representative tokens for sparse mask prediction, (2) implementing spatiotemporal-aware token permutation, and (3) introducing a first-frame sink mechanism specifically for video generation scenarios.
Result: Achieves 80% sparsity with 1.5-1.8x end-to-end speedup without compromising video quality. Demonstrates effectiveness across various generative models and validates generalization across diverse hardware platforms.
Conclusion: RainFusion2.0 successfully addresses computational bottlenecks in DiT models through an online adaptive, hardware-efficient sparse attention mechanism that works across multiple hardware platforms while maintaining generation quality.
Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing units (GPUs), such as application-specific integrated circuits (ASICs), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPUs. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 achieves 80% sparsity while delivering an end-to-end speedup of 1.5-1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.
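The block-wise representative-token trick can be sketched in a few lines: mean-pool queries and keys into blocks, score block pairs, and keep only the top-scoring key blocks per query block. The block size and keep ratio below are illustrative, not the paper's values.

```python
import torch

def block_sparse_mask(q, k, block=64, keep_ratio=0.2):
    """Predict a block-wise attention mask from block-mean
    'representative' tokens. q, k: (seq, dim); seq is assumed divisible
    by `block`. Returns a boolean (nq_blocks, nk_blocks) mask where
    True means 'compute this block pair'."""
    qb = q.view(-1, block, q.size(-1)).mean(dim=1)   # (nq_blocks, dim)
    kb = k.view(-1, block, k.size(-1)).mean(dim=1)   # (nk_blocks, dim)
    scores = qb @ kb.T                               # coarse block affinity
    k_keep = max(1, int(keep_ratio * kb.size(0)))
    topk = scores.topk(k_keep, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    return mask
```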
[149] Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
Yijie Qian, Juncheng Wang, Yuxiang Feng, Chao Xu, Wang Lu, Yang Liu, Baigui Sun, Yiqiang Chen, Yong Liu, Shujun Wang
Main category: cs.CV
TL;DR: LMR introduces a two-stage “Think-then-Act” architecture for text-to-motion generation, addressing the semantic-kinematic impedance mismatch by separating reasoning and execution in latent space.
Details
Motivation: Current T2M approaches face a fundamental bottleneck called Semantic-Kinematic Impedance Mismatch - the difficulty of directly mapping discrete linguistic intent to continuous, high-frequency motion data in a single step.
Method: Proposes Latent Motion Reasoning (LMR) with a Dual-Granularity Tokenizer that disentangles motion into two latent manifolds: Reasoning Latent for global planning and Execution Latent for physical fidelity. Uses a two-stage autoregressive process: reason (plan) then act (execute).
Result: LMR shows non-trivial improvements in both semantic alignment and physical plausibility when implemented on two representative baselines (T2M-GPT and MotionStreamer), validating that motion-aligned concept space is better than natural language for motion planning.
Conclusion: The optimal substrate for motion planning is not natural language but a learned, motion-aligned concept space. Architectural shift to latent System 2 reasoning effectively bridges the gap between language and physics.
Abstract: Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Code and demos can be found at https://chenhaoqcdyq.github.io/LMR/
[150] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks
Yongtao Chen, Yanbo Wang, Wentao Zhao, Guole Shen, Tianchen Deng, Jingchuan Wang
Main category: cs.CV
TL;DR: Training-free generative adversarial attack framework using diffusion models to create naturalistic adversarial objects that fool monocular depth estimation in autonomous driving systems.
Details
Motivation: Existing physical attacks on monocular depth estimation rely on texture-based patches with strict placement constraints and limited realism, reducing effectiveness in complex driving environments. Errors in depth estimation can propagate through downstream decision making and impact traffic safety.
Method: Training-free generative adversarial attack framework with: 1) Salient Region Selection module to identify regions most influential to MDE, and 2) Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by pre-trained diffusion model. Uses diffusion-based conditional generation to create scene-consistent adversarial objects.
Result: Method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability. Generates physically plausible adversarial objects capable of inducing substantial adversarial depth shifts.
Conclusion: The framework demonstrates strong practical implications for autonomous driving safety assessment by creating more realistic and effective adversarial attacks that can better evaluate system vulnerabilities.
Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.
[151] GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Zijun Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, Junchi Yan
Main category: cs.CV
TL;DR: GeoBench is a new hierarchical benchmark for evaluating geometric reasoning in vision-language models, addressing limitations of existing benchmarks through four reasoning levels and six formally verified tasks.
Details
Motivation: Current evaluations of geometric reasoning in VLMs have limitations: risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity.
Method: Created GeoBench with four hierarchical reasoning levels (Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, Self-Reflective Backtracking) and six formally verified tasks generated via TrustGeoGen to systematically assess capabilities from attribute extraction to logical error correction.
Result: Reasoning models like OpenAI-o3 outperform general MLLMs, but performance declines significantly with increasing task complexity. Sub-goal decomposition and irrelevant premise filtering critically influence final accuracy, while Chain-of-Thought prompting unexpectedly degrades performance in some tasks.
Conclusion: GeoBench establishes a comprehensive benchmark for geometric problem-solving and offers actionable guidelines for developing geometric reasoning systems, highlighting the importance of structured reasoning approaches over simple prompting techniques.
Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
[152] Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design
Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte
Main category: cs.CV
TL;DR: This paper introduces FSAP (Few-Shot Architecture Prompting) for LLM-based neural architecture generation in computer vision, finding n=3 examples optimal, and proposes Whitespace-Normalized Hash Validation for efficient deduplication, validated across 7 vision benchmarks.
Details
Motivation: Automated neural architecture design is challenging due to task diversity and computational constraints. LLMs offer a promising alternative to expensive NAS methods, but their application to computer vision architecture generation lacks systematic study, particularly regarding prompt engineering and validation strategies.
Method: 1) Introduces Few-Shot Architecture Prompting (FSAP) - systematic study of supporting examples (n=1-6) for LLM-based architecture generation. 2) Proposes Whitespace-Normalized Hash Validation - lightweight deduplication method (<1ms) for preventing redundant training. 3) Uses dataset-balanced evaluation methodology for comparing architectures across heterogeneous vision tasks.
Result: Found that n=3 examples best balance architectural diversity and context focus for vision tasks. Generated 1,900 unique architectures across 7 computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365). The validation method provides a 100x speedup over AST parsing.
Conclusion: Provides actionable guidelines for LLM-based architecture search in computer vision and establishes rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
Abstract: Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
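Whitespace-normalized hash validation is simple enough to sketch directly: collapse whitespace runs, hash, and test set membership. Function names are hypothetical, and the paper's exact normalization rules may differ, but this captures why the check runs in well under a millisecond.

```python
import hashlib
import re

def arch_fingerprint(code: str) -> str:
    """Collapse all whitespace runs to a single space before hashing,
    so two generated architectures that differ only in formatting map
    to the same key."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_new_architecture(code: str) -> bool:
    """O(1) membership test replaces expensive AST-level comparison."""
    fp = arch_fingerprint(code)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```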
[153] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
Main category: cs.CV
TL;DR: The paper addresses Preference Mode Collapse (PMC) in text-to-image diffusion models aligned via RLHF, where models converge on narrow high-scoring outputs, degrading diversity. It introduces DivGenBench to measure PMC and proposes D²-Align framework that directionally corrects reward signals to maintain diversity while aligning with human preferences.
Details
Motivation: Existing RLHF methods for text-to-image diffusion models achieve high automated reward scores but suffer from Preference Mode Collapse - a form of reward hacking where models converge on narrow, high-scoring outputs (monolithic styles, overexposure), severely degrading generative diversity.
Method: Proposes Directional Decoupling Alignment (D²-Align): 1) Learns a directional correction within the reward model's embedding space while keeping the model frozen, 2) Applies this correction to the reward signal during optimization to prevent mode collapse and maintain diversity. Also introduces DivGenBench benchmark to quantify PMC.
Result: Comprehensive evaluation combining qualitative analysis with quantitative metrics for both quality and diversity shows that D²-Align achieves superior alignment with human preference while maintaining diversity, effectively mitigating Preference Mode Collapse.
Conclusion: The paper successfully identifies and quantifies Preference Mode Collapse in RLHF-aligned text-to-image models, proposes a novel benchmark (DivGenBench) to measure it, and introduces D²-Align framework that directionally corrects reward signals to prevent mode collapse while maintaining alignment with human preferences.
Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D²-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D²-Align achieves superior alignment with human preference.
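One way to picture the directional correction, sketched under heavy assumptions: learn a unit direction in the frozen reward model's embedding space and subtract each sample's projection onto it from the reward. The penalty form and the training signal for the direction are hypothetical; the paper does not spell them out here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalRewardCorrection(nn.Module):
    """Toy version of the directional-correction idea: with the reward
    model frozen, learn a unit direction d in its embedding space and
    penalize the component of each image embedding lying along d (read
    here as the reward model's bias direction). Purely illustrative."""
    def __init__(self, dim, strength=1.0):
        super().__init__()
        self.direction = nn.Parameter(torch.randn(dim))
        self.strength = strength

    def forward(self, reward, embedding):
        # reward: (B,); embedding: (B, dim) from the frozen reward model
        d = F.normalize(self.direction, dim=0)
        bias_component = embedding @ d        # (B,) projection onto d
        return reward - self.strength * bias_component
```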
[154] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
TsaiChing Ni, ZhenQi Chen, YuanFu Yang
Main category: cs.CV
TL;DR: IMDD-1M is a large-scale industrial multimodal defect dataset with 1M image-text pairs, used to train a diffusion-based vision-language foundation model for efficient industrial inspection and generation tasks.
Details
Motivation: There's a need for large-scale multimodal datasets and foundation models specifically designed for industrial defect detection and quality inspection, which can enable more efficient and scalable manufacturing intelligence.
Method: Created IMDD-1M dataset with 1M aligned image-text pairs covering 60+ material categories and 400+ defect types, then trained a diffusion-based vision-language foundation model from scratch tailored for industrial scenarios.
Result: The foundation model achieves comparable performance to dedicated expert models using less than 5% of task-specific data, demonstrating efficient adaptation through lightweight fine-tuning for various industrial applications.
Conclusion: IMDD-1M and the trained foundation model enable scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence, paving the way for data-efficient industrial inspection and generation systems.
Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
[155] Bayesian Self-Distillation for Image Classification
Anton Adelöw, Matteo Gamba, Atsuto Maki
Main category: cs.CV
TL;DR: BSD uses Bayesian inference on model’s own predictions to create sample-specific soft targets, eliminating hard targets after initialization, improving accuracy, calibration, and robustness.
Details
Motivation: Hard targets in supervised training cause overconfidence, poor calibration, and limited generalization/robustness. Existing self-distillation methods still rely on hard targets, reducing their effectiveness.
Method: Bayesian Self-Distillation (BSD) constructs sample-specific target distributions via Bayesian inference using the model's own predictions, eliminating dependence on hard targets after initialization.
Result: BSD consistently improves test accuracy (+1.4% ResNet-50 on CIFAR-100) and reduces Expected Calibration Error (-40% ResNet-50, CIFAR-100) across architectures/datasets. Also improves robustness against data corruptions, perturbations, and label noise.
Conclusion: BSD provides a principled approach to self-distillation that eliminates hard target dependence, achieving better accuracy, calibration, and robustness than existing methods, with state-of-the-art label noise robustness when combined with contrastive loss.
Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model’s own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model’s own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
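One plausible reading of "sample-specific targets via Bayesian inference on the model's own predictions" is a Dirichlet pseudo-count update, sketched below. This toy version, including the choice to seed with the hard label only at initialization, is an assumption, not the paper's exact construction.

```python
import torch

class BayesianTargets:
    """Per-sample Dirichlet pseudo-counts updated with the model's own
    softmax predictions; the soft target used by the loss is the
    posterior mean. A toy reading of BSD; the paper's update may differ."""
    def __init__(self, num_samples, num_classes, labels, prior=1.0):
        self.alpha = torch.full((num_samples, num_classes), prior)
        # Hard labels enter only once, as initial evidence.
        self.alpha[torch.arange(num_samples), labels] += 1.0

    def update(self, idx, probs):
        # Accumulate the model's current beliefs as new evidence.
        self.alpha.index_add_(0, idx, probs.detach())

    def targets(self, idx):
        a = self.alpha[idx]
        return a / a.sum(dim=-1, keepdim=True)  # posterior mean target
```

Training would then minimize cross-entropy (or KL divergence) against `targets(idx)` instead of the one-hot labels.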
[156] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng
Main category: cs.CV
TL;DR: DiffThinker introduces a diffusion-based generative multimodal reasoning framework that treats reasoning as an image-to-image task, outperforming text-centric MLLMs in vision-centric tasks.
Details
Motivation: Current Multimodal Large Language Models (MLLMs) have text-centric reasoning processes that perform poorly on complex long-horizon, vision-centric tasks, creating a need for native visual reasoning approaches.
Method: DiffThinker establishes a Generative Multimodal Reasoning paradigm using diffusion models to reformulate multimodal reasoning as a native generative image-to-image task, enabling direct visual reasoning.
Result: DiffThinker significantly outperforms leading models: +314.2% over GPT-5, +111.6% over Gemini-3-Flash, and +39.0% over fine-tuned Qwen3-VL-32B across four domains (sequential planning, combinatorial optimization, constraint satisfaction, spatial configuration).
Conclusion: Generative multimodal reasoning via diffusion models represents a promising approach for vision-centric reasoning, offering superior logical consistency, spatial precision, and core properties including efficiency, controllability, native parallelism, and collaboration.
Abstract: While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
[157] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges
Yu-Tang Chang, Pin-Wei Chen, Shih-Fang Chen
Main category: cs.CV
TL;DR: DGC is a memory-efficient HSI segmentation framework that learns global clustering from local patches without pre-training, enabling fast training on consumer hardware but suffers from optimization instability due to loss balancing issues.
Details
Motivation: HSI analysis faces computational bottlenecks from massive data volumes, and existing foundation models pre-trained on remote sensing data fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures and spatial scales differ fundamentally.
Method: Deep Global Clustering (DGC) learns global clustering structure from local patch observations without pre-training, operating on small patches with overlapping regions to enforce consistency, maintaining constant memory usage and enabling training in under 30 minutes on consumer hardware.
Result: On a leaf disease dataset, DGC achieves background-tissue separation with mean IoU 0.925 and demonstrates unsupervised disease detection through navigable semantic granularity, but suffers from optimization instability due to cluster over-merging in feature space.
Conclusion: The DGC framework has merit as intellectual scaffolding with a promising design philosophy for memory-efficient HSI segmentation, but stable implementation requires principled approaches to dynamic loss balancing to address optimization instability issues.
Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at https://github.com/b05611038/HSI_global_clustering.
[158] Guiding a Diffusion Transformer with the Internal Dynamics of Itself
Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu
Main category: cs.CV
TL;DR: Internal Guidance (IG) improves diffusion model training by adding auxiliary supervision on intermediate layers and extrapolating outputs during sampling, achieving state-of-the-art FID scores on ImageNet.
Details
Motivation: Standard diffusion models struggle with low-probability areas and require guidance strategies like CFG, which can cause over-simplified or distorted samples. Existing guidance methods have limitations like needing carefully designed degradation strategies, extra training, or additional sampling steps.
Method: Proposes Internal Guidance (IG) which adds auxiliary supervision on intermediate layers during training, then extrapolates intermediate and deep layer outputs during sampling to obtain generative results. This is a simple strategy that doesn't require complex degradation strategies or extra sampling steps.
Result: Significant improvements in training efficiency and generation quality across various baselines. On ImageNet 256x256: SiT-XL/2+IG achieves FID=5.31 (80 epochs) and FID=1.75 (800 epochs). LightningDiT-XL/1+IG achieves FID=1.34, and combined with CFG achieves state-of-the-art FID=1.19.
Conclusion: Internal Guidance is a simple yet effective strategy that enhances diffusion model performance without complex modifications, achieving state-of-the-art results on ImageNet generation tasks.
Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with a degraded version of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during the training process and extrapolates the intermediate and deep layers' outputs to obtain generative results during the sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.
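The sampling-time extrapolation itself reduces to one line, sketched here with the intermediate-layer prediction playing the "weak model" role in a CFG-style update; the guidance scale is illustrative, not a value from the paper.

```python
def internal_guidance(inter_out, deep_out, scale=1.5):
    """Extrapolate from the auxiliary (intermediate-layer) prediction
    toward the deep-layer prediction. With scale > 1 this pushes the
    output past the deep prediction, away from the weaker intermediate
    one, mirroring classifier-free-guidance-style extrapolation."""
    return inter_out + scale * (deep_out - inter_out)
```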
[159] PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds
Pieter M. Blok, Haozhou Wang, Hyun Kwon Suh, Peicheng Wang, James Burridge, Wei Guo
Main category: cs.CV
TL;DR: PointRAFT is a high-throughput point cloud regression network that directly predicts potato tuber weight from partial RGB-D point clouds, overcoming self-occlusion issues without full 3D reconstruction.
Details
Motivation: Potato yield estimation using RGB-D cameras on harvesters is hindered by incomplete point clouds due to self-occlusion, leading to systematic underestimation of tuber weight. Current methods struggle with partial data and high-throughput requirements.
Method: PointRAFT directly predicts continuous 3D shape properties (like tuber weight) from partial point clouds using a novel object height embedding that incorporates tuber height as additional geometric cue. It bypasses full 3D reconstruction and works directly on raw 3D data.
Result: On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved MAE of 12.0g and RMSE of 17.2g, outperforming linear regression and PointNet++ baselines. With 6.3ms inference time per point cloud, it supports 150 tubers/second processing.
Conclusion: PointRAFT provides an accurate, high-throughput solution for potato weight estimation on commercial harvesters and offers a versatile regression framework for various 3D phenotyping and robotic perception tasks.
Abstract: Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at https://github.com/pieterblok/pointraft.git.
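A toy PyTorch version of the height-embedding idea: max-pooled per-point features are concatenated with an embedded scalar height before the regression head. The layer widths and the PointNet-style encoder are placeholders, not PointRAFT's actual architecture.

```python
import torch
import torch.nn as nn

class HeightEmbedPointRegressor(nn.Module):
    """Point-cloud regressor with an object-height embedding: per-point
    features are max-pooled into a global descriptor, the scalar object
    height is embedded and concatenated, and an MLP head regresses the
    target (here, weight). Sizes are illustrative only."""
    def __init__(self, dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
        self.height_embed = nn.Sequential(nn.Linear(1, 32), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(dim + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, points, height):
        # points: (B, N, 3); height: (B, 1) in consistent units
        feats = self.point_mlp(points).max(dim=1).values  # (B, dim)
        h = self.height_embed(height)                     # (B, 32)
        return self.head(torch.cat([feats, h], dim=-1))   # (B, 1)
```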
[160] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers
Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim
Main category: cs.CV
TL;DR: CorGi is a training-free inference acceleration framework for Diffusion Transformers that reduces redundant computation by caching and reusing low-contribution transformer blocks across denoising steps, achieving up to 2.0x speedup while maintaining generation quality.
Details
Motivation: Diffusion Transformers (DiT) have high inference costs due to iterative denoising processes combined with large model capacity, and recent research shows substantial redundant computation across denoising steps that can be optimized.
Method: CorGi uses contribution-guided block-wise interval caching to selectively reuse outputs of low-contribution transformer blocks across denoising steps. CorGi+ extends this for text-to-image tasks by leveraging cross-attention maps to identify salient tokens and applying partial attention updates to protect important object details.
Result: Evaluation on state-of-the-art DiT models shows CorGi and CorGi+ achieve up to 2.0x speedup on average while preserving high generation quality.
Conclusion: The proposed training-free acceleration framework effectively reduces redundant computation in DiT inference through smart caching and selective reuse strategies, making DiT models more practical for real-world applications without sacrificing quality.
Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
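Structurally, block-wise interval caching can be sketched as below: every block recomputes at the first step of each interval, while blocks flagged as low-contribution reuse their cached (stale) outputs on the remaining steps. The interval length and the offline contribution analysis are assumptions, and real implementations may cache residual deltas rather than raw outputs.

```python
def run_blocks_with_cache(blocks, x, step, cache, interval=4, low_contrib=()):
    """Run a DiT block stack at denoising `step`. On the first step of
    each interval all blocks execute and their outputs are cached; on
    the other steps, blocks listed in `low_contrib` (indices assumed to
    come from a contribution analysis) reuse their cached output."""
    fresh = (step % interval == 0)
    for i, block in enumerate(blocks):
        if fresh or i not in low_contrib or i not in cache:
            x = block(x)
            cache[i] = x   # refresh this block's cached output
        else:
            x = cache[i]   # reuse: this block's computation is skipped
    return x
```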
[161] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19
Sina Jahromi, Farshid Hajati, Alireza Rezaee, Javaher Nourian
Main category: cs.CV
TL;DR: A progressive GAN with weighted synthetic data augmentation and meta-heuristic optimization achieves high accuracy on imbalanced COVID-19 chest X-ray classification.
Details
Motivation: Medical image classification faces severe class imbalance issues, especially during pandemics like COVID-19 where infected cases are rare compared to normal cases. This imbalance hinders AI/ML methods that require balanced data for accurate disease detection.
Method: Proposes a progressive generative adversarial network (GAN) to generate synthetic data for minority classes. Uses weighted combination of synthetic and real data before feeding to deep classifier. Employs multi-objective meta-heuristic population-based optimization to tune classifier hyperparameters.
Result: Achieves 95.5% accuracy for 4-class and 98.5% accuracy for 2-class imbalanced COVID-19 chest X-ray classification, outperforming existing methods on cross-validated metrics.
Conclusion: The proposed model effectively addresses medical image classification with imbalanced data during pandemics, demonstrating superior performance through synthetic data generation and optimization techniques.
Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.
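The weighted real/synthetic combination amounts to a convex mix of two loss terms, as in the sketch below. The 0.7 weight is illustrative; per the paper, such hyperparameters would be tuned by the meta-heuristic optimizer rather than fixed by hand.

```python
def weighted_mixed_loss(loss_fn, model, real_batch, synth_batch, w_real=0.7):
    """Convex combination of losses on real samples and on GAN-generated
    minority-class samples. real_batch/synth_batch are (inputs, labels)
    pairs; w_real trades off trust in real vs. synthetic data."""
    xr, yr = real_batch
    xs, ys = synth_batch
    loss_real = loss_fn(model(xr), yr)
    loss_synth = loss_fn(model(xs), ys)
    return w_real * loss_real + (1.0 - w_real) * loss_synth
```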
[162] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation
Ziquan Liu, Zhewei Zhu, Xuyang Shi
Main category: cs.CV
TL;DR: ARM is a lightweight learnable module that refines CLIP’s internal features for open-vocabulary semantic segmentation, trained once on COCO-Stuff and used as a universal plug-and-play post-processor for training-free methods.
Details
Motivation: Existing training-free OVSS methods are either computationally expensive (relying on external models like SAM/DINO) or sub-optimal (using static heuristics on CLIP features). CLIP's image-level representations lack pixel-level details needed for segmentation.
Method: ARM uses a semantically-guided cross-attention block that fuses hierarchical CLIP features: deep features (K,V) select and refine detail-rich shallow features (Q), followed by self-attention. It follows a “train once, use anywhere” paradigm.
Result: ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
Conclusion: ARM effectively unlocks CLIP’s internal potential for segmentation through adaptive feature fusion, offering a lightweight, universal solution that outperforms existing training-free methods without computational burden.
Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a “train once, use anywhere” paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
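The core ARM block maps naturally onto standard attention primitives, as in this sketch: deep CLIP features provide keys and values that refine shallow-feature queries, followed by self-attention. The dimensions, head count, and the absence of norms and residuals are simplifications of whatever the real module uses.

```python
import torch.nn as nn

class AttentionRefinement(nn.Module):
    """Sketch of the ARM idea: robust deep features supply keys/values
    that select and refine detail-rich shallow features acting as
    queries, followed by a self-attention pass."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, shallow, deep):
        # shallow, deep: (B, N, dim) token sequences from CLIP layers
        x, _ = self.cross(query=shallow, key=deep, value=deep)
        x, _ = self.self_attn(x, x, x)
        return x
```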
[163] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes
Shuyun Wang, Haiyang Sun, Bing Wang, Hangjun Ye, Xin Yu
Main category: cs.CV
TL;DR: Mirage is a one-step video diffusion model for photorealistic and temporally coherent asset editing in driving scenes, addressing fidelity and alignment issues through 2D-3D feature fusion and two-stage data alignment.
Details
Motivation: Vision-centric autonomous driving needs diverse training data, but existing video object editing methods struggle with maintaining both visual fidelity and temporal coherence for effective data augmentation.
Method: 1) Builds on text-to-video diffusion prior for temporal consistency; 2) Injects temporally agnostic latents from pretrained 2D encoder into 3D decoder to restore detail while preserving causal structures; 3) Introduces two-stage data alignment combining coarse 3D alignment and fine 2D refinement to address distribution mismatch between scene objects and inserted assets.
Result: Extensive experiments show Mirage achieves high realism and temporal consistency across diverse editing scenarios, and can generalize to other video-to-video translation tasks.
Conclusion: Mirage provides a reliable baseline for photorealistic and coherent asset editing in driving scenes, with potential applications beyond asset editing to other video-to-video translation tasks in autonomous driving research.
Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose Mirage, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.
[164] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model
Rahul Medicharla, Alper Yilmaz
Main category: cs.CV
TL;DR: MotivNet is a generalizable facial emotion recognition model that uses Meta’s Sapiens vision foundation model as backbone, achieving competitive cross-domain performance without cross-domain training.
Details
Motivation: Current FER models have weak generalization on diverse real-world data, requiring cross-domain training which is impractical for real-world applications. There's a need for models that generalize well without such training.
Method: Uses Meta's Sapiens (human vision foundational model with Masked Autoencoder pretraining) as backbone. MotivNet is added as downstream task to Sapiens. Three evaluation criteria defined: benchmark performance, model similarity, and data similarity.
Result: MotivNet achieves competitive performance across datasets without cross-domain training, meets all three evaluation criteria, and validates as a viable Sapiens downstream task.
Conclusion: MotivNet demonstrates strong generalization across domains, making FER more practical for real-world applications. The approach validates using vision foundation models for emotion recognition tasks.
Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require cross-domain training to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more viable for in-the-wild application. The code is available at https://github.com/OSUPCVLab/EmotionFromFaceImages.
[165] MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation
Fuqiang Gu, Yuanke Li, Xianlei Long, Kangping Ji, Chao Chen, Qingyi Gu, Zhenliang Ni
Main category: cs.CV
TL;DR: MambaSeg is a dual-branch semantic segmentation framework using parallel Mamba encoders for RGB and event data fusion, with a Dual-Dimensional Interaction Module for spatial and temporal alignment.
Details
Motivation: RGB-based segmentation degrades under challenging conditions (fast motion, low-light, HDR), while event cameras alone lack color/texture. Existing multimodal fusion approaches are computationally expensive and neglect temporal dynamics of event streams.
Method: Proposes MambaSeg with parallel Mamba encoders for RGB and event streams. Introduces Dual-Dimensional Interaction Module (DDIM) with Cross-Spatial Interaction Module (CSIM) and Cross-Temporal Interaction Module (CTIM) for fine-grained spatial and temporal fusion.
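The cross-spatial half of such an interaction can be pictured as bidirectional cross-attention between the two modalities' token streams. The sketch below is a generic stand-in under that assumption; the actual CSIM/CTIM internals are not specified in the summary.

```python
import torch
import torch.nn as nn

class CrossSpatialInteraction(nn.Module):
    """Hedged sketch of cross-modal spatial fusion in the spirit of CSIM:
    each modality attends to the other over spatial positions. The real
    DDIM design in the paper may differ substantially."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.rgb_from_evt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.evt_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, evt: torch.Tensor):
        # rgb, evt: (B, H*W, C) token sequences from the two encoders
        rgb_out, _ = self.rgb_from_evt(rgb, evt, evt)  # RGB queries event keys/values
        evt_out, _ = self.evt_from_rgb(evt, rgb, rgb)
        return rgb + rgb_out, evt + evt_out            # residual fusion
```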
Result: Achieves state-of-the-art segmentation performance on DDD17 and DSEC datasets while significantly reducing computational cost compared to existing methods.
Conclusion: MambaSeg demonstrates promise for efficient, scalable, and robust multimodal perception by effectively leveraging complementary properties of RGB and event modalities through spatial-temporal fusion.
Abstract: Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.
[166] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT
Zhi Li, Yaqi Wang, Bingtao Ma, Yifan Zhang, Huiyu Zhou, Shuai Wang
Main category: cs.CV
TL;DR: PGMP framework for dental CBCT metal artifact reduction uses physics-based simulation for training data, deterministic manifold projection for fast inference, and medical foundation model priors for clinical reliability.
Details
Motivation: Current deep learning methods for metal artifact reduction in dental CBCT have limitations: supervised methods cause spectral blurring from regression-to-the-mean, unsupervised methods risk structural hallucinations, and diffusion models are too slow for clinical use due to iterative sampling.
Method: Three components: 1) Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs using Monte Carlo spectral modeling and patient-specific digital twins; 2) DMP-Former adapts direct x-prediction paradigm for deterministic manifold projection in single forward pass; 3) Semantic-Structural Alignment (SSA) module uses medical foundation model priors (MedDINOv3) to ensure clinical plausibility.
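The direct x-prediction idea reduces to supervised one-step restoration: train a network to map the artifact-corrupted volume straight to the clean volume, so inference needs no iterative sampling. A minimal training step, assuming an L1 objective and ignoring the paper's SSA and physics terms:

```python
import torch
import torch.nn.functional as F

def dmp_training_step(model, artifact_ct, clean_ct, optimizer):
    """One supervised step of direct x-prediction: the network maps the
    metal-artifact volume directly to an estimate of the clean volume,
    making inference a single deterministic forward pass. The L1 loss
    is an assumption; the paper's objective may combine further terms."""
    pred = model(artifact_ct)
    loss = F.l1_loss(pred, clean_ct)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```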
Result: PGMP outperforms state-of-the-art methods on both synthetic and multi-center clinical datasets, particularly on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability.
Conclusion: The proposed PGMP framework successfully addresses limitations of existing MAR methods by combining physics-based simulation, deterministic restoration, and medical foundation model guidance, making it suitable for clinical applications requiring both accuracy and efficiency.
Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to “regression-to-the-mean”, while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP
[167] Taming Hallucinations: Boosting MLLMs’ Video Understanding via Counterfactual Video Generation
Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
Main category: cs.CV
TL;DR: DualityForge framework synthesizes counterfactual video QA pairs using diffusion-based editing to reduce MLLM hallucinations, achieving 24% improvement over baseline.
Details
Motivation: MLLMs suffer from visual ungrounded hallucinations due to over-reliance on language priors, especially with counterfactual videos. This stems from text-video data imbalance and high cost of collecting counterfactual data.
Method: DualityForge uses controllable diffusion-based video editing to transform real videos into counterfactual scenarios, automatically generating QA pairs. Duality-Normalized Advantage Training (DNA-Train) applies two-stage SFT-RL with pair-wise ℓ₁ advantage normalization for stable optimization.
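One plausible reading of pair-wise ℓ₁ advantage normalization is that each original/counterfactual rollout pair has its two advantages rescaled by the pair's ℓ₁ norm, so every pair contributes an update of comparable magnitude; the paper's exact formulation may differ from this sketch.

```python
import torch

def pairwise_l1_normalize(adv_orig: torch.Tensor, adv_edit: torch.Tensor,
                          eps: float = 1e-8):
    """Hedged reading of 'pair-wise l1 advantage normalization': rescale
    each original/counterfactual pair's advantages by the pair's l1 norm.
    adv_orig, adv_edit: per-pair advantage tensors of equal shape."""
    denom = adv_orig.abs() + adv_edit.abs() + eps   # per-pair l1 norm
    return adv_orig / denom, adv_edit / denom
```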
Result: 24.0% relative improvement over Qwen2.5-VL-7B baseline on counterfactual videos. Significant gains on both hallucination and general-purpose benchmarks, showing strong generalization.
Conclusion: The framework effectively reduces MLLM hallucinations by synthesizing counterfactual data and using contrastive training, with promising generalization capabilities.
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
[168] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training
Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Zhihua Wang, Fei Wu, Quanlin Li, Pinghong Zhou, Shuo Wang, Xian Yang
Main category: cs.CV
TL;DR: EndoRare is a one-shot generative framework that synthesizes diverse, high-fidelity rare gastrointestinal lesion images from a single reference, enhancing AI model performance and clinical training for rare diseases.
Details
Motivation: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, limiting data availability for developing reliable AI models and training novice clinicians, creating a "rare-disease gap" in both computer-aided diagnostics and clinical education.
Method: EndoRare uses language-guided concept disentanglement to separate pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. It’s a one-shot, retraining-free generative framework that works from a single reference image.
Result: In validation across four rare pathologies, synthetic images were clinically plausible and significantly enhanced downstream AI classifiers (improving true positive rate at low false-positive rates). Novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision in blinded reader studies.
Conclusion: EndoRare establishes a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education by generating diverse, high-fidelity lesion exemplars from minimal data.
Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.
[169] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction
Md. Enamul Hoq, Linda Larson-Prior, Fred Prior
Main category: cs.CV
TL;DR: Virtual-Eyes CT preprocessing pipeline improves generalist foundation models but degrades specialist models in lung cancer screening.
Details
Motivation: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT lung cancer screening, and its differential impact on generalist vs specialist models is unknown.
Method: Developed Virtual-Eyes, a 16-bit CT quality-control pipeline that enforces 512x512 resolution, rejects non-diagnostic series, extracts contiguous lung blocks using Hounsfield-unit filtering and lung-coverage scoring. Evaluated on 765 NLST patients using RAD-DINO, Merlin, Sybil, and ResNet-18 models with frozen encoders.
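The lung-block extraction step can be approximated as: score each axial slice by the fraction of voxels in a lung-like HU window, then keep the longest contiguous run of sufficiently covered slices. The window and threshold values below are illustrative assumptions, not the pipeline's calibrated values.

```python
import numpy as np

def extract_lung_block(volume_hu: np.ndarray, lo: int = -950, hi: int = -400,
                       min_cov: float = 0.05) -> np.ndarray:
    """Sketch of HU-based lung-coverage scoring and contiguous-block
    selection; thresholds are placeholders."""
    lungish = (volume_hu > lo) & (volume_hu < hi)   # (Z, H, W) lung-like mask
    coverage = lungish.mean(axis=(1, 2))            # per-slice coverage score
    keep = coverage > min_cov
    best_len, cur_len, best_span = 0, 0, (0, 0)
    for i, k in enumerate(keep):
        cur_len = cur_len + 1 if k else 0           # length of current run
        if cur_len > best_len:                      # track longest run so far
            best_len, best_span = cur_len, (i - cur_len + 1, i + 1)
    return volume_hu[best_span[0]:best_span[1]]
```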
Result: Virtual-Eyes improved RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and 0.619 to 0.735 (max pooling). Sybil and ResNet-18 degraded under Virtual-Eyes, and Merlin showed limited transferability regardless of preprocessing.
Conclusion: Anatomically targeted quality control can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.
[170] UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang
Main category: cs.CV
TL;DR: UniAct is a two-stage framework that integrates a fine-tuned multimodal language model with a causal streaming pipeline to enable humanoid robots to execute diverse multimodal instructions with sub-500 ms latency, achieving 19% improvement in zero-shot motion tracking.
Details
Motivation: The paper addresses the challenge of bridging high-level multimodal perception with whole-body execution in humanoid robotics. Existing methods struggle to translate heterogeneous instructions (language, music, trajectories) into stable, real-time actions, creating a bottleneck for creating versatile agents with human-level flexibility.
Method: UniAct uses a two-stage framework: 1) A fine-tuned multimodal language model (MLLM) for instruction understanding, and 2) A causal streaming pipeline for motion generation. The system unifies diverse inputs through a shared discrete codebook using FSQ (Finite Scalar Quantization), ensuring cross-modal alignment while constraining motions to a physically grounded manifold.
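FSQ itself is a known technique: bound each latent dimension, round it to a small set of levels, and pass gradients straight through. A minimal version follows; the level count is a placeholder, since UniAct's codebook configuration is not given in the summary.

```python
import torch

def fsq(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    """Minimal Finite Scalar Quantization (in the sense of Mentzer et al.):
    squash each latent dimension into a bounded range, snap it to an
    integer grid of `levels` values, and use a straight-through gradient."""
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half                   # squash into [-half, half]
    quantized = torch.round(bounded)                 # snap to the integer grid
    return bounded + (quantized - bounded).detach()  # straight-through estimator
```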
Result: The approach achieves sub-500 ms latency for executing multimodal instructions and yields a 19% improvement in success rate for zero-shot tracking of imperfect reference motions. The system was validated on UniMoCap, a 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios.
Conclusion: UniAct represents a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control, addressing the long-standing challenge of bridging multimodal instruction understanding with real-time whole-body execution.
Abstract: A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions – such as language, music, and trajectories – into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
[171] Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
Haijing Liu, Zhiyuan Song, Hefeng Wu, Tao Pu, Keze Wang, Liang Lin
Main category: cs.CV
TL;DR: CERES is a causal framework that adapts pre-trained RVOS models to egocentric videos by addressing language biases and visual confounding factors through dual-modal causal intervention.
Details
Motivation: Existing Ego-RVOS methods struggle with dataset biases (skewed object-action pairings) and egocentric visual challenges (rapid motion, occlusions), leading to spurious correlations and poor robustness.
Method: CERES implements dual-modal causal intervention: 1) backdoor adjustment to counteract language representation biases from dataset statistics, and 2) front-door adjustment to integrate semantic visual features with geometric depth information guided by causal principles.
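For reference, the backdoor adjustment the summary invokes has the standard textbook form below, where Z would range over the confounders (e.g., dataset object-action statistics); how CERES parameterizes and approximates this identity is paper-specific.

```latex
% Textbook backdoor adjustment, not CERES's specific instantiation:
P(Y \mid \mathrm{do}(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)
```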
Result: Extensive experiments show CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, demonstrating improved robustness to egocentric distortions.
Conclusion: CERES highlights the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding by addressing fundamental biases and confounding factors.
Abstract: Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.
[172] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
Main category: cs.CV
TL;DR: SenseNova-MARS is a multimodal agentic reasoning framework that uses RL to enable VLMs to interleave visual reasoning with tool manipulation (search, cropping) for complex visual tasks.
Details
Motivation: Current VLMs are limited to text-oriented reasoning or isolated tool use, lacking the human-like ability to seamlessly interleave dynamic tool manipulation with continuous reasoning in knowledge-intensive, visually complex scenarios requiring coordinated external tools.
Method: Introduces SenseNova-MARS framework with RL-based training using BN-GSPO algorithm to improve stability and tool-use reasoning. Dynamically integrates image search, text search, and image crop tools. Also creates HR-MMSearch benchmark for evaluation.
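The summary does not spell out BN-GSPO; one guess consistent with the name is group-centered advantages (as in GRPO/GSPO-style training) scaled by a batch-level rather than per-group statistic, which would damp variance in small groups. Treat the sketch below as speculative.

```python
import torch

def bn_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Speculative reading of 'batch-normalized' group advantages:
    center rewards within each group of rollouts, but scale by the
    batch-wide standard deviation for stabler updates.
    rewards: (num_groups, group_size)."""
    centered = rewards - rewards.mean(dim=1, keepdim=True)  # per-group baseline
    return centered / (rewards.std() + eps)                 # batch-level scale
```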
Result: Achieves SOTA on search and fine-grained image understanding benchmarks. SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models like Gemini-3-Flash and GPT-5.
Conclusion: SenseNova-MARS represents progress toward agentic VLMs with effective tool-use capabilities. The authors will release all code, models, and datasets to facilitate further research.
Abstract: While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model’s ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
[173] Spatial-aware Vision Language Model for Autonomous Driving
Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong
Main category: cs.CV
TL;DR: LVLDrive enhances Vision-Language Models for autonomous driving by incorporating LiDAR point clouds to improve 3D spatial understanding, addressing limitations of image-only methods.
Details
Motivation: Current Vision-Language Models (VLMs) for autonomous driving rely on 2D images, which struggle with accurate metric spatial reasoning and geometric inference, creating safety and reliability bottlenecks. The paper aims to bridge the gap between VLMs' language-based common sense and robust 3D metric understanding needed for safe autonomous driving.
Method: Proposes LVLDrive framework that incorporates LiDAR point clouds as an additional input modality. Introduces Gradual Fusion Q-Former to incrementally inject LiDAR features while preserving pre-trained VLM knowledge. Develops spatial-aware question-answering (SA-QA) dataset to explicitly teach 3D perception and reasoning capabilities.
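A common way to inject a new modality without disturbing pretrained weights is zero-initialized gated cross-attention; the sketch below shows that pattern as a stand-in for the Gradual Fusion Q-Former, whose actual design is likely more involved.

```python
import torch
import torch.nn as nn

class GradualFusionLayer(nn.Module):
    """Sketch of gradually injecting LiDAR features into a pretrained VLM
    stream: cross-attention whose output is scaled by a gate initialized
    at zero, so early training leaves the pretrained pathway untouched."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # learned, starts at 0

    def forward(self, vlm_tokens: torch.Tensor, lidar_tokens: torch.Tensor):
        fused, _ = self.attn(vlm_tokens, lidar_tokens, lidar_tokens)
        return vlm_tokens + torch.tanh(self.gate) * fused  # gated injection
```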
Result: Extensive experiments on driving benchmarks show LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making.
Conclusion: The work demonstrates the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems, successfully upgrading VLMs with robust 3D spatial understanding while preserving their existing knowledge base.
Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point cloud as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM’s existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
[174] The Mechanics of CNN Filtering with Rectification
Liam Frija-Altrac, Matthew Toews
Main category: cs.CV
TL;DR: The paper proposes elementary information mechanics, a new model linking convolutional filtering with rectification to physical theories of special relativity and quantum mechanics, showing how kernel decomposition into even/odd components relates to energy-momentum dynamics in information processing.
Details
Motivation: To establish a theoretical connection between information processing in convolutional neural networks and fundamental physical principles, specifically the energy-momentum relation from relativistic physics, providing a new mechanical framework for understanding CNN operations.
Method: Decompose convolutional kernels into orthogonal even and odd components, analyze their effects on image content (even components cause isotropic diffusion, odd components cause directional displacement), and examine these properties in the spectral domain using discrete cosine transform (DCT) to identify fundamental modes of information propagation.
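The even/odd decomposition is concrete enough to state directly: K_even = (K + flip(K))/2 and K_odd = (K - flip(K))/2, where flip is point reflection about the kernel center, and the odd-to-total energy ratio is the quantity the paper relates to displacement speed. A small numpy check:

```python
import numpy as np

def even_odd_split(kernel: np.ndarray):
    """Decompose a 2D kernel into orthogonal even and odd parts about its
    center and return the odd-to-total energy ratio."""
    flipped = kernel[::-1, ::-1]                  # point reflection about center
    k_even = (kernel + flipped) / 2
    k_odd = (kernel - flipped) / 2
    ratio = (k_odd ** 2).sum() / ((kernel ** 2).sum() + 1e-12)
    return k_even, k_odd, ratio

# Example: a horizontal Sobel-like kernel is purely odd (ratio -> 1).
sobel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
_, _, r = even_odd_split(sobel)
print(round(r, 3))  # 1.0
```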
Result: Demonstrates that the speed of information displacement is linearly related to the ratio of odd vs total kernel energy, and that small convolutional filters are dominated by low-frequency DCT bases (DC Σ and gradient components ∇), which define fundamental information propagation modes.
Conclusion: This work establishes the first link between information processing in generic CNNs and the energy-momentum relation from modern relativistic physics, proposing elementary information mechanics as a new theoretical framework for understanding convolutional filtering mechanics.
Abstract: This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $Σ$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.
[175] DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy
Main category: cs.CV
TL;DR: DermaVQA-DAS extends dermatological image analysis with patient-centered QA and lesion segmentation using a structured expert framework, benchmarking multimodal models with strong performance.
Details
Motivation: Existing dermatological datasets focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting applicability to patient-centered care.
Method: Introduces DermaVQA-DAS with Dermatology Assessment Schema (DAS) - expert-developed framework with 36 high-level and 27 fine-grained assessment questions. Provides annotated datasets for closed QA and segmentation, benchmarking multimodal models with various prompting strategies.
Result: For segmentation: prompt design impacts performance - augmented prompt with patient query yields best results (Jaccard 0.395, Dice 0.566). For QA: strong performance across models (0.729-0.798 accuracy), with o3 achieving best overall accuracy (0.798).
Conclusion: DermaVQA-DAS addresses patient-centered care gaps in dermatology, provides structured framework and benchmarks, and publicly releases resources to accelerate vision-language modeling research in dermatology.
Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
[176] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi
Main category: cs.CV
TL;DR: The paper presents a comprehensive framework and taxonomy for multi-modal pre-training to achieve Spatial Intelligence in autonomous systems, analyzing sensor integration, learning strategies, and proposing a roadmap for general-purpose foundation models.
Details
Motivation: The rapid advancement of autonomous systems (self-driving vehicles, drones) has intensified the need for true Spatial Intelligence from multi-modal sensor data. While foundation models excel in single-modal contexts, integrating capabilities across diverse sensors (cameras, LiDAR) to create unified understanding remains a formidable challenge.
Method: The paper presents a comprehensive framework for multi-modal pre-training, identifying core techniques driving progress. It dissects the interplay between foundational sensor characteristics and learning strategies, evaluates platform-specific datasets, and formulates a unified taxonomy for pre-training paradigms ranging from single-modality baselines to sophisticated unified frameworks.
Result: The paper investigates integration of textual inputs and occupancy representations to facilitate open-world perception and planning. It identifies critical bottlenecks such as computational efficiency and model scalability.
Conclusion: The paper proposes a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment in autonomous systems.
Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
[177] RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics
Gur-Eyal Sela, Kumar Krishna Agrawal, Bharathan Balaji, Joseph Gonzalez, Ion Stoica
Main category: cs.CV
TL;DR: RedunCut is a dynamic model size selection system for live video analytics that reduces compute costs by 14-62% at fixed accuracy through intelligent sampling and accurate performance prediction.
Details
Motivation: Live video analytics faces high inference costs with modern vision models across massive camera fleets. Existing dynamic model size selection methods fail to generalize to diverse workloads, particularly mobile videos and lower accuracy targets, due to inefficient sampling and inaccurate per-segment accuracy prediction.
Method: RedunCut uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction for dynamic model size selection.
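At its core, a measurement-driven sampling planner compares the expected compute saved by identifying a cheaper adequate model against the cost of the probe itself. A toy decision rule under that framing; all quantities stand in for RedunCut's learned estimates.

```python
def plan_sampling(p_small_suffices: float, sample_cost: float,
                  large_cost: float, small_cost: float) -> bool:
    """Toy cost-benefit check: probe extra models for this segment only
    when the expected savings from switching to the smaller model exceed
    the probing cost. Inputs are placeholders for learned estimates."""
    expected_savings = p_small_suffices * (large_cost - small_cost)
    return expected_savings > sample_cost

# e.g., plan_sampling(0.6, sample_cost=2.0, large_cost=10.0, small_cost=2.0) -> True
```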
Result: Across road-vehicle, drone, and surveillance videos with multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy while remaining robust to limited historical data and drift.
Conclusion: RedunCut effectively addresses the limitations of prior DMSS systems by optimizing sampling efficiency and improving accuracy prediction, enabling significant compute cost reductions for live video analytics without model retraining.
Abstract: Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.
[178] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model
Bohong Chen, Haiyang Liu
Main category: cs.CV
TL;DR: DyStream is a real-time talking head video generation system that achieves ultra-low latency (<100ms) for dyadic conversations using flow matching and causal encoding with minimal lookahead.
Details
Motivation: Existing chunk-based methods for talking head video generation require full non-causal context windows, introducing significant delays that prevent immediate non-verbal feedback needed for realistic listener responses in dyadic conversations.
Method: Uses a flow matching-based autoregressive model with two key designs: (1) stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) causal encoder enhanced by a lookahead module that incorporates short future context (60ms) to improve quality while maintaining low latency.
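A lookahead module of this kind can be realized as a temporal convolution with asymmetric padding: mostly-causal context plus a small, fixed window of future frames, which bounds the added latency. A sketch under that assumption, not DyStream's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookaheadConv1d(nn.Module):
    """Causal temporal convolution with bounded lookahead: each output
    frame may see `lookahead` future frames (e.g., ~60 ms worth) and is
    otherwise causal, capping the latency the module adds."""
    def __init__(self, channels: int, kernel_size: int = 5, lookahead: int = 1):
        super().__init__()
        assert lookahead < kernel_size
        self.left_pad = kernel_size - 1 - lookahead   # past context
        self.right_pad = lookahead                    # bounded future context
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T); asymmetric padding keeps the output length at T
        x = F.pad(x, (self.left_pad, self.right_pad))
        return self.conv(x)
```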
Result: Generates video within 34ms per frame, keeping entire system latency under 100ms. Achieves state-of-the-art lip-sync quality with offline LipSync Confidence score of 8.13 and online score of 7.61 on HDTF dataset.
Conclusion: DyStream provides a simple yet effective solution for real-time dyadic talking head video generation with ultra-low latency, significantly outperforming alternative causal strategies while maintaining high quality lip synchronization.
Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that can generate video in real-time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) we propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple and effective method significantly surpasses alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream can generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights, and code are available.
[179] AI-Driven Evaluation of Surgical Skill via Action Recognition
Yan Meng, Daniel A. Donoho, Marcelle Altshuler, Omar Arnaout
Main category: cs.CV
TL;DR: AI-driven framework automates microanastomosis skill assessment using video transformers and motion analysis, achieving 93.62% action segmentation accuracy and 76% classification accuracy matching expert evaluations.
Details
Motivation: Current surgical proficiency assessment methods are subjective, time-consuming, and require expert supervision, limiting scalability especially in low-resource settings. There's a need for objective, consistent, and scalable evaluation methods.
Method: Proposes an AI framework combining TimeSformer video transformer with hierarchical temporal attention and weighted spatial attention for action recognition, plus YOLO-based object detection/tracking for fine-grained motion analysis of instrument kinematics.
Result: Achieved 87.7% frame-level action segmentation accuracy (increased to 93.62% with post-processing) and 76% average classification accuracy in replicating expert assessments across five skill aspects on 58 expert-annotated videos.
Conclusion: The system provides objective, consistent, and interpretable feedback for surgical skill assessment, enabling more standardized, data-driven training and evaluation in surgical education with potential for broader scalability.
Abstract: The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system’s potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.
[180] Exploring Compositionality in Vision Transformers using Wavelet Representations
Akshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia, Deekshith Patel, Yashmitha Gogineni, Konda Reddy Mopuri, Sumohana S. Channappayya
Main category: cs.CV
TL;DR: ViT encoder representations show compositional properties when analyzed using Discrete Wavelet Transform primitives.
Details
Motivation: To understand how Vision Transformers structure information by investigating whether their representations exhibit compositionality, similar to what has been studied in language models.
Method: Introduced a framework using Discrete Wavelet Transform (DWT) to obtain input-dependent primitives for vision, then tested whether composed representations from these primitives can reproduce original image representations in the ViT encoder’s latent space.
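Because the DWT is linear, a one-level decomposition yields four input-dependent primitives that sum back to the image, which is what makes it a convenient basis for the composition test. A sketch using PyWavelets (the paper's exact primitive construction may differ):

```python
import numpy as np
import pywt

def dwt_primitives(image: np.ndarray):
    """One-level 2D DWT primitives: reconstruct the image four times,
    each keeping a single subband (approximation or one detail band).
    By linearity the four parts sum back to the original image
    (exactly so for even-sized inputs with the Haar wavelet)."""
    cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
    zero = np.zeros_like(cA)
    parts = [
        pywt.idwt2((cA,   (zero, zero, zero)), 'haar'),  # approximation
        pywt.idwt2((zero, (cH,   zero, zero)), 'haar'),  # horizontal detail
        pywt.idwt2((zero, (zero, cV,   zero)), 'haar'),  # vertical detail
        pywt.idwt2((zero, (zero, zero, cD  )), 'haar'),  # diagonal detail
    ]
    return parts  # np.allclose(sum(parts), image) for even-sized inputs
```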
Result: Primitives from one-level DWT decomposition produce encoder representations that approximately compose in latent space, suggesting ViTs structure information compositionally.
Conclusion: Vision Transformers exhibit compositional properties in their representation space, offering new insights into how they organize visual information, analogous to compositional structures found in language models.
Abstract: While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
[181] Spectral and Spatial Graph Learning for Multispectral Solar Image Compression
Prasiddha Siwakoti, Atefeh Khoshkhahtinat, Piyush M. Mehta, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
Main category: cs.CV
TL;DR: A learned image compression framework for multispectral solar imagery using graph-based modules to model inter-band relationships and reduce spatial redundancy, achieving better spectral fidelity and reconstruction quality than baselines.
Details
Motivation: High-fidelity compression of multispectral solar imagery is challenging for space missions due to limited bandwidth, requiring preservation of fine spectral and spatial details while balancing compression efficiency.
Method: Two complementary modules: (1) Inter-Spectral Windowed Graph Embedding (iSWGE) models inter-band relationships by representing spectral channels as graph nodes with learned edge features; (2) Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C) combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures.
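The node-mixing core of a band graph can be sketched as a learnable adjacency over the spectral channels applied to per-band feature vectors; iSWGE's windowing and learned edge features are omitted in this stand-in.

```python
import torch
import torch.nn as nn

class SpectralBandGraph(nn.Module):
    """Sketch of spectral channels as graph nodes: a learned band-by-band
    adjacency mixes per-band feature vectors (message passing), followed
    by a residual update. Only the node-mixing core of the idea."""
    def __init__(self, n_bands: int = 6, dim: int = 64):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(n_bands))  # learnable inter-band edges
        self.proj = nn.Linear(dim, dim)

    def forward(self, band_feats: torch.Tensor) -> torch.Tensor:
        # band_feats: (B, n_bands, dim), one feature vector per spectral band
        mixed = torch.softmax(self.adj, dim=-1) @ band_feats  # mix across bands
        return band_feats + self.proj(mixed)                  # residual update
```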
Result: On SDOML dataset across six EUV channels: 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates.
Conclusion: The proposed learned compression framework effectively balances compression efficiency with preservation of spectral and spatial details for solar observations, outperforming existing methods in spectral fidelity and reconstruction quality.
Abstract: High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at https://github.com/agyat4/sgraph.
[182] Using Large Language Models To Translate Machine Results To Human Results
Trishna Niraula, Jonathan Stubblefield
Main category: cs.CV
TL;DR: AI pipeline combines YOLO object detection with GPT-4 to generate radiology reports from chest X-rays, achieving strong clinical accuracy but lacking natural writing style compared to human radiologists.
Details
Motivation: Current AI systems in medical imaging only output structured predictions, requiring radiologists to manually translate these into narrative reports. There's a need to bridge this gap by automatically generating full diagnostic narratives from AI findings.
Method: Developed a pipeline integrating YOLOv5 and YOLOv8 for anomaly detection in chest X-rays, then used GPT-4 to generate natural-language reports from the structured detection outputs (bounding boxes and class labels).
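The hand-off between the detector and the LLM amounts to serializing structured detections into a prompt. The detection format and instruction wording below are illustrative; the study's actual prompt to GPT-4 is not given in the summary.

```python
def detections_to_prompt(detections) -> str:
    """Assemble an LLM prompt from structured detector output.

    detections: list of dicts like
        {"label": "Cardiomegaly", "conf": 0.91, "box": [x1, y1, x2, y2]}
    (an assumed format, not the study's actual schema)."""
    lines = [
        f"- {d['label']} (confidence {d['conf']:.2f}) at box {d['box']}"
        for d in detections
    ]
    return (
        "You are drafting a chest X-ray report. Findings from an object "
        "detector:\n" + "\n".join(lines) +
        "\nWrite a findings section and a one-sentence impression."
    )
```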
Result: YOLO models showed strong detection accuracy, and GPT-4 generated reports with high semantic similarity to ground truth (measured by cosine similarity). Human evaluation gave GPT-4 high clarity scores (4.88/5) but lower natural writing flow scores (2.81/5).
Conclusion: The AI pipeline achieves clinical accuracy in report generation but remains stylistically distinguishable from radiologist-authored text, indicating current systems can produce medically accurate content but need improvement in natural language fluency.
Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.
[183] Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression
Manikanta Kotthapalli, Banafsheh Rekabdar
Main category: cs.CV
TL;DR: MS-VQ-VAE: A lightweight multi-scale VQ-VAE for generating compact latent representations of low-resolution video (64x64) suitable for edge deployment, achieving 25.96 dB PSNR and 0.8375 SSIM on UCF101.
Details
Motivation: Traditional video codecs (H.264/HEVC) lack native support for machine learning latent representations, limiting integration into deep learning pipelines. Need for efficient compression suitable for bandwidth-sensitive scenarios like CDNs and edge devices.
Method: Extends VQ-VAE-2 to spatiotemporal setting with two-level hierarchical latent structure using 3D residual convolutions. Incorporates perceptual loss from pre-trained VGG16. Lightweight design (18.5M parameters) optimized for 64x64 video clips (32 frames at 16 FPS).
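The quantization bottleneck at each scale of such a hierarchy is a standard VQ layer: snap each latent vector to its nearest codebook entry and pass gradients straight through. A minimal version, with codebook size and dimension as placeholders:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Standard VQ bottleneck of the kind VQ-VAE-2-style models stack at
    multiple scales: nearest-codebook lookup with a straight-through
    gradient. Commitment/codebook losses are omitted for brevity."""
    def __init__(self, n_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (..., dim); squared distances to all codebook entries
        flat = z.reshape(-1, z.shape[-1])
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                       # nearest code per vector
        q = self.codebook(idx).view_as(z)
        return z + (q - z).detach(), idx            # straight-through latents
```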
Result: Achieves 25.96 dB PSNR and 0.8375 SSIM on UCF101 test set. Improves over single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. Model is lightweight and suitable for edge deployment.
Conclusion: Proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios including real-time streaming, mobile video analytics, and CDN storage optimization.
Abstract: The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.
[184] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou
Main category: cs.CV
TL;DR: PhyGDPO: A physics-aware video generation framework that uses physics-augmented data and groupwise preference optimization to create physically consistent videos.
Details
Motivation: Current text-to-video generation methods produce good visual quality but often violate physical laws. Existing approaches struggle with physical reasoning generalization and lack training data with rich physics interactions.
Method: 1) PhyAugPipe: Uses VLM with chain-of-thought reasoning to create PhyVidGen-135K dataset. 2) PhyGDPO: Physics-aware Groupwise Direct Preference Optimization with Physics-Guided Rewarding (PGR) scheme using VLM-based physics rewards. 3) LoRA-Switch Reference (LoRA-SR) for efficient training without reference duplication.
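The groupwise Plackett-Luce model assigns a ranking probability of $\prod_i \exp(s_i) / \sum_{j \ge i} \exp(s_j)$ to scores ordered best-to-worst; its negative log-likelihood is a few lines, though how PhyGDPO derives and weights the scores from policy and rewards is paper-specific.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a ranking under the Plackett-Luce model.
    scores: 1-D tensor of per-sample scores ordered best-to-worst within
    one preference group."""
    # element i of rev_logcumsum = log sum_{j >= i} exp(scores[j])
    rev_logcumsum = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
    return -(scores - rev_logcumsum).sum()
```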
Result: Significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2 benchmarks.
Conclusion: The proposed framework successfully addresses physical consistency in video generation through physics-augmented data collection and principled physics-aware optimization, achieving superior performance on physics-focused benchmarks.
Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
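The groupwise Plackett-Luce likelihood the abstract builds on is straightforward to write down. The sketch below computes the log-probability of a full ranking from per-candidate scores; how PhyGDPO combines this with a reference policy and the PGR rewards is not reproduced here.

```python
import torch

def plackett_luce_logprob(scores: torch.Tensor) -> torch.Tensor:
    """Log-likelihood of a full ranking under the Plackett-Luce model.

    `scores` holds model scores for a group of K candidates, ordered from
    most-preferred to least-preferred (e.g. by a physics reward).
    """
    # At each position i, the chosen item competes against the remaining set:
    # log P = sum_i [ s_i - logsumexp(s_i .. s_K) ]
    rev_cum = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    return (scores - rev_cum).sum(-1)

# Toy example: 4 videos in a group, ranked by a (hypothetical) physics reward.
scores = torch.tensor([2.1, 1.3, 0.4, -0.5], requires_grad=True)
loss = -plackett_luce_logprob(scores)  # maximize the ranking likelihood
loss.backward()
print(loss.item(), scores.grad)
```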
[185] OCP-LS: An Efficient Algorithm for Visual Localization
Jindi Zhong, Hongxia Wang, Huanshui Zhang
Main category: cs.CV
TL;DR: A novel second-order optimization algorithm for deep learning that uses OCP method and Hessian diagonal approximation to achieve faster convergence and better performance on visual localization tasks.
Details
Motivation: To address large-scale optimization problems in deep learning, particularly for visual localization tasks, where conventional optimization algorithms may have limitations in convergence speed, training stability, and robustness to noise.
Method: Proposes a second-order optimization algorithm that incorporates the OCP (Optimal Control Problem) method and appropriately approximates the diagonal elements of the Hessian matrix to make it scalable for large-scale deep learning problems.
Result: Extensive experiments on multiple standard visual localization benchmarks show significant superiority over conventional optimization algorithms, achieving competitive localization accuracy with faster convergence, enhanced training stability, and improved robustness to noise interference.
Conclusion: The proposed second-order optimization framework effectively addresses large-scale optimization challenges in deep learning, particularly for visual localization, offering a practical solution with superior performance characteristics compared to existing methods.
Abstract: This paper proposes a novel second-order optimization algorithm. It aims to address large-scale optimization problems in deep learning by incorporating the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimization algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.
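The summary does not spell out how the Hessian diagonal is approximated; one standard scalable choice is Hutchinson's estimator, sketched below in PyTorch together with a Newton-like preconditioned step.

```python
import torch

def hessian_diag(loss, params, n_samples: int = 4):
    """Hutchinson estimate of diag(H): E_z[z * (Hz)] with Rademacher z.

    A common scalable approximation; the paper's exact scheme is not
    specified in the summary above."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2 - 1 for p in params]  # +/- 1
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs,
                                   retain_graph=True)
        for e, z, hv in zip(est, zs, hvps):
            e.add_(z * hv / n_samples)  # accumulate the running average
    return est

# Newton-like step using the estimated diagonal as a preconditioner.
w = torch.randn(5, requires_grad=True)
loss = (w ** 4).sum()                     # true diag(H) is 12 * w**2
diag = hessian_diag(loss, [w])[0]
grad = torch.autograd.grad(loss, [w])[0]
with torch.no_grad():
    w -= grad / (diag.abs() + 1e-6)       # diagonal preconditioning
```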
[186] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong
Main category: cs.CV
TL;DR: CPJ is a training-free few-shot framework that uses structured image captions and LLM-as-Judge to improve agricultural pest/disease VQA without fine-tuning.
Details
Motivation: Existing crop disease diagnosis methods require costly supervised fine-tuning and perform poorly under domain shifts, lacking interpretability.
Method: Caption-Prompt-Judge (CPJ) framework: generates multi-angle captions using vision-language models, refines them via LLM-as-Judge module, then uses dual-answer VQA for recognition and management responses.
Result: On CDDMBench, CPJ improves disease classification by +22.7 percentage points and QA score by +19.5 points over no-caption baselines using GPT-5-mini captions with GPT-5-Nano.
Conclusion: CPJ provides transparent, evidence-based reasoning for robust and explainable agricultural diagnosis without requiring fine-tuning, advancing agricultural AI applications.
Abstract: Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
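Schematically, the Caption-Prompt-Judge loop can be written as below. All three callables (caption, judge, answer) are hypothetical stand-ins for real VLM/LLM API wrappers, and the prompts only paraphrase the multi-angle captioning idea.

```python
def cpj_diagnose(image, caption, judge, answer, rounds: int = 2) -> dict:
    """Training-free Caption-Prompt-Judge loop, in outline.

    `caption`, `judge`, and `answer` wrap real model APIs in practice."""
    angles = ["leaf symptoms", "pest morphology", "overall plant condition"]
    captions = [caption(image, f"Describe the {a} in detail.") for a in angles]
    for _ in range(rounds):                       # LLM-as-Judge refinement
        for i, c in enumerate(captions):
            verdict = judge(c, "Is this caption specific and grounded?")
            if verdict != "ok":
                captions[i] = judge(c, "Rewrite to fix: " + verdict)
    context = "\n".join(captions)
    return {"diagnosis": answer(image, context, "Which disease or pest is shown?"),
            "management": answer(image, context, "What management is advised?")}

# Stub usage; swap the lambdas for real model calls.
out = cpj_diagnose("leaf.jpg",
                   caption=lambda img, p: f"[caption: {p}]",
                   judge=lambda text, p: "ok",
                   answer=lambda img, ctx, q: f"[answer: {q}]")
print(out)
```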
[187] RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
Tianyi Zhao, Jiawen Xi, Linhui Xiao, Junnan Li, Xue Yang, Maoxun Yuan, Xingxing Wei
Main category: cs.CV
TL;DR: RGBT-Ground: First large-scale visual grounding benchmark for complex real-world scenarios using RGB-Thermal image pairs, with RGBT-VGNet baseline outperforming adapted methods.
Details
Motivation: Existing VG benchmarks are limited to clean environments (like COCO) with limited scene diversity, failing to reflect real-world complexity (illumination, weather changes) needed for evaluating model robustness in safety-critical applications.
Method: 1) Create RGBT-Ground benchmark with spatially aligned RGB-Thermal image pairs, referring expressions, bounding boxes, and fine-grained annotations. 2) Propose unified framework supporting uni-modal (RGB/TIR) and multi-modal (RGB-TIR) inputs. 3) Develop RGBT-VGNet baseline for fusing complementary visual modalities.
Result: RGBT-VGNet significantly outperforms adapted existing methods, particularly in challenging nighttime and long-distance scenarios, demonstrating effectiveness of multi-modal fusion for robust visual grounding.
Conclusion: RGBT-Ground enables comprehensive evaluation of visual grounding in complex real-world conditions, and RGBT-VGNet provides effective baseline for robust multi-modal grounding, with all resources to be publicly released for future research.
Abstract: Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
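As a toy illustration of RGB-TIR fusion (not the RGBT-VGNet architecture itself), the module below encodes each modality separately and blends them with a learned channel gate, so thermal features can take over when the RGB stream is uninformative, e.g. at night.

```python
import torch
import torch.nn as nn

class RGBTFusion(nn.Module):
    """Two-stream fusion block with a channel-attention gate (illustrative)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, channels, 3, padding=1)
        self.tir_enc = nn.Conv2d(1, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )
        self.mix = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        f_rgb, f_tir = self.rgb_enc(rgb), self.tir_enc(tir)
        both = torch.cat([f_rgb, f_tir], dim=1)
        g = self.gate(both)                          # per-channel fusion weight
        return self.mix(both) * g + f_rgb * (1 - g)  # fall back to RGB when low

fusion = RGBTFusion()
out = fusion(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```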
[188] Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning
Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang, Kaiyu Li, Jianfei Yang, Quan Wang
Main category: cs.CV
TL;DR: DARFT improves CDVQA by addressing decision ambiguity through reinforcement fine-tuning that targets samples with small probability margins between correct answers and strong distractors.
Details
Motivation: Current CDVQA models often fail due to decision ambiguity rather than clear errors, where models assign similar confidence to correct answers and strong distractors. This ambiguity undermines model discriminability and robustness.
Method: Proposes DARFT: Decision-Ambiguity-guided Reinforcement Fine-Tuning. First mines Decision-Ambiguous Samples (DAS) using an SFT-trained reference policy, then applies group-relative policy optimization on the mined subset using multi-sample decoding and intra-group relative advantages.
Result: Extensive experiments show consistent gains over supervised fine-tuning baselines, particularly under few-shot settings. The method effectively suppresses strong distractors and sharpens decision boundaries without additional supervision.
Conclusion: Explicitly optimizing decision-ambiguous samples is crucial for improving CDVQA model discriminability and robustness. DARFT provides an effective framework for addressing decision ambiguity through targeted reinforcement fine-tuning.
Abstract: Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
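The DAS criterion is easy to state in code: a sample is decision-ambiguous when the probability margin between the ground-truth answer and the strongest distractor falls below a threshold. The threshold value here is an assumption.

```python
import torch

def mine_decision_ambiguous(logits: torch.Tensor,
                            labels: torch.Tensor,
                            tau: float = 0.1) -> torch.Tensor:
    """Flag samples whose margin p(ground truth) - p(best distractor) < tau."""
    probs = logits.softmax(dim=-1)
    p_gt = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Mask out the ground truth, then take the best competing answer.
    masked = probs.scatter(1, labels.unsqueeze(1), float("-inf"))
    p_distractor = masked.max(dim=1).values
    return (p_gt - p_distractor) < tau  # boolean mask over the batch

logits = torch.tensor([[2.0, 1.9, -1.0],    # ambiguous: a close call
                       [4.0, -2.0, -3.0]])  # confident prediction
labels = torch.tensor([0, 0])
print(mine_decision_ambiguous(logits, labels))  # tensor([ True, False])
```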
[189] SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks
Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He
Main category: cs.CV
TL;DR: SliceLens is a hypothesis-driven framework using LLMs/VLMs for fine-grained error slice discovery in instance-level vision tasks, outperforming existing methods and validated on new benchmark FeSD.
Details
Motivation: Existing slice discovery methods are limited to image classification, not handling multi-instance tasks like detection/segmentation. Current approaches lack fine-grained reasoning for complex visual relationships in real-world error slices, and benchmarks are biased toward classification with artificial ground truth.
Method: SliceLens leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning. It enables reliable identification of fine-grained, interpretable error slices by combining hypothesis generation with verification.
Result: SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on the new FeSD benchmark. It identifies interpretable slices that facilitate actionable model improvements, validated through model repair experiments.
Conclusion: SliceLens addresses limitations of existing slice discovery methods by enabling fine-grained error analysis for instance-level vision tasks through LLM/VLM-powered hypothesis-driven reasoning, with superior performance demonstrated on the new FeSD benchmark.
Abstract: Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
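For reference, the Precision@10 figure quoted above is the usual top-k precision; a minimal version, assuming an exact-match notion of a "correct" slice:

```python
def precision_at_k(retrieved_slices, true_slice_ids, k: int = 10) -> float:
    """Fraction of the top-k discovered slices that match an annotated
    ground-truth error slice."""
    top_k = retrieved_slices[:k]
    hits = sum(1 for s in top_k if s in true_slice_ids)
    return hits / k

# Hypothetical slice names for illustration.
ranked = ["occluded-pedestrian", "night-cyclist", "blurry-sign", "rainy-car"]
gold = {"occluded-pedestrian", "rainy-car"}
print(precision_at_k(ranked, gold, k=4))  # 0.5
```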
[190] 3D Semantic Segmentation for Post-Disaster Assessment
Nhut Le, Maryam Rahnemoonfar
Main category: cs.CV
TL;DR: This paper addresses the lack of specialized 3D datasets for post-disaster assessment by creating a UAV-captured dataset from Hurricane Ian and evaluating SOTA 3D segmentation models, revealing their limitations in disaster environments.
Details
Motivation: Natural disasters cause severe threats to human lives and economic losses, but existing deep learning models lack datasets specifically designed for post-disaster environments, creating a gap in effective 3D semantic segmentation for disaster assessment.
Method: Constructed a specialized 3D dataset using UAV-captured aerial footage of Hurricane Ian (2022) over affected areas, employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. Evaluated state-of-the-art 3D semantic segmentation models including Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset.
Result: Evaluation exposed significant limitations in existing 3D semantic segmentation methods for disaster-stricken regions, demonstrating that current SOTA models perform inadequately in post-disaster environments.
Conclusion: The findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response capabilities.
Abstract: The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset using unmanned aerial vehicle (UAV)-captured aerial footage of Hurricane Ian (2022) over affected areas, employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated state-of-the-art (SOTA) 3D semantic segmentation models, Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs, on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.
[191] Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers
Zheng Liu, Jinchao Zhu, Gao Huang
Main category: cs.CV
TL;DR: CLoRA is a novel collaborative low-rank adaptation method that balances fine-tuning performance and parameter efficiency for vision transformers through base-space sharing and diversity enhancement.
Details
Motivation: Existing LoRA methods for vision transformers either sacrifice performance for parameter efficiency or introduce excessive parameters, failing to achieve a good balance between learning performance and parameter efficiency.
Method: CLoRA consists of two components: 1) Base-space sharing where all low-rank modules share down/up-projection spaces to maintain parameter efficiency while expanding learning capacity, and 2) Sample-agnostic diversity enhancement (SADE) that regularizes similarities among low-rank matrices to encourage diverse representations and reduce redundancy.
Result: Extensive experiments on image and point cloud datasets show CLoRA achieves better balance between learning performance and parameter efficiency compared to state-of-the-art methods, while requiring the fewest GFLOPs for point cloud analysis.
Conclusion: CLoRA successfully addresses the trade-off between fine-tuning performance and parameter efficiency in vision transformer adaptation, offering an effective collaborative approach with shared projection spaces and diversity regularization.
Abstract: Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.
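A rough sketch of the two ideas, with illustrative pool sizes and a cosine-similarity stand-in for the SADE regularizer: every low-rank module mixes its factors from one shared pool of projection bases, and a penalty keeps the mixtures diverse.

```python
import torch
import torch.nn as nn

class SharedBasisLoRA(nn.Module):
    """Base-space sharing sketch: modules only learn mixing coefficients
    over one shared pool of down/up-projection bases (assumed design)."""
    def __init__(self, dim: int, rank: int = 4, n_bases: int = 8, n_modules: int = 12):
        super().__init__()
        self.down_pool = nn.Parameter(torch.randn(n_bases, dim, rank) * 0.02)
        self.up_pool = nn.Parameter(torch.zeros(n_bases, rank, dim))
        self.coef = nn.Parameter(torch.randn(n_modules, n_bases) * 0.02)

    def delta_weight(self, i: int) -> torch.Tensor:
        w = self.coef[i].softmax(0)
        down = (w[:, None, None] * self.down_pool).sum(0)  # (dim, rank)
        up = (w[:, None, None] * self.up_pool).sum(0)      # (rank, dim)
        return down @ up                                   # additive update

    def diversity_penalty(self) -> torch.Tensor:
        # Penalize pairwise cosine similarity of the mixing coefficients,
        # a simplified stand-in for the paper's SADE regularizer.
        c = nn.functional.normalize(self.coef, dim=1)
        sim = c @ c.t()
        off_diag = sim - torch.diag(torch.diag(sim))
        return off_diag.pow(2).mean()

lora = SharedBasisLoRA(dim=768)
print(lora.delta_weight(0).shape, lora.diversity_penalty().item())
```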
[192] MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding
Panquan Yang, Junfei Huang, Zongzhangbao Yin, Yingsong Hu, Anni Xu, Xinyi Luo, Xueqi Sun, Hai Wu, Sheng Ao, Zhaoxing Zhu, Chenglu Wen, Cheng Wang
Main category: cs.CV
TL;DR: The paper introduces 3D visual grounding for outdoor monitoring scenarios, creates the first large-scale dataset (MoniRefer), and proposes an end-to-end method (Moni3DVG) for infrastructure-level traffic scene understanding.
Details
Motivation: Existing 3D visual grounding focuses on indoor/outdoor driving scenes, but outdoor monitoring scenarios from roadside infrastructure remain unexplored due to lack of paired point cloud-text data. Infrastructure-level understanding of traffic scenes is critical for roadside systems.
Method: Proposes Moni3DVG, an end-to-end method that leverages appearance information from images and geometry/optical information from point clouds for multi-modal feature learning and 3D object localization.
Result: Created MoniRefer dataset with ~136,018 objects and 411,128 natural language expressions from real-world traffic intersections. Extensive experiments demonstrate the superiority and effectiveness of the proposed method.
Conclusion: The paper introduces a novel 3D visual grounding task for outdoor monitoring, provides the first large-scale dataset for this domain, and presents an effective multi-modal approach that outperforms existing methods.
Abstract: 3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is critical for roadside infrastructure systems to interpret natural language and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on indoor and outdoor driving scenes, while outdoor monitoring scenarios remain unexplored due to the scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images together with the geometric and optical information from point clouds for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.
[193] LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning
Shuyuan Lin, Yu Guo, Xiao Chen, Yanjie Liang, Guobao Xiao, Feiran Huang
Main category: cs.CV
TL;DR: Proposes Layer-by-Layer Hierarchical Attention Network to improve feature point matching by handling outliers through stage fusion, hierarchical extraction, and attention mechanisms.
Details
Motivation: Feature point matching is fundamental but suffers from outlier interference, especially with high outlier proportions. Need to extract high-quality information while reducing negative sample errors.
Method: Layer-by-Layer Hierarchical Attention Network with: 1) Layer-by-layer channel fusion module to preserve semantic information from each stage, 2) Hierarchical attention module to capture global perception and structural semantics, 3) Two architectures for feature extraction and integration.
Result: Outperforms state-of-the-art methods on YFCC100M and SUN3D datasets for both outlier removal and camera pose estimation.
Conclusion: The proposed network effectively addresses outlier issues in feature point matching through hierarchical attention mechanisms, improving matching precision and robustness.
Abstract: Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network’s representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at http://www.linshuyuan.com.
[194] FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes
Qingyu Xu, Runtong Zhang, Zihuan Qiu, Fanman Meng
Main category: cs.CV
TL;DR: This paper introduces FRS-YOLO, an improved object detection model for fire rescue scenarios, along with a new FireRescue dataset covering urban, mountainous, forest, and water rescue scenes with 8 key categories.
Details
Motivation: Current fire detection research has two main limitations: 1) insufficient focus on complex urban rescue scenes (more frequent than forest/mountain areas), and 2) limited detection categories (mostly just flames/smoke) lacking crucial command decision targets like fire trucks and firefighters.
Method: 1) Created FireRescue dataset with 15,980 images and 32,000 bounding boxes covering 8 key categories across multiple rescue scenarios. 2) Proposed FRS-YOLO with two improvements: a plug-and-play multidimensional collaborative enhancement attention module to reduce inter-class confusion, and a dynamic feature sampler to enhance foreground features and address smoke occlusion/background interference.
Result: The paper demonstrates that object detection in fire rescue scenarios is highly challenging, and the proposed FRS-YOLO method effectively improves detection performance of YOLO series models in this specific context.
Conclusion: The FireRescue dataset and FRS-YOLO model address critical gaps in fire rescue object detection by covering complex urban scenes and key command-relevant categories, with improved performance through attention mechanisms and feature enhancement techniques.
Abstract: Object detection in fire rescue scenarios is important for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named “FireRescue” for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Secondly, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidimensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.
[195] Renormalization Group Guided Tensor Network Structure Search
Maolin Wang, Bowen Yu, Sheng Zhang, Linjie Mi, Wanyu Wang, Yiqi Wang, Pengyue Jia, Xuetao Wei, Zenglin Xu, Ruocheng Guo, Xiangyu Zhao
Main category: cs.CV
TL;DR: RGTN is a physics-inspired tensor network structure search framework that uses renormalization group flows for multi-scale optimization, achieving state-of-the-art compression with 4-600× speedup over existing methods.
Details
Motivation: Existing tensor network structure search (TN-SS) methods face limitations in computational tractability, structure adaptivity, and optimization robustness. They struggle with single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth evolution, and separated structure-parameter optimization causing inefficiency.
Method: RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions, featuring learnable edge gates for topology modification and intelligent proposals based on physical quantities like node tension (measuring local stress) and edge information flow (quantifying connectivity importance). It starts from low-complexity coarse scales and refines to finer ones.
Result: Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods.
Conclusion: RGTN validates the effectiveness of physics-inspired approaches for tensor network structure search, enabling efficient discovery of compact network structures while escaping local minima via scale-induced perturbations.
Abstract: Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600× faster than existing methods, validating the effectiveness of our physics-inspired approach.
[196] From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation
Siyang Wang, Hanting Li, Wei Li, Jie Hu, Xinghao Chen, Feng Zhao
Main category: cs.CV
TL;DR: RadAR: A parallelizable framework for accelerating autoregressive visual generation using radial topology and nested attention to preserve spatial coherence while improving efficiency.
Details
Motivation: Traditional autoregressive models in visual generation suffer from low inference efficiency due to sequential token-by-token decoding, despite their success in language modeling. Visual tokens exhibit strong local dependencies and spatial correlations that aren't fully exploited by standard raster-scan decoding orders.
Method: Organizes generation around radial topology: selects initial token as starting point, groups other tokens into concentric rings by spatial distance from center, generates ring-wise from inner to outer regions enabling parallel prediction within same rings. Introduces nested attention mechanism to dynamically refine implausible outputs during forward pass to mitigate error accumulation.
Result: Significantly improves generation efficiency while preserving representational capacity and structural locality of visual scenes. The radial parallel prediction with dynamic output correction prevents model collapse and reduces error accumulation.
Conclusion: RadAR provides an efficient and parallelizable framework for autoregressive visual generation that maintains spatial coherence and structural locality while substantially increasing parallelization through radial topology and nested attention mechanisms.
Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
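The radial grouping itself is simple to reproduce. The sketch below buckets an 8x8 token grid into concentric rings by Chebyshev distance from the center token; the paper's exact distance metric and center selection may differ.

```python
import numpy as np

def radial_rings(grid: int, center=None):
    """Group token positions of a grid x grid map into concentric rings."""
    cy, cx = center if center else (grid // 2, grid // 2)
    ys, xs = np.mgrid[:grid, :grid]
    dist = np.maximum(np.abs(ys - cy), np.abs(xs - cx))  # Chebyshev distance
    return [np.argwhere(dist == r) for r in range(dist.max() + 1)]

rings = radial_rings(8)
# Decoding proceeds ring by ring; all tokens in one ring are predicted in
# parallel instead of one raster-scan token at a time.
for r, ring in enumerate(rings[:3]):
    print(f"ring {r}: {len(ring)} tokens")  # 1, 8, 16 tokens
```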
[197] Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting
Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao
Main category: cs.CV
TL;DR: EVOL-SAM3 is a zero-shot reasoning segmentation framework that uses evolutionary search at inference time to iteratively refine prompts through a Generate-Evaluate-Evolve loop, outperforming both static baselines and supervised methods.
Details
Motivation: Current reasoning segmentation approaches have significant limitations: SFT suffers from catastrophic forgetting and domain dependency, RL has training instability and rigid reward functions, and training-free methods are limited by static single-pass inference that lacks reasoning depth and self-correction capabilities.
Method: EVOL-SAM3 reformulates reasoning segmentation as an inference-time evolutionary search process. It maintains a population of prompt hypotheses and iteratively refines them through a “Generate-Evaluate-Evolve” loop with three key components: Visual Arena for reference-free pairwise tournament evaluation, Semantic Mutation operator for diversity injection and error correction, and Heterogeneous Arena integrating geometric priors with semantic reasoning for final selection.
Result: Extensive experiments show EVOL-SAM3 substantially outperforms static baselines and significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting.
Conclusion: EVOL-SAM3 successfully addresses the limitations of current reasoning segmentation approaches by introducing an evolutionary search paradigm at inference time, enabling deeper reasoning, self-correction capabilities, and superior performance without requiring training.
Abstract: Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass “generate-then-segment” chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a “Generate-Evaluate-Evolve” loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
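In miniature, the Generate-Evaluate-Evolve loop looks like the sketch below, where `fitness` stands in for the Visual Arena's tournaments and `mutate` for the Semantic Mutation operator; both are caller-supplied assumptions.

```python
import random

def evolve_prompts(seed_prompts, fitness, mutate, generations: int = 3, pop: int = 6):
    """Minimal evolutionary prompt search: rank, keep survivors, refill."""
    population = list(seed_prompts)
    for _ in range(generations):
        # Evaluate: rank the current hypotheses by fitness.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop // 2]
        # Evolve: refill the population with mutated copies of survivors.
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop - len(survivors))]
    return max(population, key=fitness)

# Toy run: the stand-in "fitness" prefers longer, more specific prompts.
best = evolve_prompts(
    seed_prompts=["the red cup", "cup"],
    fitness=len,
    mutate=lambda p: p + random.choice([" on the table", " near the window"]),
)
print(best)
```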
[198] FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation
Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Main category: cs.CV
TL;DR: FlowBlending is a stage-aware multi-model sampling strategy for video diffusion models that uses large models at capacity-sensitive timesteps and small models at intermediate stages, achieving up to 1.65x faster inference with 57.35% fewer FLOPs while maintaining quality.
Details
Motivation: The authors observed that model capacity impact varies across timesteps in video diffusion models: it is crucial for early and late stages but negligible during intermediate stages. This insight motivates a more efficient sampling strategy that doesn't require full model capacity throughout the entire generation process.
Method: FlowBlending employs a stage-aware multi-model sampling approach: uses large models at capacity-sensitive stages (early and late timesteps) and switches to small models at intermediate stages. The method includes criteria for choosing stage boundaries and uses velocity-divergence analysis as a proxy for identifying capacity-sensitive regions.
Result: Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B) models, FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs while maintaining visual fidelity, temporal coherence, and semantic alignment comparable to large models. It’s also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup.
Conclusion: FlowBlending demonstrates that intelligently allocating model capacity across different generation stages can significantly improve inference efficiency without sacrificing quality, offering a practical approach to accelerate video diffusion models while maintaining their generation capabilities.
Abstract: In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
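The stage-aware switch reduces to choosing which denoiser to call at each timestep of an otherwise ordinary sampling loop. Below is a minimal Euler rectified-flow sampler with assumed boundary fractions; the paper picks its boundaries via a velocity-divergence criterion.

```python
import torch

@torch.no_grad()
def flowblending_sample(large_model, small_model, x, n_steps: int = 50,
                        early: float = 0.2, late: float = 0.8):
    """Euler flow sampling, switching models by stage: the large model
    covers the capacity-sensitive early/late timesteps, the small one
    the middle. Boundary fractions here are illustrative assumptions."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i / n_steps
        model = large_model if (t < early or t >= late) else small_model
        v = model(x, torch.tensor(t))   # predicted velocity field
        x = x + v * dt                  # Euler step along the flow
    return x

# Stub models standing in for the large / small video denoisers.
big = lambda x, t: -x                   # placeholder velocity predictors
small = lambda x, t: -0.9 * x
clip = flowblending_sample(big, small, torch.randn(1, 3, 8, 32, 32))
print(clip.shape)
```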
[199] EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu
Main category: cs.CV
TL;DR: EchoFoley introduces a new video-grounded sound generation task with fine-grained control, addressing limitations of current VT2A models through symbolic event representation and a large curated dataset.
Details
Motivation: Current video-text-to-audio (VT2A) models suffer from three key limitations: visual-text conditioning imbalance leading to visual dominance, lack of fine-grained controllability definitions, and weak instruction understanding due to brief categorical tags in existing datasets.
Method: Introduces EchoFoley task with symbolic representation for sounding events specifying when, what, and how sounds are produced. Creates EchoFoley-6k benchmark with 6,000+ video-instruction-annotation triplets. Proposes EchoVidia framework with sounding-event-centric agentic generation using slow-fast thinking strategy.
Result: EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
Conclusion: The EchoFoley framework successfully addresses key limitations of current VT2A models by enabling fine-grained sound control through symbolic event representation and hierarchical semantic control, significantly improving both controllability and perceptual quality.
Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: First, an imbalance between visual and textual conditioning that leads to visual dominance; Second, the absence of a concrete definition for fine-grained controllable generation; Third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
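One way to picture the symbolic representation is a small record per sounding event covering when, what, and how; the field names below are illustrative, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class SoundingEvent:
    """One symbolic sounding event (illustrative field names)."""
    start_s: float      # when: onset time in the video
    end_s: float        # when: offset time
    source: str         # what: the sounding object or action
    manner: str         # how: timbre / intensity / style description

# Fine-grained edit instructions can then target events directly,
# e.g. insert a new event without touching the rest of the track.
events = [
    SoundingEvent(0.5, 1.2, "glass cup", "sharp shatter, close-mic"),
    SoundingEvent(1.2, 4.0, "rain on window", "soft, continuous background"),
]
events.append(SoundingEvent(2.0, 2.4, "door slam", "muffled, distant"))
print(sorted(events, key=lambda e: e.start_s))
```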
[200] Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression
Xiang Liu, Yimin Zhou, Jinxiang Wang, Yujun Huang, Shuzhao Xie, Shiyu Qin, Mingyao Hong, Jiawei Li, Yaowei Wang, Zhi Wang, Shu-Tao Xia, Bin Chen
Main category: cs.CV
TL;DR: Splatwizard is a unified benchmark toolkit for evaluating 3D Gaussian Splatting compression models, addressing the lack of standardized evaluation tools in this rapidly growing field.
Details
Motivation: The rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack specific metrics to holistically assess unique characteristics like rendering speed, rate distortion trade-offs, memory efficiency, and geometric accuracy.
Method: Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques. It includes an integrated pipeline that automates calculation of key performance indicators including image-based quality metrics, chamfer distance of reconstructed mesh, rendering frame rates, and computational resource consumption.
Result: The authors introduce Splatwizard as a solution to the evaluation gap in 3DGS compression research, making the code publicly available at https://github.com/splatwizard/splatwizard.
Conclusion: Splatwizard addresses the critical need for standardized benchmarking in 3D Gaussian Splatting compression research by providing a comprehensive toolkit that enables fair comparison and evaluation of different compression methods across multiple performance dimensions.
Abstract: The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression tasks. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate-distortion trade-offs, memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework to implement new 3DGS compression models and utilize state-of-the-art techniques proposed by previous work. In addition, an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, chamfer distance of the reconstructed mesh, rendering frame rates, and computational resource consumption, is included in the framework. Code is available at https://github.com/splatwizard/splatwizard
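Of the reported indicators, the geometric one is worth unpacking: chamfer distance averages nearest-neighbor distances in both directions between two point sets. A brute-force NumPy version (real pipelines would use a KD-tree):

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric chamfer distance between point sets of shape (N, 3) / (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) squared dists
    return float(d2.min(1).mean() + d2.min(0).mean())

pred = np.random.rand(512, 3)
gt = pred + np.random.normal(scale=0.01, size=pred.shape)
print(chamfer_distance(pred, gt))  # small value for nearly identical clouds
```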
[201] UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning
Ankit Dhiman, Srinath R, Jaswanth Reddy, Lokesh R Boregowda, Venkatesh Babu Radhakrishnan
Main category: cs.CV
TL;DR: Unified framework for 3D instance segmentation that merges feature learning and label assignment into a single stage, using learnable embeddings in Gaussian primitives with stabilized boundary optimization.
Details
Motivation: Existing 3D instance segmentation methods suffer from inconsistent 2D instance labels across views, leading to poor 3D predictions. Current approaches use two-stage pipelines with contrastive learning or preprocessing, which are hyperparameter-sensitive and time-consuming.
Method: Proposes a unified framework with learnable feature embeddings in Gaussian primitives, decoded into instance labels via “Embedding-to-Label” process. Addresses boundary artifacts by applying hard-mining with a linear layer on rasterized embeddings before triplet loss calculation for training stability.
Result: Outperforms baselines qualitatively and quantitatively on ScanNet, Replica3D, and Messy-Rooms datasets, with reduced training time and improved performance.
Conclusion: The unified approach effectively integrates optimization steps, stabilizes boundary learning, and achieves superior 3D instance segmentation performance across multiple datasets.
Abstract: 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel “Embedding-to-Label” process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.
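The stabilized boundary objective can be pictured as follows: rasterized per-pixel embeddings pass through a linear layer before a standard triplet margin loss over hard-mined boundary samples. Dimensions and the margin below are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, proj_dim = 16, 16
project = nn.Linear(embed_dim, proj_dim)   # the stabilizing linear layer
triplet = nn.TripletMarginLoss(margin=0.5)

anchor = torch.randn(128, embed_dim)    # pixels inside an instance
positive = torch.randn(128, embed_dim)  # same instance, near the boundary
negative = torch.randn(128, embed_dim)  # neighboring instance across the edge

# Loss is computed in the projected space, not on raw embeddings.
loss = triplet(project(anchor), project(positive), project(negative))
loss.backward()
print(loss.item())
```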
[202] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation
Takeru Kusakabe, Yudai Hirose, Mashiho Mukaida, Satoshi Ono
Main category: cs.CV
TL;DR: The paper proposes a projection-based adversarial attack method for DNN-based monocular depth estimation models using physics-in-the-loop optimization to create adversarial examples that cause depth misestimations.
Details
Motivation: DNN-based monocular depth estimation models are vulnerable to adversarial attacks, threatening their reliability in practical applications. There's a critical need to validate this vulnerability and enhance robustness.
Method: A projection-based adversarial attack method that projects perturbation light onto target objects, using physics-in-the-loop optimization to evaluate solutions in actual environments and distributed covariance matrix adaptation evolution strategy.
Result: The proposed method successfully created adversarial examples that lead to depth misestimations, causing parts of objects to disappear from the target scene.
Conclusion: The study demonstrates the vulnerability of DNN-based MDE models to physical adversarial attacks and provides a method to validate and potentially improve their robustness.
Abstract: Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization – evaluating candidate solutions in actual environments to account for device specifications and disturbances – and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.
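Stripped to its core, physics-in-the-loop search is a black-box loop over projector patterns scored on real captures. The sketch below uses a simple (1+1)-ES rather than the paper's distributed CMA-ES, with hypothetical `project_and_capture` and `depth_error` hooks.

```python
import numpy as np

def pitl_attack(project_and_capture, depth_error, dim: int = 64,
                sigma: float = 0.1, iters: int = 200):
    """(1+1)-ES stand-in for physics-in-the-loop black-box search.

    `project_and_capture` projects a perturbation pattern and photographs
    the scene; `depth_error` scores how badly the MDE model misestimates
    depth. Both are hypothetical hooks supplied by the caller."""
    best = np.zeros(dim)                       # flattened projector pattern
    best_score = depth_error(project_and_capture(best))
    for _ in range(iters):
        cand = np.clip(best + sigma * np.random.randn(dim), 0.0, 1.0)
        score = depth_error(project_and_capture(cand))
        if score > best_score:                 # maximize depth misestimation
            best, best_score = cand, score
    return best, best_score

# Simulated stand-ins so the sketch runs without projector hardware.
target = np.random.rand(64)
pattern, err = pitl_attack(lambda p: p,
                           lambda img: -np.abs(img - target).sum())
print(round(err, 3))
```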
[203] Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
Jason Armitage, Rico Sennrich
Main category: cs.CV
TL;DR: A method using regret minimization with derivative-free optimization improves multivariate mutual information estimates, enabling 2D-trained cross-modal systems to adapt online to 3D scenes via in-scene camera control without pretraining/finetuning.
Details
Motivation: Cross-modal systems trained on 2D visual inputs face a dimensional shift when processing 3D scenes, requiring a control module for in-scene cameras to bridge this gap while handling object occlusions and feature differentiation.
Method: The method improves multivariate mutual information estimates through regret minimization with derivative-free optimization, enabling off-the-shelf 2D-trained cross-modal systems to adapt online via in-scene camera control that learns directly from noisy vision-language model outputs.
Result: The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without requiring pretraining or finetuning, effectively handling object occlusions and feature differentiation.
Conclusion: The pairing of expressive mutual information measures with value-based optimization enables effective in-scene camera control for adapting 2D-trained cross-modal systems to 3D environments, overcoming the dimensional shift challenge without additional training.
Abstract: Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
[204] Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training
Andrew Tinits, Stephen Mann
Main category: cs.CV
TL;DR: The paper presents a method to apply nonlinear functions in Noise2Noise training without significant bias, enabling better denoising of high dynamic range images from Monte Carlo rendering.
Details
Motivation: Noise2Noise training requires noisy target images instead of clean ones, but nonlinear functions applied to noisy targets introduce bias. This limits preprocessing options, especially for HDR images where tone mapping is needed to handle outliers.
Method: Developed a theoretical framework to analyze nonlinear function effects on Noise2Noise training. Identified a class of nonlinear functions with minimal bias when combined with specific loss functions. Applied this to HDR image denoising using tone mapping functions.
Result: The method enables Noise2Noise training with tone mapping for Monte Carlo rendered HDR images. Results approach those of original implementations trained with high-sample count reference images, using only noisy training data.
Conclusion: Certain nonlinear functions can be applied to noisy targets in Noise2Noise training without significant bias, expanding its applicability to HDR image denoising and other domains requiring nonlinear preprocessing.
Abstract: The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias. We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.
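To make the training setup concrete, here is a minimal sketch of a Noise2Noise loss computed on tone-mapped HDR outputs, assuming a Reinhard-style tone map x/(1+x). Which (loss, tone map) combinations keep the estimator nearly unbiased is precisely what the paper characterizes, so treat the specific pairing below as illustrative.

```python
import torch

def tone_map(x: torch.Tensor) -> torch.Tensor:
    # Reinhard-style range compression, commonly used for HDR training;
    # whether a given (loss, tone map) pair keeps Noise2Noise unbiased is
    # the paper's contribution -- this only shows the setup's shape.
    return x / (1.0 + x)

def n2n_tonemapped_loss(pred_hdr, noisy_target_hdr):
    """Noise2Noise loss on tone-mapped HDR images (illustrative)."""
    return torch.nn.functional.l1_loss(tone_map(pred_hdr),
                                       tone_map(noisy_target_hdr))

# Toy usage: a random "denoiser" output and an independent noisy target.
pred = torch.rand(1, 3, 64, 64) * 10.0            # HDR-range prediction
target = (pred + 0.5 * torch.randn_like(pred)).clamp(min=0.0)
print(float(n2n_tonemapped_loss(pred, target)))
```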
[205] CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture
Md Ahmed Al Muzaddid, Jordan A. James, William J. Beksi
Main category: cs.CV
TL;DR: CropTrack is a novel multiple-object tracking framework for agricultural environments that combines appearance and motion information to address challenges like repetitive patterns, similar appearances, and frequent occlusions that plague existing motion-based trackers.
Details
Motivation: Agricultural MOT faces unique challenges: repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Current trackers rely mainly on motion and struggle with identity preservation during strong occlusions. Appearance-based association is difficult due to high similarity between objects.
Method: CropTrack integrates appearance and motion information through three key components: 1) reranking-enhanced appearance association, 2) one-to-many association with appearance-based conflict resolution strategy, and 3) exponential moving average prototype feature bank to improve appearance-based association.
Result: CropTrack demonstrates consistent identity preservation on agricultural MOT datasets, outperforming traditional motion-based tracking methods. It achieves significant gains in identification F1 and association accuracy scores with lower identity switches compared to state-of-the-art methods.
Conclusion: The proposed CropTrack framework successfully addresses agricultural MOT challenges by effectively combining appearance and motion information, leading to improved identity preservation and association accuracy in challenging agricultural environments.
Abstract: Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
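As a rough illustration of the prototype feature bank component, the sketch below keeps one exponential-moving-average appearance prototype per track and scores new detections by cosine similarity. The momentum value and the interface are assumptions, not CropTrack's released code.

```python
import torch
import torch.nn.functional as F

class EMAPrototypeBank:
    """One running appearance prototype per track ID, updated by EMA
    (a sketch of the idea, with assumed hyperparameters)."""
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.prototypes: dict[int, torch.Tensor] = {}

    def update(self, track_id: int, feat: torch.Tensor) -> None:
        feat = F.normalize(feat, dim=0)
        if track_id not in self.prototypes:
            self.prototypes[track_id] = feat
        else:
            p = self.momentum * self.prototypes[track_id] \
                + (1 - self.momentum) * feat
            self.prototypes[track_id] = F.normalize(p, dim=0)

    def similarity(self, feat: torch.Tensor) -> dict[int, float]:
        # Cosine similarity of a detection feature to every track prototype.
        feat = F.normalize(feat, dim=0)
        return {tid: float(p @ feat) for tid, p in self.prototypes.items()}

bank = EMAPrototypeBank()
bank.update(1, torch.randn(128))
bank.update(1, torch.randn(128))
print(bank.similarity(torch.randn(128)))
```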
[206] VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
Xunyi Zhao, Gengze Zhou, Qi Wu
Main category: cs.CV
TL;DR: MLLMs show poor context awareness and spatial reasoning in embodied navigation tasks, despite following instructions well. A new benchmark VLN-MME reveals these limitations through systematic evaluation.
Details
Motivation: To investigate MLLMs' potential as embodied agents for vision-language navigation tasks, which require multi-round dialogue, spatial reasoning, and sequential action prediction - capabilities not yet fully explored.
Method: Introduces VLN-MME, a unified evaluation framework that bridges traditional navigation datasets into a standardized benchmark. Uses modular design for structured comparisons and component-level ablations across MLLM architectures, agent designs, and navigation tasks.
Result: Enhancing baseline agents with Chain-of-Thought reasoning and self-reflection unexpectedly decreases performance. MLLMs exhibit poor context awareness in embodied navigation - they can follow instructions and structure output, but have low 3D spatial reasoning fidelity.
Conclusion: VLN-MME provides groundwork for systematic evaluation of MLLMs in embodied navigation, revealing limitations in sequential decision-making. Findings offer crucial guidance for MLLM post-training as embodied agents.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework that probes MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.
[207] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
Main category: cs.CV
TL;DR: ShowUI-π is a flow-based generative model for GUI agents that enables both discrete clicks and continuous drag actions, addressing limitations of existing GUI agents that can’t handle closed-loop trajectories like dragging progress bars.
Details
Motivation: Existing GUI agents only support discrete click predictions (x,y coordinates), which prevents them from performing free-form, closed-loop trajectories like dragging progress bars that require continuous perception and adjustment. There's a need for more dexterous manipulation capabilities in digital environments.
Method: 1) Unified Discrete-Continuous Actions model integrating both clicks and drags; 2) Flow-based Action Generation using a lightweight action expert to predict incremental cursor adjustments from continuous visual observations; 3) Created ScreenDrag benchmark with 20K manually collected drag trajectories across five domains (PowerPoint, Adobe Premiere Pro, etc.)
Result: ShowUI-π achieves 26.98 score on ScreenDrag benchmark, outperforming proprietary GUI agents (Operator: 13.27, Gemini-2.5-CUA: 22.18) with only 450M parameters, demonstrating both task difficulty and approach effectiveness.
Conclusion: ShowUI-π advances GUI agents toward human-like dexterous control in digital environments by enabling continuous drag actions alongside discrete clicks, with open-source code available to support further research.
Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents’ drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world. The code is available at https://github.com/showlab/showui-pi.
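A minimal sketch of the flow-based action generation idea: a small network predicts a velocity field over cursor deltas conditioned on an observation embedding, and sampling Euler-integrates that field from Gaussian noise. The architecture, dimensions, and step count below are placeholders, not the ShowUI-$π$ implementation.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Hypothetical lightweight action expert: predicts a velocity over
    2D cursor deltas, conditioned on a visual observation embedding and
    the flow time tau (a generic flow-matching sketch)."""
    def __init__(self, obs_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, obs, action, tau):
        return self.net(torch.cat([obs, action, tau], dim=-1))

@torch.no_grad()
def sample_cursor_delta(expert, obs, steps=8):
    """Euler-integrate the learned flow from noise to a cursor delta."""
    a = torch.randn(obs.shape[0], 2)  # start from Gaussian noise
    for i in range(steps):
        tau = torch.full((obs.shape[0], 1), i / steps)
        a = a + expert(obs, a, tau) / steps
    return a  # incremental (dx, dy) cursor adjustment

expert = ActionExpert()
print(sample_cursor_delta(expert, torch.randn(1, 256)))
```

Integrating a velocity field step by step is what allows continuous, on-the-fly adjustment of the cursor, in contrast to a one-shot (x, y) click prediction.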
[208] OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation
Meng Lan, Lefei Zhang, Xiaomeng Li
Main category: cs.CV
TL;DR: OFL-SAM2 is a prompt-free adaptation of SAM2 for medical image segmentation that uses limited annotated data to train a lightweight mapping network, eliminating the need for manual prompts while achieving state-of-the-art performance.
Details
Motivation: SAM2 shows promise for medical image segmentation but requires extensive annotated data and manual prompts, which are labor-intensive and require medical expertise. There's a need for label-efficient adaptation that eliminates prompt requirements.
Method: OFL-SAM2 uses a lightweight mapping network trained with limited annotated samples to capture medical knowledge and transform generic features into target features. It includes: (1) an online few-shot learner for training the mapping network, and (2) an adaptive fusion module that dynamically integrates target features with SAM2’s memory-attention features.
Result: Extensive experiments on three diverse medical image segmentation datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
Conclusion: OFL-SAM2 successfully adapts SAM2 to medical image segmentation in a prompt-free, label-efficient manner by leveraging limited annotated data to train a mapping network that provides additional discriminative target representations, eliminating the need for manual prompts while maintaining strong performance.
Abstract: The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter update during inference, enhancing the model’s generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.
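The adaptive fusion module can be pictured as a learned gate between the two feature streams. The sigmoid-gate form below is an assumption; the paper only specifies that target features and frozen SAM2 memory-attention features are integrated dynamically.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of a gated fusion of few-shot target features with frozen
    SAM2 memory-attention features (the gating design is assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, target_feat: torch.Tensor, memory_feat: torch.Tensor):
        # Per-channel gate decides how much to trust each stream.
        g = self.gate(torch.cat([target_feat, memory_feat], dim=-1))
        return g * target_feat + (1.0 - g) * memory_feat

fuse = AdaptiveFusion(dim=256)
out = fuse(torch.randn(4, 256), torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])
```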
[209] FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation
Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao
Main category: cs.CV
TL;DR: FinMMDocR is a bilingual multimodal benchmark for evaluating MLLMs on real-world financial numerical reasoning, featuring scenario-aware problems, complex financial documents, and multi-step computations.
Details
Motivation: Existing benchmarks lack the complexity of real-world financial reasoning. There's a need for a benchmark that incorporates financial scenarios, diverse document types, and multi-step reasoning to properly evaluate MLLMs on practical financial tasks.
Method: Created a benchmark with 1,200 expert-annotated problems incorporating 12 types of implicit financial scenarios, 837 Chinese/English documents spanning 9 document types averaging 50.8 pages, and problems requiring 11-step reasoning on average (5.3 extraction + 5.7 calculation steps).
Result: The best-performing MLLM achieves only 58.0% accuracy, showing significant room for improvement. Different RAG methods show significant performance variations, indicating the benchmark’s ability to differentiate model capabilities.
Conclusion: FinMMDocR is a challenging benchmark that can drive improvements in MLLMs and reasoning-enhanced methods for complex multimodal reasoning in real-world financial scenarios.
Abstract: We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
[210] Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions
Itallo Patrick Castro Alves Da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza, Baldoino Fonseca dos Santos Neto, Marcio de Medeiros Ribeiro
Main category: cs.CV
TL;DR: Compression techniques (quantization, pruning, clustering) can preserve or even improve model robustness to natural corruption, with customized combinations offering best multi-objective results for efficient deployment.
Details
Motivation: Model compression is essential for deploying computer vision on resource-constrained devices, but it may affect robustness under natural corruption, requiring comprehensive evaluation of compression techniques' impact on robustness.
Method: Evaluated quantization, pruning, and weight clustering individually and in combination on CNN architectures (ResNet-50, VGG-19, MobileNetV2) using CIFAR-10-C and CIFAR-100-C datasets with multiobjective assessment.
Result: Certain compression strategies preserve or improve robustness, especially on complex architectures; customized technique combinations produce beneficial multi-objective results balancing robustness, accuracy, and compression ratio.
Conclusion: Compression techniques can maintain or enhance robustness to corruption, with tailored combinations providing optimal trade-offs for robust and efficient model deployment in real-world corrupted environments.
Abstract: Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques - quantization, pruning, and weight clustering - applied individually and in combination to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR-100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multiobjective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.
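For readers who want to reproduce the general setup, the sketch below applies two of the three evaluated techniques (magnitude pruning and post-training dynamic quantization) to MobileNetV2 in PyTorch. The sparsity level and quantized layer set are illustrative, and the weight-clustering step is omitted.

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import mobilenet_v2

# Illustrative settings; the paper's exact configurations may differ.
model = mobilenet_v2(num_classes=10)

# 1) Unstructured L1 magnitude pruning of conv weights.
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")  # make the pruning permanent

# 2) Post-training dynamic quantization of linear layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Robustness would then be measured by evaluating `quantized` on the
# CIFAR-10-C / CIFAR-100-C corruption splits.
```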
[211] Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection
Bartłomiej Olber, Jakub Winter, Paweł Wawrzyński, Andrii Gamalii, Daniel Górniak, Marcin Łojek, Robert Nowak, Krystian Radlak
Main category: cs.CV
TL;DR: A novel lidar domain adaptation method using neuron activation patterns that achieves SOTA performance by annotating only a small, diverse subset of target domain samples with minimal annotation budget.
Details
Motivation: 3D object detectors for autonomous vehicles struggle with domain generalization (e.g., US-trained models performing poorly in Asia/Europe), creating a need for efficient domain adaptation methods that don't require extensive re-annotation.
Method: Proposes a lidar domain adaptation approach based on neuron activation patterns, selecting a small but representative and diverse subset of target domain samples for annotation. Combines with post-training techniques inspired by continual learning to prevent weight drift from the original model.
Result: Empirical evaluation shows the proposed approach outperforms both linear probing and state-of-the-art domain adaptation techniques while requiring very small annotation budget.
Conclusion: State-of-the-art domain adaptation performance can be achieved through neuron activation pattern-based sample selection and minimal annotation, combined with continual learning-inspired techniques to maintain model stability.
Abstract: 3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.
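The paper's selection rule is not detailed in this summary, but a diversity-aware pick from activation space can be sketched as clustering per-sample activation patterns and annotating the sample closest to each centroid; the use of k-means here is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_for_annotation(activations: np.ndarray, budget: int) -> np.ndarray:
    """Cluster target-domain samples in activation space and pick the
    sample nearest each centroid for labeling (an assumed selection
    rule; the paper only specifies activation-pattern-driven diversity)."""
    km = KMeans(n_clusters=budget, n_init=10).fit(activations)
    chosen = []
    for c in km.cluster_centers_:
        chosen.append(int(np.argmin(np.linalg.norm(activations - c, axis=1))))
    return np.array(sorted(set(chosen)))

acts = np.random.rand(500, 64)          # per-sample activation patterns
print(select_for_annotation(acts, 10))  # indices to send for annotation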
[212] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT
Xinran Gong, Gorkem Durak, Halil Ertugrul Aktas, Vedat Cicek, Jinkui Hao, Ulas Bagci, Nilay S. Shah, Bo Zhou
Main category: cs.CV
TL;DR: ProDM is a diffusion model that removes motion artifacts from non-gated chest CT scans to improve coronary artery calcium scoring accuracy.
Details
Motivation: CAC scoring from chest CT is important for cardiovascular risk assessment, but non-gated CT scans suffer from motion artifacts that reduce accuracy. ECG-gated cardiac CTs are better but not widely available due to cost and technical requirements.
Method: ProDM uses three key components: 1) CAC motion simulation engine to create synthetic non-gated CTs from gated CTs for supervised training, 2) property-aware learning with calcium-specific priors and consistency loss, 3) progressive correction scheme across diffusion steps.
Result: ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared to baselines. Reader study confirms it suppresses motion artifacts and improves clinical usability.
Conclusion: ProDM demonstrates the potential of progressive, property-aware generative frameworks for reliable CAC quantification from routine non-gated chest CT imaging.
Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is often affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.
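A differentiable calcium consistency loss can be sketched by comparing soft calcium masks, using a sigmoid relaxation of the standard 130 HU Agatston threshold; the exact loss used in ProDM may differ, so treat the form and constants below as illustrative.

```python
import torch

def calcium_consistency_loss(pred_hu, target_hu, threshold=130.0, tau=10.0):
    """Compare soft calcium masks between corrected and reference CT
    volumes. The sigmoid relaxes the hard 130 HU Agatston threshold so
    the mask comparison stays differentiable (illustrative form)."""
    soft_mask_pred = torch.sigmoid((pred_hu - threshold) / tau)
    soft_mask_tgt = torch.sigmoid((target_hu - threshold) / tau)
    return torch.nn.functional.l1_loss(soft_mask_pred, soft_mask_tgt)

pred = torch.randn(1, 1, 32, 32) * 200 + 50   # toy HU values
tgt = torch.randn(1, 1, 32, 32) * 200 + 50
print(float(calcium_consistency_loss(pred, tgt)))
```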
[213] VIPER: Process-aware Evaluation for Generative Video Reasoning
Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu
Main category: cs.CV
TL;DR: VIPER benchmark introduces process-aware evaluation for video reasoning models, revealing only 20% process-outcome consistency and significant outcome-hacking in current models.
Details
Motivation: Existing video reasoning evaluations rely on single-frame assessments, leading to outcome-hacking where models reach correct conclusions through erroneous processes, necessitating process-aware evaluation.
Method: Proposes VIPER benchmark with 16 tasks across 6 reasoning domains, and introduces Process-outcome Consistency (POC@r) metric using VLM-as-Judge with hierarchical rubric to evaluate intermediate steps and final results.
Result: State-of-the-art video models achieve only about 20% POC@1.0, show significant outcome-hacking, and reveal substantial gap between current video generation and true generalized visual reasoning.
Conclusion: Process-aware evaluation is crucial for assessing video reasoning capabilities, current models have limited process-outcome consistency, and VIPER benchmark will be publicly released to advance the field.
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
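The precise rubric aggregation behind POC@r is defined in the paper; a plausible reading, sketched below, counts a sample as consistent only when its judge-assigned process score reaches the threshold r and its final outcome is correct.

```python
def poc_at_r(samples, r: float) -> float:
    """Sketch of a Process-outcome Consistency metric. Assumes each
    sample has a judge-assigned process validity score in [0, 1] and a
    boolean outcome correctness; the paper's exact rubric aggregation
    may differ."""
    consistent = sum(1 for s in samples
                     if s["process_score"] >= r and s["outcome_correct"])
    return consistent / max(len(samples), 1)

runs = [
    {"process_score": 1.0, "outcome_correct": True},   # genuinely solved
    {"process_score": 0.4, "outcome_correct": True},   # outcome-hacking
    {"process_score": 0.9, "outcome_correct": False},
]
print(poc_at_r(runs, r=1.0))  # only the first run counts -> 0.33...
```

Under this reading, outcome-hacking shows up exactly as the gap between plain accuracy (2/3 here) and POC@1.0 (1/3).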
[214] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
Main category: cs.CV
TL;DR: DarkEQA is a benchmark for evaluating Vision Language Models’ perceptual capabilities under realistic low-light conditions, revealing their limitations in dark environments.
Details
Motivation: Current VLM benchmarks focus on ideal lighting conditions, but real-world embodied agents need to operate 24/7 in various lighting conditions including low-light environments. There's a significant gap in evaluating VLMs' robustness to visual degradations like darkness, which is crucial for practical deployment.
Method: Created DarkEQA benchmark with physics-based low-light simulation in linear RAW space, modeling illumination drop and sensor noise followed by ISP-inspired rendering. Evaluates question answering from egocentric observations under controlled degradations to isolate perception bottlenecks.
Result: Systematic evaluation reveals significant limitations of state-of-the-art VLMs and Low-Light Image Enhancement models when operating under challenging low-light conditions.
Conclusion: DarkEQA addresses a critical gap in VLM evaluation for embodied agents, providing a physically realistic benchmark that exposes robustness issues in low-light conditions, enabling attributable analysis and future improvements.
Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments–a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs’ limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
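The degradation model can be approximated in a few lines: undo display gamma to get roughly linear intensities, scale illumination, add Poisson shot noise and Gaussian read noise, then re-apply gamma as a crude ISP. All constants below are assumptions; DarkEQA's actual pipeline operates on RAW data with a fuller sensor model.

```python
import numpy as np

def simulate_low_light(srgb, light_factor=0.05, gain=20.0, read_sigma=0.01):
    """Approximate physics-based low-light corruption (illustrative)."""
    linear = np.clip(srgb, 0, 1) ** 2.2                 # inverse display gamma
    dark = linear * light_factor                        # illumination drop
    photons = np.random.poisson(dark * 255.0) / 255.0   # shot noise
    noisy = photons + np.random.normal(0, read_sigma, srgb.shape)  # read noise
    restored = np.clip(noisy * gain, 0, 1)              # digital gain
    return restored ** (1 / 2.2)                        # ISP-style gamma

img = np.random.rand(64, 64, 3)
print(simulate_low_light(img).shape)
```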
[215] Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification
Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
Main category: cs.CV
TL;DR: This paper introduces RFL-ReID, a new lifelong person re-identification task that eliminates the need for re-indexing historical gallery images, addressing privacy and computational cost issues while maintaining feature compatibility between old and new models.
Details
Motivation: Traditional lifelong person ReID methods require re-indexing historical gallery images after each model update, which raises privacy concerns and incurs high computational costs for large-scale galleries. This creates feature incompatibility between query features from updated models and gallery features from older models.
Method: The paper proposes a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates gallery features extracted by old models to perform efficient lifelong ReID in a compatible manner, eliminating the need for re-indexing.
Result: The proposed Bi-C2R method achieves leading performance on both the new RFL-ReID task and traditional L-ReID task, as demonstrated through theoretical analysis and extensive experiments on multiple benchmarks.
Conclusion: RFL-ReID is a more challenging but practical lifelong person re-identification task that addresses privacy and computational concerns. The Bi-C2R framework successfully enables compatible feature representation without re-indexing, making lifelong ReID more feasible in real-world applications.
Abstract: Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as “re-indexing”. However, historical gallery data typically cannot be saved directly due to data privacy issues, and re-indexing large-scale gallery images incurs high costs. As a result, this inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by the model before the update, greatly impairing re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
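The re-indexing-free idea can be illustrated by learning a light transform that maps stored old-model gallery features toward the new model's space, so gallery images never need to be re-processed. The MLP and cosine objective below are assumptions; Bi-C2R's actual bidirectional scheme is more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Instead of re-running the new backbone on stored gallery *images*
# (unavailable for privacy reasons), learn a transform over stored
# old-model gallery *features* (illustrative setup with toy targets).
transform = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(transform.parameters(), lr=1e-3)

old_feats = torch.randn(256, 512)                          # stored gallery features
new_feats = old_feats @ torch.randn(512, 512) * 0.05 + old_feats  # toy targets

for _ in range(100):
    pred = transform(old_feats)
    loss = 1 - F.cosine_similarity(pred, new_feats, dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```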
[216] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
Main category: cs.CV
TL;DR: FoundationSLAM is a learning-based monocular dense SLAM system that combines flow estimation with geometric reasoning using foundation depth models for accurate tracking and mapping.
Details
Motivation: Previous flow-based SLAM approaches lack geometric consistency, which is essential for accurate and robust tracking and mapping. The authors aim to bridge the gap between flow estimation and geometric reasoning.
Method: 1) Hybrid Flow Network for geometry-aware correspondences; 2) Bi-Consistent Bundle Adjustment Layer for joint optimization of keyframe pose and depth; 3) Reliability-Aware Refinement mechanism for adaptive flow updates based on region reliability.
Result: Superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, real-time performance at 18 FPS, and strong generalization to various scenarios.
Conclusion: FoundationSLAM successfully addresses geometric consistency issues in flow-based SLAM, demonstrating practical applicability through real-time performance and robust generalization across diverse environments.
Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
[217] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
Main category: cs.CV
TL;DR: A novel self-bootstrapping framework reframes visual dubbing from ill-posed inpainting to well-conditioned video-to-video editing using Diffusion Transformers, achieving superior lip sync and identity preservation.
Details
Motivation: Audio-driven visual dubbing lacks ideal training data (paired videos with identical visual conditions but different lip movements). Existing mask-based inpainting methods force models to simultaneously hallucinate missing content and sync lips, causing visual artifacts, identity drift, and poor synchronization.
Method: Proposes a self-bootstrapping framework with two Diffusion Transformers: first as a data generator to synthesize ideal training data (lip-altered companion videos), then as an audio-driven editor trained end-to-end on these aligned video pairs. Includes timestep-adaptive multi-phase learning strategy to disentangle conflicting editing objectives across diffusion timesteps.
Result: Achieves highly accurate lip sync, faithful identity preservation, and exceptional robustness in challenging in-the-wild scenarios. The method provides complete visual context for precise audio-driven lip modifications.
Conclusion: Reframing visual dubbing as a well-conditioned video-to-video editing problem with complete frame-aligned input conditioning enables superior performance. Introduces ContextDubBench for robust evaluation in diverse practical scenarios.
Abstract: Audio-driven visual dubbing aims to synchronize a video’s lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject’s lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
[218] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang
Main category: cs.CV
TL;DR: SpaceTimePilot is a video diffusion model that independently controls camera viewpoint and motion sequence for continuous space-time exploration in video generation.
Details
Motivation: Current video generation models lack explicit control over both spatial (camera viewpoint) and temporal (motion sequence) dimensions independently, limiting creative exploration of dynamic scenes.
Method: Introduces animation time-embedding for explicit motion control, temporal-warping training using multi-view datasets, improved camera-conditioning, and CamxTime dataset for full space-time coverage.
Result: Demonstrates clear space-time disentanglement and strong performance on both real-world and synthetic data compared to prior work.
Conclusion: SpaceTimePilot successfully achieves independent control over space and time in video generation, enabling continuous exploration of dynamic scenes through novel training strategies and datasets.
Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
[219] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion
Dian Shao, Mingfei Shi, Like Liu
Main category: cs.CV
TL;DR: FineTec is a unified framework for fine-grained action recognition from temporally corrupted skeleton sequences that uses context-aware completion, spatial decomposition with targeted perturbation, physics-driven acceleration estimation, and GCN-based recognition.
Details
Motivation: Real-world skeleton sequences often have substantial missing data from online pose estimation, and existing methods struggle to recover temporal dynamics and fine-grained spatial structures, losing subtle motion cues needed to distinguish similar actions.
Method: 1) Context-aware completion with diverse temporal masking to restore base skeleton sequences; 2) Spatial decomposition into 5 semantic regions divided into dynamic/static subgroups with targeted perturbation for augmentation; 3) Physics-driven estimation using Lagrangian dynamics to compute joint accelerations; 4) GCN-based recognition head processing both fused skeleton positions and accelerations.
Result: Significantly outperforms previous methods on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks under various corruption levels. Achieves 89.1% top-1 accuracy on Gym99-severe and 78.1% on Gym288-severe.
Conclusion: FineTec effectively addresses fine-grained action recognition under temporal corruption by combining skeleton restoration, spatial decomposition, physics-driven dynamics estimation, and multi-modal fusion, demonstrating robustness and generalizability across different datasets and corruption levels.
Abstract: Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets could be found at https://smartdianlab.github.io/projects-FineTec/.
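FineTec derives accelerations via Lagrangian dynamics; as a simpler stand-in that shows the shape of the added modality, the sketch below estimates per-joint accelerations with a central second difference over a (T, J, 3) skeleton sequence.

```python
import numpy as np

def joint_accelerations(positions: np.ndarray, dt: float = 1.0 / 30.0):
    """Central finite-difference acceleration per joint from a skeleton
    sequence of shape (T, J, 3). A stand-in for the paper's
    Lagrangian-dynamics estimate, illustrating the extra 'acceleration
    sequence' fed to the recognition head."""
    acc = np.zeros_like(positions)
    acc[1:-1] = (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2
    return acc

seq = np.cumsum(np.random.randn(16, 25, 3) * 0.01, axis=0)  # toy motion
print(joint_accelerations(seq).shape)  # (16, 25, 3)
```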
[220] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images, eliminating the need for per-scene optimization or pose estimation.
Details
Motivation: Prior 3D editing methods require per-scene optimization, which is slow and computationally expensive. There's a need for fast, feed-forward 3D scene editing that can handle unposed, view-inconsistent images without optimization.
Method: Edit3r uses: (1) SAM2-based recoloring strategy to generate cross-view-consistent supervision for training, (2) asymmetric input strategy pairing recolored reference views with raw auxiliary views to fuse disparate observations, and (3) direct prediction of instruction-aligned 3D edits in a single forward pass.
Result: Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed. It effectively handles images edited by 2D methods like InstructPix2Pix despite not being trained on them.
Conclusion: Edit3r enables fast, photorealistic 3D scene editing without optimization or pose estimation, making it promising for real-time 3D editing applications. The authors also introduce DL3DV-Edit-Bench for large-scale quantitative evaluation.
Abstract: We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
[221] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
Main category: cs.CV
TL;DR: GaMO is a geometry-aware multi-view outpainting framework that reformulates sparse-view 3D reconstruction by expanding field of view from existing camera poses instead of generating new viewpoints, achieving state-of-the-art results with 25× speedup.
Details
Motivation: Current diffusion-based methods for sparse-view 3D reconstruction have three critical limitations: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines.
Method: GaMO reformulates sparse-view reconstruction through multi-view outpainting, expanding field of view from existing camera poses rather than generating new viewpoints. It employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training.
Result: Extensive experiments on Replica and ScanNet++ show state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving 25× speedup over SOTA diffusion-based methods with processing time under 10 minutes.
Conclusion: GaMO provides an effective solution to sparse-view 3D reconstruction by addressing geometric consistency and computational efficiency through multi-view outpainting, demonstrating superior performance and practical efficiency.
Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
[222] Matching Semantically Similar Non-Identical Objects
Yusuke Marumo, Kazuhiko Kawamoto, Satomi Tanaka, Shigenobu Hirano, Hiroshi Kera
Main category: cs.CV
TL;DR: Proposes Semantic Enhancement Weighting (SEW) for pixel-level matching of non-identical but semantically similar objects by incorporating object detector semantic information into sparse feature matching.
Details
Motivation: Real-world objects are often similar but not identical (different dog breeds, car models, flower colors). Existing feature matching methods focus on identical objects from different viewpoints, but need extension to handle semantically similar non-identical objects.
Method: Semantic Enhancement Weighting (SEW) - a weighting scheme that incorporates semantic information from object detectors into existing sparse feature matching methods to enable matching of semantically similar objects.
Result: Successful pixel-level matching between non-identical objects in various scenarios: in-class design variations, class discrepancies, and domain shifts (photo vs. drawing, image corruptions).
Conclusion: SEW effectively extends feature matching capabilities from identical objects to semantically similar non-identical objects, addressing real-world object matching challenges across variations and domains.
Abstract: Not identical but similar objects are ubiquitous in our world, ranging from four-legged animals such as dogs and cats to cars of different models and flowers of various colors. This study addresses a novel task of matching such non-identical objects at the pixel level. We propose a weighting scheme of descriptors, Semantic Enhancement Weighting (SEW), that incorporates semantic information from object detectors into existing sparse feature matching methods, extending their targets from identical objects captured from different perspectives to semantically similar objects. The experiments show successful matching between non-identical objects in various cases, including in-class design variations, class discrepancy, and domain shifts (e.g., photo vs. drawing and image corruptions). The code is available at https://github.com/Circ-Leaf/NIOM .
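The core of SEW, reweighting descriptor matches by detector semantics, can be sketched as boosting similarity scores between keypoints assigned the same class; the multiplicative form and boost factor below are assumptions rather than the paper's exact weighting.

```python
import numpy as np

def semantic_enhancement_weighting(scores, kpt_labels_a, kpt_labels_b,
                                   boost=2.0):
    """Up-weight descriptor match scores between keypoints that an
    object detector assigns the same semantic class (assumed form)."""
    scores = np.asarray(scores, dtype=float).copy()
    same = kpt_labels_a[:, None] == kpt_labels_b[None, :]
    scores[same] *= boost
    return scores

sim = np.random.rand(5, 6)                        # raw descriptor similarities
la = np.array(["dog", "dog", "bg", "dog", "bg"])  # detector class per keypoint
lb = np.array(["dog", "bg", "dog", "dog", "bg", "dog"])
print(semantic_enhancement_weighting(sim, la, lb).shape)
```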
[223] Reconstructing Hand-Held Objects in 3D from Images and Videos
Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik
Main category: cs.CV
TL;DR: A scalable method for reconstructing hand-held objects from monocular videos using 3D hand estimation and retrieval-augmented reconstruction with large vision/language models.
Details
Motivation: Hand-held objects are challenging to reconstruct from videos due to hand occlusion and limited visibility, but 3D hand estimation and the limited set of manipulanda provide strong anchors for reconstruction.
Method: Two-stage approach: (1) MCC-Hand-Object (MCC-HO) for single-frame joint hand-object reconstruction, (2) Retrieval-Augmented Reconstruction (RAR) using GPT-4(V) to retrieve matching 3D models from a text-to-3D generative model for temporal consistency.
Result: Achieves state-of-the-art performance on both lab and Internet image/video datasets for hand-held object reconstruction.
Conclusion: The approach successfully leverages 3D hand estimation and large vision/language models to overcome challenges in hand-held object reconstruction, providing temporally consistent results with unified geometry across frames.
Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from Internet videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the object, and (2) the set of manipulanda is small relative to all possible objects. With these insights in mind, we present a scalable paradigm for hand-held object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. In order to obtain the best performing single frame model, we first present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image and inferred 3D hand as inputs. Subsequently, we prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image(s); we call this alignment Retrieval-Augmented Reconstruction (RAR). RAR provides unified object geometry across all frames, and the result is rigidly aligned with both the input images and 3D MCC-HO observations in a temporally consistent manner. Experiments demonstrate that our approach achieves state-of-the-art performance on lab and Internet image/video datasets. We make our code and models available on the project website: https://janehwu.github.io/mcc-ho
[224] ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts
Samar Khanna, Medhanie Irgau, David B. Lobell, Stefano Ermon
Main category: cs.CV
TL;DR: ExPLoRA is a parameter-efficient technique that extends self-supervised pre-training to new domains by unfreezing 1-2 ViT blocks and using LoRA for other layers, achieving SOTA results on satellite imagery with <10% parameters.
Details
Motivation: To address the under-explored question of whether pre-trained foundation models can be efficiently adapted to new domains via self-supervised pre-training without supervised labels, particularly for domain shifts in vision tasks.
Method: Initialize ViT with pre-trained weights (DinoV2/MAE), continue unsupervised pre-training on new domain by unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA, then fine-tune only with LoRA for supervised learning.
Result: Achieves state-of-the-art results on satellite imagery, outperforming fully pre-trained and fine-tuned ViTs. Shows up to 8% improvement in linear probing top-1 accuracy using <10% parameters compared to prior fully-tuned approaches.
Conclusion: ExPLoRA effectively bridges domain gaps through efficient self-supervised pre-training, demonstrating superior transfer learning performance with minimal parameter overhead compared to full fine-tuning approaches.
Abstract: Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines such as PEFT. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/
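To make the selective-tuning recipe concrete, here is a minimal PyTorch sketch: freeze a pre-trained ViT, fully unfreeze its last blocks, and wrap the attention projections of the remaining blocks in LoRA adapters. The timm model name, rank, and choice of which layers receive LoRA are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import timm

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (B A x) * scale."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

model = timm.create_model("vit_base_patch16_224", pretrained=False)
for p in model.parameters():
    p.requires_grad = False

for blk in model.blocks[-2:]:   # fully unfreeze the last 1-2 transformer blocks
    for p in blk.parameters():
        p.requires_grad = True

for blk in model.blocks[:-2]:   # low-rank adapters on attention projections elsewhere
    blk.attn.qkv = LoRALinear(blk.attn.qkv)
    blk.attn.proj = LoRALinear(blk.attn.proj)
```

Extended self-supervised pre-training on the new domain would then update only these trainable parameters, keeping the parameter budget well under 10% of the backbone.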
[225] MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
Wenqian Ye, Bohan Liu, Guangtao Zheng, Di Wang, Yunsheng Ma, Xu Cao, Bolin Lai, James M. Rehg, Aidong Zhang
Main category: cs.CV
TL;DR: This paper introduces MM-SpuBench, a human-verified benchmark for evaluating spurious biases in Multimodal Large Language Models (MLLMs), revealing their persistent reliance on spurious correlations despite strong vision-language capabilities.
Details
Motivation: While spurious bias is a known robustness issue in classical ML, its presence and severity in Multimodal Large Language Models (MLLMs) remain poorly understood, despite MLLMs' strong joint vision-language understanding capabilities.
Method: The authors create MM-SpuBench, a comprehensive benchmark with human-verified image-class pairs annotated with core and spurious attributes, grounded in a taxonomy of nine distinct spurious correlation types. They evaluate state-of-the-art MLLMs using both standard accuracy and their proposed Conditional Generation Likelihood Advantage (CGLA) metric.
Result: The evaluation reveals persistent reliance on spurious correlations across both open-source and proprietary MLLMs, with difficulty in mitigating these biases even on the carefully constructed benchmark.
Conclusion: The work highlights the need for new technical approaches to mitigate spurious biases in MLLMs and provides a publicly available benchmark (MM-SpuBench) to inspire further research in this important area.
Abstract: Spurious bias, a tendency to exploit spurious correlations between superficial input attributes and prediction targets, has revealed a severe robustness pitfall in classical machine learning problems. Multimodal Large Language Models (MLLMs), which leverage pretrained vision and language models, have recently demonstrated strong capability in joint vision-language understanding. However, both the presence and severity of spurious biases in MLLMs remain poorly understood. In this work, we address this gap by analyzing the spurious biases in the multimodal setting and uncovering the specific inference-time data patterns that can manifest this problem. To support this analysis, we introduce MM-SpuBench, a comprehensive, human-verified benchmark dataset consisting of image-class pairs annotated with core and spurious attributes, grounded in our taxonomy of nine distinct types of spurious correlations. The benchmark is constructed using human-interpretable attribute information to capture a wide range of spurious patterns reflective of real-world knowledge. Leveraging this benchmark, we conduct a comprehensive evaluation of the state-of-the-art open-source and proprietary MLLMs with both standard accuracy and the proposed Conditional Generation Likelihood Advantage (CGLA). Our findings highlight the persistence of reliance on spurious correlations and the difficulty of mitigation on our benchmark. We hope this work can inspire new technical strides to mitigate these biases. Our benchmark is publicly available at https://huggingface.co/datasets/mmbench/MM-SpuBench.
[226] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models
Chang-Han Yeh, Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Ting-Hsuan Chen, Yu-Lun Liu
Main category: cs.CV
TL;DR: DiffIR2VR-Zero enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training, addressing temporal inconsistency through hierarchical latent warping and hybrid token merging.
Details
Motivation: Image diffusion models excel at restoration but cause temporal inconsistencies when applied to videos, while existing video restoration methods require extensive retraining for different degradation types, limiting their versatility.
Method: Two key innovations: 1) hierarchical latent warping strategy for consistency across keyframes and local frames, 2) hybrid token merging mechanism that adaptively combines optical flow and feature matching.
Result: Achieves superior temporal consistency across diverse datasets and degradation conditions, including challenging 8× super-resolution and severe noise, while maintaining the high-quality restoration of base diffusion models.
Conclusion: Provides a versatile zero-shot framework that works with any image restoration diffusion model for video enhancement without task-specific training or modifications.
Abstract: We present DiffIR2VR-Zero, a zero-shot framework that enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training. While image diffusion models have shown remarkable restoration capabilities, their direct application to video leads to temporal inconsistencies, and existing video restoration methods require extensive retraining for different degradation types. Our approach addresses these challenges through two key innovations: a hierarchical latent warping strategy that maintains consistency across both keyframes and local frames, and a hybrid token merging mechanism that adaptively combines optical flow and feature matching. Through extensive experiments, we demonstrate that our method not only maintains the high-quality restoration of base diffusion models but also achieves superior temporal consistency across diverse datasets and degradation conditions, including challenging scenarios like 8× super-resolution and severe noise. Importantly, our framework works with any image restoration diffusion model, providing a versatile solution for video enhancement without task-specific training or modifications. Project page: https://jimmycv07.github.io/DiffIR2VR_web/
[227] Explaining Object Detectors via Collective Contribution of Pixels
Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Main category: cs.CV
TL;DR: A game-theoretic method using Shapley values and interactions to explain object detectors by capturing both individual and collective pixel contributions, addressing limitations of existing methods that focus only on individual pixels.
Details
Motivation: Existing visual explanation methods for object detectors typically focus only on individual pixel contributions, neglecting the collective influences of multiple pixels working together. This oversight can lead to missing important compositional cues or capturing spurious correlations, limiting the reliability and interpretability of object detection explanations.
Method: Proposes a game-theoretic approach based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. The method provides explanations for both bounding box localization and class determination by highlighting regions crucial for detection decisions.
Result: Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods, providing better explanations for object detection decisions.
Conclusion: The proposed game-theoretic method successfully addresses the limitation of existing approaches by capturing collective pixel contributions, leading to more accurate and comprehensive visual explanations for object detectors, thereby enhancing their reliability and interpretability.
Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE
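As a rough illustration of the game-theoretic idea, the sketch below estimates per-region Shapley values for a detection score by Monte Carlo sampling of region orderings. The partition into regions, the baseline value, and the detector_score callable are assumptions for illustration; the paper's method additionally models interactions between regions, which this sketch omits.

```python
import numpy as np

def shapley_region_attribution(image, regions, detector_score,
                               baseline=0.0, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate of each region's contribution to a detection score.

    image: (H, W, C) array.
    regions: list of (H, W) boolean masks partitioning the image.
    detector_score: callable mapping a (partially masked) image to a scalar,
        e.g. class probability times IoU with a target box.
    """
    rng = np.random.default_rng(seed)
    n = len(regions)
    phi = np.zeros(n)
    for _ in range(n_samples):
        order = rng.permutation(n)              # random coalition ordering
        masked = np.full_like(image, baseline, dtype=float)
        prev = detector_score(masked)
        for idx in order:                       # reveal regions one at a time
            masked[regions[idx]] = image[regions[idx]]
            curr = detector_score(masked)
            phi[idx] += curr - prev             # marginal contribution
            prev = curr
    return phi / n_samples
```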
[228] INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: Inst-IT enhances Large Multimodal Models for instance-level understanding through explicit visual prompt instruction tuning, improving both fine-grained instance comprehension and general multimodal capabilities.
Details
Motivation: Existing LMMs struggle with instance-level understanding despite good holistic comprehension. Instance-level understanding is crucial for focusing on specific elements of interest, and research shows SOTA LMMs can perform well when given explicit visual cues.
Method: Inst-IT includes: 1) A benchmark to diagnose multimodal instance-level understanding, 2) A large-scale instruction-tuning dataset, and 3) A continuous instruction-tuning training paradigm to enhance spatial-temporal instance understanding capabilities of existing LMMs.
Result: Models enhanced by Inst-IT achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, while also showing significant improvements across various generic image and video understanding benchmarks.
Conclusion: The method not only boosts instance-level understanding but also strengthens overall capabilities of generic image and video comprehension, demonstrating that improving fine-grained understanding enhances general multimodal performance.
Abstract: Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding that requires a more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that the SOTA LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning for instance guidance. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, enhanced by Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our method not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
[229] Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction
Bohan Li, Jiajun Deng, Yasheng Sun, Xiaofeng Wang, Xin Jin, Wenjun Zeng
Main category: cs.CV
TL;DR: Hi-SOP introduces hierarchical context alignment to address feature misalignment in 3D semantic occupancy prediction, improving accuracy through disentangled geometric and temporal alignment.
Details
Motivation: Existing SOP methods suffer from misalignment issues where corresponding features at the same position across different frames have different semantic meanings during aggregation, leading to unreliable contextual fusion and unstable representation learning.
Method: Hi-SOP uses a hierarchical context alignment paradigm: (1) disentangles geometric and temporal context for separate alignment using depth confidence and camera pose priors, (2) globally aligns and composes transformed geometric and temporal volumes based on semantic consistency.
Result: Outperforms state-of-the-art methods for semantic scene completion on SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset.
Conclusion: The hierarchical context alignment approach effectively addresses feature misalignment in SOP, leading to more accurate 3D semantic occupancy prediction through reliable contextual fusion.
Abstract: Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment; the two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled geometric and temporal alignment, which leverage depth confidence and camera pose, respectively, as priors for relevant feature matching; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.
[230] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization
Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han
Main category: cs.CV
TL;DR: OnlineVPO is a novel preference learning framework for video diffusion models that uses video quality assessment models as feedback sources and introduces an online DPO algorithm for more effective and scalable video quality optimization.
Details
Motivation: Current video diffusion models suffer from degraded image quality and flickering artifacts. Existing preference learning methods for VDMs adopt image-domain routines without proper investigation into video-specific optimization, particularly regarding feedback sources and tuning methodologies.
Method: 1) Uses video quality assessment (VQA) models as feedback sources instead of image-level reward models to better align with human perception of video quality. 2) Introduces an online DPO algorithm with online preference generation and curriculum preference update designs for more effective optimization.
Result: Extensive experiments demonstrate that OnlineVPO is a simple yet effective and scalable preference learning algorithm for video diffusion models, addressing the modality gap issue and enabling optimization of higher-resolution, longer-duration videos.
Conclusion: OnlineVPO provides a tailored preference learning framework for VDMs that better addresses video-specific quality issues through modality-aligned feedback sources and more efficient online optimization methods.
Abstract: Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite their success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches have introduced preference learning to exploit human feedback to enhance the video generation. However, these methods primarily adopt the routine in the image domain without an in-depth investigation into video-specific preference optimization. In this paper, we reexamine the design of the video preference learning from two key aspects: feedback source and feedback tuning methodology, and present OnlineVPO, a more efficient preference learning framework tailored specifically for VDMs. On the feedback source, we found that the image-level reward model commonly used in existing methods fails to provide a human-aligned video preference signal due to the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy of humans to provide more modality-aligned feedback for VDMs. Regarding the preference tuning methodology, we introduce an online DPO algorithm tailored for VDMs. It not only enjoys the benefits of superior scalability in optimizing videos with higher resolution and longer duration compared with the existing method, but also mitigates the insufficient optimization issue caused by off-policy learning via online preference generation and curriculum preference update designs. Extensive experiments on the open-source video-diffusion model demonstrate OnlineVPO as a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.
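For readers unfamiliar with DPO, the core pairwise objective looks like the sketch below. In OnlineVPO, the (winner, loser) pairs would come from the policy's own fresh samples ranked by a VQA model; applying this to diffusion models requires a diffusion-specific likelihood surrogate (e.g. noise-prediction errors), which this generic sketch deliberately omits.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy: torch.Tensor, logp_l_policy: torch.Tensor,
             logp_w_ref: torch.Tensor, logp_l_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on (winner, loser) preference pairs.

    logp_* are log-likelihoods of the preferred (w) and rejected (l) samples
    under the trainable policy and a frozen reference model. beta controls
    how far the policy may drift from the reference.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -F.logsigmoid(margin).mean()
```

The "online" part of the paper's algorithm amounts to regenerating these pairs from the current policy during training, rather than reusing a fixed off-policy preference dataset.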
[231] EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, Nan Cao
Main category: cs.CV
TL;DR: EmotiCrafter: A model for continuous emotional image generation using text prompts and Valence-Arousal values to capture complex emotional nuances.
Details
Motivation: Existing emotional image generation methods rely on discrete emotion categories, which fail to capture complex and subtle emotional nuances. They also struggle to control specific image content based on text prompts.
Method: Introduces EmotiCrafter with a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, plus a specialized loss function to enhance emotion expression.
Result: The method effectively generates images representing specific emotions with desired content and outperforms existing techniques.
Conclusion: Proposes a new task (C-EICG) and presents EmotiCrafter as a solution for continuous emotional image generation that better captures emotional nuances while maintaining content control.
Abstract: Recent research shows that emotions can enhance users’ cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.
[232] MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding
Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, Ivan Oseledets
Main category: cs.CV
TL;DR: MaxInfo is a training-free frame selection method for Video LLMs that uses maximum volume principle to pick the most informative frames, reducing redundancy and improving video understanding performance without additional training.
Details
Motivation: Uniform frame sampling in Video LLMs often fails to capture critical information due to frame redundancy and content variations, leading to suboptimal video understanding performance.
Method: MaxInfo uses the maximum volume principle to select the most representative frames by maximizing the geometric volume formed by selected embeddings in the embedding space. It comes in Fast, Slow, and Chunk-based versions to handle different video lengths and computational requirements.
Result: Achieves significant improvements: 3.28% on LongVideoBench and 6.4% on EgoSchema for LLaVA-Video-7B; 3.47% on LongVideoBench for LLaVA-Video-72B; 3.44% on LongVideoBench for MiniCPM4.5. Works with existing VLLMs without training and has low latency.
Conclusion: MaxInfo provides a simple, practical, and effective alternative to uniform sampling that enhances video understanding by selecting more informative frames, improving performance across benchmarks while being training-free and compatible with existing VLLMs.
Abstract: Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, the first training-free method based on the maximum volume principle; available in Fast, Slow, and Chunk-based versions, it selects and retains the most representative frames from a video. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. Moreover, MaxInfo boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5. The approach is simple to implement and works with existing VLLMs without the need for additional training and with very low latency, making it a practical and effective alternative to traditional uniform sampling methods. Our code is available at https://github.com/FusionBrainLab/MaxInfo.git
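A minimal greedy reading of the max-volume principle: at each step, add the frame whose embedding most increases the Gram determinant (squared volume) of the selected set. This is an illustrative sketch, not the paper's Fast/Slow/Chunk implementations.

```python
import numpy as np

def select_max_volume_frames(embeddings: np.ndarray, k: int) -> list:
    """Greedily pick k frames whose embeddings span the largest volume.

    embeddings: (T, D) array of per-frame features.
    Returns sorted indices of the selected frames.
    """
    selected = [int(np.argmax(np.linalg.norm(embeddings, axis=1)))]
    for _ in range(k - 1):
        best_t, best_vol = None, -1.0
        for t in range(len(embeddings)):
            if t in selected:
                continue
            S = embeddings[selected + [t]]
            vol = np.linalg.det(S @ S.T)   # Gram determinant = squared volume
            if vol > best_vol:
                best_t, best_vol = t, vol
        selected.append(best_t)
    return sorted(selected)
```

Frames that are near-duplicates of already-selected ones add almost no volume, so redundancy is suppressed while diverse content is kept.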
[233] An Empirical Study of Methods for Small Object Detection from Satellite Imagery
Xiaohui Yuan, Aniv Chakravarty, Lichuan Gu, Zhenchun Wei, Elinor Lichtenberg, Tian Chen
Main category: cs.CV
TL;DR: Empirical review and evaluation of four state-of-the-art object detection methods for small objects in remote sensing imagery, focusing on car and bee box detection scenarios.
Details
Motivation: To understand the performance and technical challenges of object detection methods specifically for small objects in remote sensing imagery, which is crucial for applications like urban monitoring and agricultural management.
Method: Empirical evaluation of four state-of-the-art object detection methods identified from existing surveys and literature, using public high-resolution satellite image datasets with two application scenarios: car detection from urban images and bee box detection from agricultural lands.
Result: The paper provides comparative performance analysis and insights into the technical challenges of small object detection in remote sensing imagery, though specific numerical results are not provided in the abstract.
Conclusion: The empirical study offers valuable insights into method performance and technical challenges for small object detection in remote sensing applications, highlighting the need for specialized approaches for different object types and scenarios.
Abstract: This paper reviews object detection methods for finding small objects from remote sensing imagery and provides an empirical evaluation of four state-of-the-art methods to gain insights into method performance and technical challenges. In particular, we use car detection from urban satellite images and bee box detection from satellite images of agricultural lands as application scenarios. Drawing from the existing surveys and literature, we identify several top-performing methods for the empirical study. Public, high-resolution satellite image datasets are used in our experiments.
[234] Daily Land Surface Temperature Reconstruction in Landsat Cross-Track Areas Using Deep Ensemble Learning With Uncertainty Quantification
Shengjie Liu, Siqin Wang, Lu Zhang
Main category: cs.CV
TL;DR: DELAG is a deep ensemble learning method that reconstructs high-resolution Landsat land surface temperature data in complex urban areas by integrating annual temperature cycles and Gaussian processes, improving data availability to 4 scenes every 16 days.
Details
Motivation: Many applications need high spatiotemporal resolution LST data, but Landsat has limited revisit time and cloud cover issues, especially in complex urban areas where LST varies dramatically within and across city blocks.
Method: Deep ensemble learning method (DELAG) that integrates annual temperature cycles and Gaussian processes to reconstruct Landsat LST. Leverages cross-track characteristics and dual-satellite operation of Landsat since 2021 to enhance data availability to 4 scenes every 16 days.
Result: Successfully reconstructed LST in New York City, London, and Hong Kong under clear-sky (RMSE = 0.73-0.96 K) and heavily-cloudy (RMSE = 0.84-1.62 K) situations, outperforming existing methods. Can quantify uncertainty and estimate near-surface air temperature with comparable accuracy to clear-sky LST.
Conclusion: DELAG provides a novel and practical method for Landsat LST reconstruction, particularly suited for complex urban areas within Landsat cross-track areas, advancing high spatiotemporal resolution climate monitoring.
Abstract: Many real-world applications rely on land surface temperature (LST) data at high spatiotemporal resolution. In complex urban areas, LST exhibits significant variations, fluctuating dramatically within and across city blocks. Landsat provides high spatial resolution data at 100 meters but is limited by long revisit time, with cloud cover further disrupting data collection. Here, we propose DELAG, a deep ensemble learning method that integrates annual temperature cycles and Gaussian processes, to reconstruct Landsat LST in complex urban areas. Leveraging the cross-track characteristics and dual-satellite operation of Landsat since 2021, we further enhance data availability to 4 scenes every 16 days. We select New York City, London and Hong Kong from three different continents as study areas. Experiments show that DELAG successfully reconstructed LST in the three cities under clear-sky (RMSE = 0.73-0.96 K) and heavily-cloudy (RMSE = 0.84-1.62 K) situations, superior to existing methods. Additionally, DELAG can quantify uncertainty that enhances LST reconstruction reliability. We further tested the reconstructed LST to estimate near-surface air temperature, achieving results (RMSE = 1.48-2.11 K) comparable to those derived from clear-sky LST (RMSE = 1.63-2.02 K). The results demonstrate the successful reconstruction through DELAG and highlight the broader applications of LST reconstruction for estimating accurate air temperature. Our study thus provides a novel and practical method for Landsat LST reconstruction, particularly suited for complex urban areas within Landsat cross-track areas, taking one step toward addressing complex climate events at high spatiotemporal resolution. Code and data are available at https://skrisliu.com/delag
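The annual temperature cycle (ATC) component can be pictured as a sinusoidal least-squares fit over day of year, as in the sketch below. This is intuition for the ATC term only: DELAG itself combines such seasonal structure with Gaussian processes inside a deep ensemble, and the residuals of a fit like this would be one natural input for a GP.

```python
import numpy as np

def fit_annual_temperature_cycle(doy: np.ndarray, lst: np.ndarray) -> np.ndarray:
    """Least-squares fit of a sinusoidal annual temperature cycle,
    LST(d) = c + a*sin(2*pi*d/365.25) + b*cos(2*pi*d/365.25),
    to clear-sky observations; returns predictions for all 365 days of year.

    doy: day-of-year of each clear-sky observation; lst: observed LST (K).
    """
    w = 2 * np.pi * doy / 365.25
    A = np.column_stack([np.ones_like(w), np.sin(w), np.cos(w)])
    coef, *_ = np.linalg.lstsq(A, lst, rcond=None)
    days = np.arange(1, 366)
    wd = 2 * np.pi * days / 365.25
    return coef[0] + coef[1] * np.sin(wd) + coef[2] * np.cos(wd)
```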
[235] SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition
Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang
Main category: cs.CV
TL;DR: SciceVPR is a Visual Place Recognition model that uses DINOv2 backbone with multi-layer feature fusion and stable cross-image correlation distillation to produce robust global descriptors that handle domain shifts like illumination and viewpoint changes.
Details
Motivation: Current VPR models using DINOv2 only utilize its final output and suffer from unstable cross-image correlation, leading to inconsistent retrieval results. There's a need for more discriminative and stable global descriptors that can handle domain shifts.
Method: 1) Multi-layer feature fusion module to capture detailed channel and spatial information from DINOv2's multi-layer outputs. 2) Self-enhanced encoder that distills invariant correlation knowledge between images within a batch to produce robust features.
Result: SciceVPR-B outperforms SOTA one-stage methods on multiple datasets with varying domain conditions. SciceVPR-L performs on par with SOTA two-stage models, achieving over 3% higher Recall@1 on the challenging Tokyo24/7 dataset.
Conclusion: SciceVPR successfully leverages DINOv2’s full potential through multi-layer feature fusion and stable cross-image correlation distillation, producing robust global descriptors that handle domain shifts effectively while maintaining competitive performance.
Abstract: Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce global descriptors that are both discriminative and consistent, this paper proposes a stable cross-image correlation enhanced model for VPR called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. Specifically, SciceVPR first uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Secondly, SciceVPR considers the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 compared to existing models on the challenging Tokyo24/7 dataset. Our code will be released at https://github.com/shuimushan/SciceVPR.
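One simple way to picture the multi-layer feature fusion is a learned softmax weighting over the backbone's intermediate outputs followed by a projection, as in this hypothetical sketch; the paper's actual module captures channel and spatial detail beyond this.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Fuses a list of same-shaped (B, N, D) layer outputs from a ViT backbone
    into one descriptor map via learned per-layer weights and a projection."""
    def __init__(self, n_layers: int, dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: list) -> torch.Tensor:
        w = torch.softmax(self.layer_weights, dim=0)   # convex combination
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.proj(fused)
```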
[236] Simple Self Organizing Map with Visual Transformer
Alan Luo, Kaiwen Yuan
Main category: cs.CV
TL;DR: Vision Transformers underperform on small datasets due to lack of inductive biases. Self-Organizing Maps offer inherent topological preservation that can complement ViTs. This paper explores synergistic integration of ViTs and SOMs to improve performance on limited data tasks.
Details
Motivation: ViTs lack inductive biases and underperform on small datasets. Current solutions use indirect approaches like pretext tasks or CNN knowledge distillation. SOMs have inherent topological preservation properties that could directly address ViT limitations, but integration with modern deep learning architectures remains unexplored.
Method: Novel exploration of how Vision Transformers and Self-Organizing Maps can empower each other. The study bridges the research gap by investigating synergistic integration between these architectures.
Result: The architectures demonstrate synergistic enhancement, leading to significantly improved performance in both unsupervised and supervised tasks.
Conclusion: ViTs and SOMs can effectively complement each other, with SOMs providing the inductive biases that ViTs lack, particularly for small datasets. This integration represents a promising direction for improving vision models on limited data.
Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs in limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration on how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code is publicly available on GitHub.
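For context, a Self-Organizing Map boils down to a best-matching-unit search plus a neighborhood-weighted update, which is what gives it the topology-preserving bias the paper wants to lend to ViTs. A minimal NumPy version, with illustrative grid size and decay schedules:

```python
import numpy as np

def train_som(data, grid_h=10, grid_w=10, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: for each sample, find the best-matching unit (BMU) and pull
    it and its grid neighbors toward the sample, preserving grid topology."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    weights = rng.normal(size=(grid_h, grid_w, d))
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]          # grid coordinates
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1 - step / n_steps)         # linearly decayed rate
            sigma = sigma0 * (1 - step / n_steps) + 1e-3
            dist = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(np.argmin(dist), dist.shape)  # BMU
            h = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights
```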
[237] Illuminating Darkness: Learning to Enhance Low-light Images In-the-Wild
S M A Sharif, Abdur Rehman, Zain Ul Abidin, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
Main category: cs.CV
TL;DR: The paper introduces LSD, a large-scale 4K+ low-light smartphone dataset, and TFFormer, a hybrid model that separately encodes luminance and chrominance with cross-attention fusion for state-of-the-art low-light enhancement.
Details
Motivation: Single-shot low-light image enhancement (SLLIE) faces limitations due to scarce diverse real-world paired datasets, hindering model generalization and performance.
Method: 1) Create LSD dataset with 6,425 aligned low/normal-light pairs from 8,000+ scenes; 2) Propose TFFormer with separate LC encoding, cross-attention joint decoder, LC refinement, and LC-guided supervision.
Result: TFFormer achieves +2.45 dB PSNR improvement on LSD and +6.80 mAP on ExDark for low-light object detection, demonstrating superior performance and generalization.
Conclusion: The LSD dataset and TFFormer model significantly advance low-light image enhancement by providing high-quality training data and an effective architecture that improves both enhancement quality and downstream vision tasks.
Abstract: Single-shot low-light image enhancement (SLLIE) remains challenging due to the limited availability of diverse, real-world paired datasets. To bridge this gap, we introduce the Low-Light Smartphone Dataset (LSD), a large-scale, high-resolution (4K+) dataset collected in the wild across a wide range of challenging lighting conditions (0.1 to 200 lux). LSD contains 6,425 precisely aligned low and normal-light image pairs, selected from over 8,000 dynamic indoor and outdoor scenes through multi-frame acquisition and expert evaluation. To evaluate generalization and aesthetic quality, we collect 2,117 unpaired low-light images from previously unseen devices. To fully exploit LSD, we propose TFFormer, a hybrid model that encodes luminance and chrominance (LC) separately to reduce color-structure entanglement. We further propose a cross-attention-driven joint decoder for context-aware fusion of LC representations, along with LC refinement and LC-guided supervision to significantly enhance perceptual fidelity and structural consistency. TFFormer achieves state-of-the-art results on LSD (+2.45 dB PSNR) and substantially improves downstream vision tasks, such as low-light object detection (+6.80 mAP on ExDark).
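The luminance-chrominance separation can be performed with a standard color transform before the two encoder branches; the abstract does not commit to a specific transform, so the full-range BT.601 split below is an assumption.

```python
import torch

def split_luminance_chrominance(rgb: torch.Tensor):
    """Splits an RGB batch (B, 3, H, W) in [0, 1] into a luminance channel Y
    and a 2-channel chrominance tensor (Cb, Cr) via the BT.601 transform,
    so structure (Y) and color (CbCr) can be encoded by separate branches."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, torch.cat([cb, cr], dim=1)
```

Decoupling the two reduces the color-structure entanglement the paper targets: denoising artifacts in luminance no longer directly perturb color, and vice versa.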
[238] Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions
Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Main category: cs.CV
TL;DR: This paper introduces a new embedding-based data heterogeneity formulation for Federated Learning that better captures performance degradation in computer vision tasks beyond classification, showing up to 60% loss increase under FedAvg.
Details
Motivation: Existing FL research predominantly uses label distribution skew to emulate data heterogeneity, but this fails to fully capture heterogeneity in computer vision tasks beyond classification, exposing a gap in the literature.
Method: Utilize pre-trained deep neural networks to extract task-specific data embeddings, define task-specific data heterogeneity through embeddings, cluster data points based on embeddings, and distribute them among clients using Dirichlet distribution.
Result: Embedding-based heterogeneity formulation leads to up to ~60% increase in observed loss under FedAvg across seven representative computer vision tasks, indicating it more accurately exposes performance degradation caused by data heterogeneity.
Conclusion: The paper introduces a more accurate embedding-based data heterogeneity formulation for FL evaluation, reveals significant performance degradation not captured by label skew, and identifies open research directions for future work.
Abstract: Federated Learning (FL) has emerged as one of the prominent paradigms for distributed machine learning (ML). However, it is well-established that its performance can degrade significantly under non-IID (non-independent and identically distributed) data distributions across clients. To study this effect, the existing works predominantly emulate data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the data heterogeneity in computer vision tasks beyond classification, exposing an overlooked gap in the literature. Motivated by this, by utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. For instance, across seven representative computer vision tasks, our embedding-based heterogeneity formulation leads to up to around 60% increase in the observed loss under FedAvg, indicating that it more accurately exposes the performance degradation caused by data heterogeneity. We further unveil a series of open research directions that can be pursued.
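The described partitioning pipeline is easy to sketch: embed, cluster, then split each cluster across clients with Dirichlet-distributed proportions. The cluster count, alpha, and the choice of KMeans below are illustrative defaults, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def dirichlet_partition_by_embedding(embeddings, n_clients, n_clusters=10,
                                     alpha=0.5, seed=0):
    """Clusters task-specific embeddings, then splits each cluster across
    clients with Dirichlet proportions (smaller alpha = more heterogeneity).

    embeddings: (N, D) array; returns a list of index lists, one per client.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_clusters):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))   # per-client shares
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices
```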
[239] Text-to-Image Models and Their Representation of People from Different Nationalities Engaging in Activities
Abdulkareem Alsudais
Main category: cs.CV
TL;DR: T2I models (DALL-E 3, Gemini 3 Pro Preview) show systematic biases in depicting people from different nationalities, disproportionately showing traditional attire especially for Middle Eastern, African, and lower-income countries, even in inappropriate contexts.
Details
Motivation: To investigate how text-to-image models represent people from different nationalities in everyday activities and identify potential biases in their depictions.
Method: Generated 2,060 images using prompts specifying 206 nationalities across 5 everyday activities. Analyzed traditional attire prevalence, regional/income patterns, and used CLIP, ALIGN, GPT-4.1 mini to assess image-text alignment across 9,270 image-prompt pairs.
Result: 28.4% of images showed traditional attire, disproportionately affecting Middle East & North Africa and Sub-Saharan Africa, with income-linked patterns. Traditional attire images scored higher alignment when prompts included country names. One model frequently inserted “traditional” in revised prompts.
Conclusion: T2I models exhibit systematic representational biases shaped by multiple pipeline components (image generators, evaluation models, prompt revision), potentially reinforcing stereotypes about certain regions and income groups.
Abstract: This paper investigates how popular text-to-image (T2I) models, DALL-E 3 and Gemini 3 Pro Preview, depict people from 206 nationalities when prompted to generate images of individuals engaging in common everyday activities. Five scenarios were developed, and 2,060 images were generated using input prompts that specified nationalities across five activities. When aggregating across activities and models, results showed that 28.4% of the images depicted individuals wearing traditional attire, including attire that is impractical for the specified activities in several cases. This pattern was statistically significantly associated with regions, with the Middle East & North Africa and Sub-Saharan Africa disproportionately affected, and was also associated with World Bank income groups. Similar region- and income-linked patterns were observed for images labeled as depicting impractical attire in two athletics-related activities. To assess image-text alignment, CLIP, ALIGN, and GPT-4.1 mini were used to score 9,270 image-prompt pairs. Images labeled as featuring traditional attire received statistically significantly higher alignment scores when prompts included country names, and this pattern weakened or reversed when country names were removed. Revised prompt analysis showed that one model frequently inserted the word “traditional” (50.3% for traditional-labeled images vs. 16.6% otherwise). These results indicate that these representational patterns can be shaped by several components of the pipeline, including image generator, evaluation models, and prompt revision.
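The image-text alignment scoring used in the analysis can be reproduced in a few lines with an off-the-shelf CLIP checkpoint; the checkpoint name below is just an example, and the study also used ALIGN and GPT-4.1 mini as scorers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image_path: str, prompt: str) -> float:
    """CLIP image-text alignment score (temperature-scaled cosine similarity),
    the kind of metric used to compare prompts with and without country names."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()

# e.g. score the same generated image against a prompt with the nationality
# included and again with it removed, then compare the two scores.
```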
[240] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Main category: cs.CV
TL;DR: CPL introduces contrastive prompt learning with sparse prompts and contrastive regularization to improve task-aware prompts for All-in-One Image Restoration, achieving state-of-the-art performance across multiple benchmarks.
Details
Motivation: Current AiOIR approaches struggle with designing effective task-aware prompts. Adaptive prompt learning leads to overlapping/redundant representations, while explicit prompts from classifiers lose visual information needed for reconstruction.
Method: Contrastive Prompt Learning (CPL) with two components: Sparse Prompt Module (SPM) to capture degradation-aware representations with reduced redundancy, and Contrastive Prompt Regularization (CPR) to strengthen task boundaries using negative prompts across different degradation types.
Result: Extensive experiments across five benchmarks show CPL consistently boosts performance of strong AiOIR baselines, achieving state-of-the-art average performance across diverse scenarios.
Conclusion: CPL provides a general and robust solution for AiOIR by directly optimizing prompt-restoration model interaction rather than focusing on degradation classification, improving prompt-task alignment.
Abstract: All-in-One Image Restoration (AiOIR), which addresses diverse degradation types with a unified model, presents significant challenges in designing task-aware prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but discard critical visual information needed for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a framework that aims to improve prompt-task alignment through two complementary components: a Sparse Prompt Module (SPM) that efficiently captures degradation-aware representations while reducing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL directly optimizes the interaction between prompts and the restoration model. Extensive experiments across five benchmarks show that CPL consistently boosts the performance of strong AiOIR baselines across diverse scenarios. Our approach achieves state-of-the-art average performance on these benchmarks, providing a general and robust solution for AiOIR. The code is available at https://github.com/Aitical/CPLIR
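A plausible instantiation of the contrastive prompt regularization is a supervised-contrastive loss over prompt vectors, with prompts from other degradation types acting as negatives; the abstract does not give the exact CPR formulation, so this InfoNCE-style sketch is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_prompt_regularizer(prompts, degradation_ids, temperature=0.1):
    """Supervised-contrastive regularizer over learned prompts: prompts of the
    same degradation type attract, other degradation types act as negatives.

    prompts: (N, D) float tensor; degradation_ids: (N,) long tensor.
    """
    z = F.normalize(prompts, dim=1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.T / temperature).masked_fill(eye, float("-inf"))
    pos = (degradation_ids.unsqueeze(0) == degradation_ids.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -log_prob[pos].mean()   # average over all positive pairs
```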
[241] Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining
Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, Yang Gao
Main category: cs.CV
TL;DR: ISA adapts pre-trained FSS models to new domains without retraining by identifying and training domain-specific model structures using few-shot support samples during inference.
Details
Motivation: Cross-domain few-shot segmentation faces challenges with diverse target domains and limited support data. Existing methods require costly redesign and retraining of models using abundant source domain data, which is inefficient.
Method: 1) Adaptively identify domain-specific model structures using a novel structure Fisher score to measure parameter importance. 2) Progressively train selected informative structures with hierarchically constructed training samples from fewer to more support shots.
Result: Extensive experiments show superior performance across multiple CD-FSS benchmarks. The method effectively addresses domain shifts and provides flexible adaptation without redesigning or retraining models on base data.
Conclusion: ISA enables existing well-trained in-domain FSS models to adapt to new domains using few-shot support samples during inference, eliminating the need for costly retraining on source domain data while maintaining strong performance.
Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using abundant base data from the source domain, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for source domain retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks. Code is available at https://github.com/fanq15/ISA.
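The data-dependent importance measurement can be pictured as an empirical Fisher score: accumulate squared gradients of the few-shot support loss per parameter tensor, then select the highest-scoring structures for adaptation. The sketch below is a generic version of this idea, not the paper's exact structure Fisher score; aggregating the per-tensor scores up to blocks or modules is left to the caller.

```python
import torch

def structure_fisher_scores(model, support_loader, loss_fn, device="cpu"):
    """Empirical Fisher importance per named parameter tensor: mean squared
    gradient of the support loss. Higher scores mark structures worth adapting."""
    model.to(device).train()
    sq_grads = {n: torch.zeros_like(p)
                for n, p in model.named_parameters() if p.requires_grad}
    for inputs, targets in support_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sq_grads[n] += p.grad.detach() ** 2
    return {n: g.mean().item() for n, g in sq_grads.items()}
```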
[242] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang
Main category: cs.CV
TL;DR: MCITlib is a comprehensive library for Multimodal Continual Instruction Tuning that implements 8 algorithms, evaluates on 3 benchmarks with 2 backbone models to address challenges in Multimodal Continual Learning.
Details
Motivation: The rise of Multimodal Large Language Models (MLLMs) introduces new challenges in Multimodal Continual Learning (MCL), where models must handle both catastrophic forgetting and cross-modal coordination, requiring specialized tools and frameworks.
Method: Developed MCITlib, a comprehensive library that implements 8 representative algorithms for Multimodal Continual Instruction Tuning, with evaluations conducted on 3 benchmarks using 2 backbone models.
Result: Created an open-source library (MCITlib) that provides tools for MCL research, with code released at https://github.com/Ghy0501/MCITlib, and plans for continuous updates to support future developments in the field.
Conclusion: MCITlib advances MCL research by providing a standardized framework for evaluating and developing multimodal continual learning methods, addressing the unique challenges of catastrophic forgetting and cross-modal coordination in MLLMs.
Abstract: Continual learning enables AI systems to acquire new knowledge while retaining previously learned information. While traditional unimodal methods have made progress, the rise of Multimodal Large Language Models (MLLMs) brings new challenges in Multimodal Continual Learning (MCL), where models are expected to address both catastrophic forgetting and cross-modal coordination. To advance research in this area, we present MCITlib, a comprehensive library for Multimodal Continual Instruction Tuning. MCITlib currently implements 8 representative algorithms and conducts evaluations on 3 benchmarks under 2 backbone models. The library will be continuously updated to support future developments in MCL. The codebase is released at https://github.com/Ghy0501/MCITlib.
[243] GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao
Main category: cs.CV
TL;DR: GoMatching++ transforms image text spotters into video specialists by freezing the image model and adding a lightweight tracker, achieving state-of-the-art VTS performance with minimal training data and parameters.
Details
Motivation: Current video text spotters have limited recognition capability compared to image text spotters, even after extensive training. There's a need for parameter- and data-efficient methods that can leverage powerful image spotters for video tasks.
Method: Freezes an off-the-shelf image text spotter and introduces a lightweight trainable tracker. Key components: (1) rescoring mechanism to bridge image-video domain gap, (2) LST-Matcher to enhance frozen spotter’s video text handling. Explores various LST-Matcher architectures for efficiency.
Result: Sets new performance records on ICDAR15-video, DSText, and BOVText benchmarks while significantly reducing training costs. Introduces ArTVideo benchmark with over 30% curved text annotations to address lack of curved text datasets in VTS.
Conclusion: GoMatching++ provides an efficient way to transform image text spotters into video specialists, and the ArTVideo benchmark addresses the curved text gap in VTS. Both contributions are expected to advance video text spotting research.
Abstract: Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter’s ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
[244] TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection
Xinqi Xiong, Prakrut Patel, Qingyuan Fan, Amisha Wadhwa, Sarathy Selvam, Xiao Guo, Luchao Qi, Xiaoming Liu, Roni Sengupta
Main category: cs.CV
TL;DR: TalkingHeadBench is a new comprehensive benchmark and dataset for evaluating deepfake talking-head detection methods against state-of-the-art generators, addressing limitations of current outdated benchmarks.
Details
Motivation: Current deepfake detection benchmarks are outdated and don't reflect recent advances in generative models, failing to provide insights into detector robustness and generalization against modern threats.
Method: Created a multi-model multi-generator benchmark with deepfakes from leading academic/commercial models, designed protocols for assessing generalization under identity and generator distribution shifts, and benchmarked diverse detection methods (CNNs, vision transformers, temporal models).
Result: Provides comprehensive evaluation of detection methods, includes error analysis using Grad-CAM visualizations to expose failure modes and detector biases, and offers open access to all data splits and protocols.
Conclusion: TalkingHeadBench aims to accelerate research toward more robust and generalizable detection models to address rapidly evolving generative techniques in the face of substantial risks in media, politics, and finance.
Abstract: The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on https://huggingface.co/datasets/luchaoqi/TalkingHeadBench with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.
[245] Controllable Human-centric Keyframe Interpolation with Generative Prior
Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
Main category: cs.CV
TL;DR: PoseFuse3D-KI is a novel framework for controllable human-centric keyframe interpolation that integrates 3D human guidance (SMPL-X parameters) into video diffusion models to improve interpolation of complex articulated human motions.
Details
Motivation: Existing video interpolation methods struggle with complex articulated human motions due to lack of 3D geometric guidance, offering limited control over synthesized dynamics. The authors aim to address these limitations by incorporating 3D human pose information.Method: Propose PoseFuse3D-KI framework with: 1) PoseFuse3D control model featuring novel SMPL-X encoder that transforms 3D geometry/shape into 2D latent conditioning space, 2) Fusion network integrating 3D cues with 2D pose embeddings, 3) CHKI-Video dataset with 2D poses and 3D SMPL-X annotations for evaluation.
Result: Outperforms state-of-the-art baselines on CHKI-Video dataset with 9% improvement in PSNR and 38% reduction in LPIPS. Comprehensive ablations demonstrate improved interpolation fidelity from the PoseFuse3D model.
Conclusion: Integrating 3D human guidance signals into diffusion processes enables more controllable and higher-fidelity human-centric keyframe interpolation, particularly for complex articulated motions where previous methods struggled.
Abstract: Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.
[246] Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He
Main category: cs.CV
TL;DR: MAPLE uses MLLMs’ inherent alignment capabilities to guide cross-modal representation learning through preference optimization, achieving better fine-grained retrieval than CLIP-based approaches.
Details
Motivation: CLIP has a substantial modality gap in its feature space, and while MLLMs show inherent modality alignment properties, existing MLLM-based retrievers use coarse alignment mechanisms that limit their potential.Method: MAPLE leverages MLLMs’ fine-grained alignment priors through reinforcement learning with two components: (1) automatic preference data construction using off-the-shelf MLLMs, and (2) Relative Preference Alignment (RPA) loss that adapts Direct Preference Optimization to embedding learning.
Result: Experimental results show substantial gains in fine-grained cross-modal retrieval, demonstrating effectiveness in handling nuanced semantic distinctions.
Conclusion: Preference-guided alignment using MLLMs’ inherent capabilities effectively bridges the modality gap and improves cross-modal representation learning for fine-grained retrieval tasks.
Abstract: Despite Contrastive Language-Image Pretraining (CLIP)’s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
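The RPA loss is described only as a DPO adaptation to embedding learning; below is a minimal sketch of one plausible form, assuming cosine-similarity scores over MLLM-labeled preferred/dispreferred pairs (the function name and the beta scaling are our assumptions, not the paper's formulation).

```python
import torch
import torch.nn.functional as F

def rpa_style_loss(query_emb, pref_emb, dispref_emb, beta=0.1):
    """Hypothetical DPO-style preference loss over embedding similarities.

    The MLLM-preferred candidate should score higher than the dispreferred
    one; beta controls the sharpness of the preference, mirroring DPO's
    scaled log-sigmoid objective.
    """
    q = F.normalize(query_emb, dim=-1)
    s_pref = (q * F.normalize(pref_emb, dim=-1)).sum(-1)       # cosine similarity
    s_dispref = (q * F.normalize(dispref_emb, dim=-1)).sum(-1)
    return -F.logsigmoid(beta * (s_pref - s_dispref)).mean()
```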
[247] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Main category: cs.CV
TL;DR: OmniVCus is a diffusion Transformer framework for multi-subject video customization and control using depth/mask signals, featuring novel data construction and embedding mechanisms.
Details
Motivation: Existing methods focus on single-subject customization due to lack of multi-subject training data, and there's limited exploration of using control signals (depth, mask, camera, text) to edit subjects in customized videos.Method: 1) VideoCus-Factory pipeline for constructing multi-subject training data from raw videos without labels; 2) IVTM training with image editing data for instructive editing; 3) OmniVCus diffusion Transformer with Lottery Embedding (enables more subjects) and Temporally Aligned Embedding (extracts guidance from control signals).
Result: Method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations, with code, models, and data publicly released.
Conclusion: The proposed framework successfully addresses multi-subject video customization and control signal integration challenges through novel data construction and embedding mechanisms.
Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging and still under-explored problem is how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code, models, and data are released at https://github.com/caiyuanhao1998/Open-OmniVCus
[248] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation
Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu
Main category: cs.CV
TL;DR: GTTA is a general test-time augmentation method that works across vision and non-vision tasks by perturbing PCA subspace projections and using self-supervised learning to reduce computational cost.
Details
Motivation: Existing test-time augmentation methods lack generality and often require task-specific adaptations. The authors aim to create a general-purpose TTA method that works off-the-shelf for diverse tasks including classification, regression, segmentation, and detection.Method: GTTA uses a novel data transformation that randomly perturbs PCA subspace projections of test inputs to create diverse augmented samples. It includes a self-supervised learning stage where ensemble outputs act as an unsupervised teacher to train the initial single student model, reducing test-time computation.
Result: GTTA outperforms strong TTA approaches and state-of-the-art models across various vision and non-vision datasets including image classification, segmentation, pneumonia detection, speech recognition, and house price prediction. It also proves effective on the specialized task of salmon segmentation in underwater videos using the new DeepSalmon dataset.
Conclusion: GTTA provides a general, effective test-time augmentation method that works across diverse tasks without task-specific modifications, offering both performance improvements and computational efficiency through its self-supervised learning stage.
Abstract: We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model which, unlike other existing Test-Time Augmentation approaches from the literature, is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation, and object detection. By applying a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times, GTTA creates valid, highly diverse augmented samples from the data distribution, properties we theoretically show are essential for a Test-Time Augmentation method to be effective. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, significantly reducing the test-time computational cost. Our comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, pneumonia detection, speech recognition, and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
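A minimal sketch of the core PCA-perturbation idea, assuming a generic `model_fn` over flattened inputs; the component count, noise scale, and simple prediction averaging are our illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from sklearn.decomposition import PCA

def gtta_predict(model_fn, x_test, X_train, n_components=32,
                 n_aug=16, sigma=0.1, rng=None):
    """Sketch of GTTA-style test-time augmentation (our reading).

    Projects a test input onto a PCA subspace fitted on training data,
    randomly perturbs the projection coefficients, reconstructs the
    augmented inputs, and averages the model's predictions over them.
    """
    rng = rng or np.random.default_rng(0)
    pca = PCA(n_components=n_components).fit(X_train)
    z = pca.transform(x_test.reshape(1, -1))           # subspace coordinates
    scale = sigma * np.sqrt(pca.explained_variance_)   # per-axis noise scale
    zs = z + rng.normal(size=(n_aug, n_components)) * scale
    x_aug = pca.inverse_transform(zs)                  # back to input space
    preds = np.stack([model_fn(x) for x in x_aug])
    return preds.mean(axis=0)                          # ensemble output
```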
[249] One Graph to Track Them All: Dynamic GNNs for Single- and Multi-View Tracking
Martin Engilberge, Ivan Vrkic, Friedrich Wilke Grosche, Julien Pilet, Engin Turetken, Pascal Fua
Main category: cs.CV
TL;DR: A differentiable model for multi-people tracking that learns to associate detections into trajectories using a dynamic spatiotemporal graph, with a new large-scale multi-view dataset.
Details
Motivation: To create a unified, end-to-end differentiable model for multi-people tracking that doesn't rely on pre-computed tracklets, and to address occlusion challenges in complex multi-view scenarios.Method: Builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across sequences. The graph can encode scene-specific information to handle occlusions better.
Result: Achieves state-of-the-art performance on public benchmarks and the new dataset, demonstrating flexibility across diverse conditions.
Conclusion: The proposed differentiable model with dynamic spatiotemporal graph representation effectively handles multi-people tracking, especially in occluded scenarios, and the new large-scale dataset with 25 overlapping views will advance research in this field.
Abstract: This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.
[250] SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion
Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie
Main category: cs.CV
TL;DR: SplatSSC: A monocular 3D semantic scene completion framework using depth-guided Gaussian initialization and decoupled aggregation to improve efficiency and accuracy.
Details
Motivation: Existing object-centric SSC methods using 3D Gaussian primitives suffer from inefficient random initialization and outlier primitives that cause erroneous artifacts, limiting performance and efficiency.Method: Proposes SplatSSC with: 1) Depth-guided initialization using Group-wise Multi-scale Fusion (GMF) module to generate sparse representative Gaussian primitives, and 2) Decoupled Gaussian Aggregator (DGA) that separates geometric and semantic predictions during Gaussian-to-voxel splatting to handle outliers, plus a Probability Scale Loss.
Result: Achieves state-of-the-art on Occ-ScanNet dataset with over 6.3% improvement in IoU and 4.1% in mIoU, while reducing latency and memory cost by more than 9.3%.
Conclusion: SplatSSC effectively addresses limitations of random initialization and outlier primitives in Gaussian-based SSC, delivering superior performance with improved efficiency through principled depth guidance and robust aggregation.
Abstract: Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory cost by more than 9.3%.
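The paper does not spell out the DGA computation; the toy sketch below illustrates only the decoupling idea, with hard nearest-voxel assignment standing in for true Gaussian-to-voxel splatting, and all names and shapes being our assumptions.

```python
import torch

def decoupled_aggregate(means, weights, sem_logits, grid_shape, voxel_size):
    """Toy decoupled geometry/semantics aggregation (illustrative only).

    Occupancy and semantics are accumulated in separate buffers, so an
    outlier primitive's semantic logits cannot corrupt geometry and
    vice versa; real DGA operates during Gaussian-to-voxel splatting.
    """
    X, Y, Z = grid_shape
    idx = (means / voxel_size).long()                 # (N, 3) hard voxel indices
    valid = ((idx >= 0) & (idx < torch.tensor([X, Y, Z]))).all(-1)
    idx, w, s = idx[valid], weights[valid], sem_logits[valid]
    flat = (idx[:, 0] * Y + idx[:, 1]) * Z + idx[:, 2]

    occ = torch.zeros(X * Y * Z)
    occ.index_add_(0, flat, w)                        # geometric branch

    sem = torch.zeros(X * Y * Z, s.shape[-1])
    sem.index_add_(0, flat, w[:, None] * s)           # semantic branch
    sem = sem / occ.clamp(min=1e-6)[:, None]          # weight-normalized logits
    return occ.reshape(X, Y, Z), sem.reshape(X, Y, Z, -1)
```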
[251] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition
Rodrigo Verschae, Ignacio Bugueno-Cordova
Main category: cs.CV
TL;DR: evTransFER: A transfer learning framework for facial expression recognition using event-based cameras, achieving state-of-the-art results on synthetic and real neuromorphic datasets.
Details
Motivation: Event-based cameras offer unique advantages for facial expression recognition with microsecond latency and high dynamic range, but existing methods struggle with real sensor noise and sparse activity in neuromorphic data.Method: Transfer learning framework with adversarial generative training for facial reconstruction, then transferring encoder weights to FER system. Architecture includes LSTM for longer-term dynamics and introduces TIE event representation.
Result: Achieved 93.6% recognition rate on synthetic e-CK+ dataset and 76.7% accuracy on real NEFER dataset, surpassing state-of-the-art methods and outperforming models trained from scratch.
Conclusion: The transfer learning approach effectively handles real sensor noise and sparse activity in event-based facial expression recognition, demonstrating superior performance over training from scratch on both synthetic and real neuromorphic data.
Abstract: Event-based cameras are bio-inspired sensors that asynchronously capture pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing information on the spatiotemporal dynamics of a scene. We propose evTransFER, a transfer learning-based framework for facial expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode facial spatiotemporal dynamics, built by training an adversarial generative method on facial reconstruction and transferring the encoder weights to the facial expression recognition system. We demonstrate that the proposed transfer learning method improves facial expression recognition compared to training a network from scratch. We propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics and introduces a new event-based representation called TIE. We evaluated the framework using both the synthetic event-based facial expression database e-CK+ and the real neuromorphic dataset NEFER. On e-CK+, evTransFER achieved a recognition rate of 93.6%, surpassing state-of-the-art methods. For NEFER, which comprises event sequences with real sensor noise and sparse activity, the proposed transfer learning strategy achieved an accuracy of up to 76.7%. On both datasets, the results surpassed current methods and exceeded those of models trained from scratch.
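A minimal sketch of the transfer step with a stand-in encoder; the layer sizes, the event-tensor layout, and whether the encoder stays frozen during FER training are all our assumptions.

```python
import torch
import torch.nn as nn

def make_event_encoder():
    # Hypothetical stand-in for evTransFER's reconstruction-pretrained encoder.
    return nn.Sequential(nn.Conv2d(2, 32, 3, 2, 1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

class EventFER(nn.Module):
    """FER model reusing encoder weights from adversarial reconstruction."""
    def __init__(self, encoder_ckpt, n_classes=7, freeze=True):
        super().__init__()
        self.encoder = make_event_encoder()
        self.encoder.load_state_dict(torch.load(encoder_ckpt))  # transfer step
        if freeze:                               # freezing is our assumption
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.lstm = nn.LSTM(64, 128, batch_first=True)  # longer-term dynamics
        self.head = nn.Linear(128, n_classes)

    def forward(self, frames):                   # frames: (B, T, 2, H, W)
        B, T = frames.shape[:2]
        f = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])             # classify from last time step
```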
[252] CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning
Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao
Main category: cs.CV
TL;DR: CVBench is the first diagnostic benchmark for evaluating cross-video relational reasoning in multimodal LLMs, revealing significant performance gaps compared to humans and identifying architectural bottlenecks.
Details
Motivation: Current MLLMs perform well on single-video tasks but lack capability for spatiotemporal pattern reasoning across multiple videos, which is essential for real-world applications like multi-camera surveillance and cross-video procedural learning.Method: Created CVBench with 1,000 QA pairs across three hierarchical tiers: cross-video object association, cross-video event association, and cross-video complex reasoning. Built from five domain-diverse video clusters and evaluated 10+ leading MLLMs under zero-shot or chain-of-thought prompting.
Result: Top models like GPT-4o achieve only 63.5% accuracy on causal reasoning tasks vs. 91.3% human performance. Analysis reveals fundamental bottlenecks: deficient inter-video context retention and poor disambiguation of overlapping entities.
Conclusion: CVBench establishes a rigorous framework for advancing pattern recognition in multi-video scenarios and provides architectural insights for next-generation models, highlighting critical gaps in current MLLM capabilities.
Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their capability for spatiotemporal pattern reasoning across multiple videos remains a critical gap in pattern recognition research. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first diagnostic benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to analyze and integrate spatiotemporal patterns from dynamic visual streams. Extensive evaluation of 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot or chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 63.5% accuracy on causal reasoning tasks, compared to the 91.3% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLMs architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for advancing pattern recognition methodologies in multi-video scenarios, providing architectural insights for next-generation models. The data and evaluation code are available at: https://github.com/Hokhim2/CVBench.
[253] Aligned Anchor Groups Guided Line Segment Detector
Zeyu Li, Annan Shu
Main category: cs.CV
TL;DR: AAGLSD is a novel line segment detector using hierarchical anchor groups for high-precision detection with simple validation instead of complex refinement.
Details
Motivation: To develop a line segment detector that achieves both high precision and completeness in extracting line segments from images, addressing limitations of existing methods.Method: Uses hierarchical approach with regular anchors and aligned anchor groups, sequentially linking anchors while updating predicted line segments, followed by simple validation and merging of adjacent segments.
Result: Quantitative experiments on various datasets show AAGLSD effectively extracts complete line segments compared to other advanced detectors.
Conclusion: AAGLSD provides an effective solution for line segment detection with high precision and completeness, avoiding complex refinement strategies while maintaining performance.
Abstract: This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at https://github.com/zyl0609/AAGLSD.
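A toy version of the grow-and-refit loop at the heart of the method, with perpendicular-distance and angle gates as our assumed validation criteria; AAGLSD's actual linking rules over saliency levels are richer than this.

```python
import numpy as np

def link_anchors(seed_group, candidates, dist_tol=2.0, angle_tol=np.deg2rad(10)):
    """Toy anchor-linking loop in the spirit of AAGLSD (our reading).

    Starting from an aligned anchor group, greedily append each candidate
    anchor that stays consistent with the current line fit, refitting the
    segment after every acceptance.
    """
    chain = list(seed_group)
    for p in candidates:                        # assumed pre-ordered along the edge
        pts = np.asarray(chain, dtype=float)
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        d = vt[0]                               # current line direction
        perp = np.array([-d[1], d[0]])
        if abs((p - centroid) @ perp) > dist_tol:
            continue                            # too far from the fitted line
        step = p - pts[-1]
        cosang = abs(step @ d) / (np.linalg.norm(step) + 1e-9)
        if np.arccos(np.clip(cosang, -1.0, 1.0)) < angle_tol:
            chain.append(p)                     # accept; line refits next iteration
    return np.asarray(chain)
```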
[254] Bidirectional Sparse Attention for Faster Video Diffusion Training
Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang
Main category: cs.CV
TL;DR: BSA is a Bidirectional Sparse Attention framework that dynamically sparsifies both Queries and Key-Value pairs in video DiT models, achieving up to 20x FLOPs reduction and 17.79x faster attention training while maintaining generative quality.
Details
Motivation: Video diffusion Transformer models face computational bottlenecks due to quadratic complexity of full attention when generating high-resolution, long-duration videos. Full attention suffers from excessive computation from sparse Q-KV pairs and redundant computation from fixed sparse patterns that don't leverage DiT's dynamic attention.Method: Proposes Bidirectional Sparse Attention (BSA) framework with two key components: 1) Query sparsity optimization via semantic similarity selection of informative query tokens with dynamic spatial-time training strategy, and 2) KV sparsity achieved by computing statistical dynamic threshold to retain only the most salient KV blocks.
Result: BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
Conclusion: BSA effectively overcomes computational bottlenecks in video DiT models by dynamically sparsifying both queries and key-value pairs, making high-resolution, long-duration video generation more computationally feasible without sacrificing quality.
Abstract: Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT’s dynamic attention. To overcome these limitations, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity, combined with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
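A sketch of the KV-side selection only, assuming block-mean key scoring and a mean-plus-std gate; BSA's exact scoring function, block size, and block-sparse kernel integration are not specified here.

```python
import torch

def select_kv_blocks(q, k, block_size=64, alpha=1.0):
    """Sketch of BSA-style KV sparsity via a statistical dynamic threshold.

    Scores each KV block by its mean attention relevance to the queries
    and keeps only blocks whose score exceeds mean + alpha * std.
    """
    B, H, N, D = k.shape
    nb = N // block_size                       # trailing tokens beyond full
    kb = k[:, :, :nb * block_size]             # blocks are ignored in this sketch
    kb = kb.reshape(B, H, nb, block_size, D).mean(3)
    # (B, H, nb): average relevance of each key block to all queries
    scores = torch.einsum("bhqd,bhnd->bhqn", q, kb).mean(2) / D ** 0.5
    thresh = scores.mean(-1, keepdim=True) + alpha * scores.std(-1, keepdim=True)
    keep = scores >= thresh                    # dynamic, per-head block mask
    return keep                                # feed into a block-sparse attention kernel
```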
[255] A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation
Melika Sabaghian, Mohammad Ali Keyvanrad, Seyyedeh Mahila Moghadami
Main category: cs.CV
TL;DR: A three-stage compression pipeline for YOLOv8 that combines sparsity-aware training, structured channel pruning, and channel-wise knowledge distillation achieves 73.5% parameter reduction with minimal accuracy loss, enabling real-time aerial object detection on edge devices.
Details
Motivation: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance, particularly for real-time applications on edge devices.Method: A novel three-stage compression pipeline: 1) Sparsity-aware training with dynamic sparsity during optimization, 2) Structured channel pruning using batch normalization scaling factors to eliminate redundant channels, and 3) Channel-Wise Knowledge Distillation (CWD) with adjustable temperature and loss weighting to recover accuracy after pruning.
Result: For YOLOv8m: Parameters reduced from 25.85M to 6.85M (73.51% reduction), FLOPs from 49.6G to 13.3G, MACs from 101G to 34.5G, with only 2.7% AP50 drop. Inference speed increased from 26 FPS to 45 FPS, and further to 68 FPS with TensorRT optimization (AP50: 47.9 to 47.6).
Conclusion: The proposed compression pipeline effectively balances model size reduction and detection accuracy, enabling real-time deployment of aerial object detection models on resource-constrained edge devices while maintaining practical performance levels.
Abstract: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.
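Stage 2 (structured pruning via batch normalization scaling factors) follows a well-established recipe; a minimal sketch with a hypothetical global pruning ratio, leaving the physical rebuild of the surrounding conv layers out.

```python
import torch
import torch.nn as nn

def bn_channel_scores(model):
    """Collect |gamma| from every BatchNorm2d as channel-importance scores."""
    return [(name, m.weight.detach().abs())
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)]

def global_prune_mask(model, prune_ratio=0.7):
    """Sketch of BN-gamma structured channel pruning (our simplification).

    Ranks all BN scaling factors globally and marks the smallest
    `prune_ratio` fraction of channels for removal; the conv layers
    feeding those channels would then be rebuilt without them.
    """
    scores = bn_channel_scores(model)
    all_gamma = torch.cat([g for _, g in scores])
    threshold = torch.quantile(all_gamma, prune_ratio)
    return {name: g > threshold for name, g in scores}   # True = keep channel
```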
[256] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM
Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin, Fei Yu, Wei Zhou, Yanfei Zhong, Yang Liu, Dingkang Yang
Main category: cs.CV
TL;DR: The paper introduces ChangeIMTI, a large-scale interactive multi-task instruction dataset for remote sensing change understanding, and proposes ChangeVG, a vision-guided vision-language model with dual-granularity awareness for bi-temporal remote sensing images.
Details
Motivation: Existing remote sensing change understanding datasets lack deep understanding and interactions across diverse tasks like change captioning, counting, and localization. There's a need for comprehensive multi-task datasets and models that can handle these complementary tasks effectively.Method: 1) Construct ChangeIMTI dataset covering four tasks: change captioning, binary change classification, change counting, and change localization. 2) Design ChangeVG model with vision-guided module featuring dual-branch architecture combining fine-grained spatial feature extraction with high-level semantic summarization. 3) Use enriched representations as auxiliary prompts to guide large vision-language models during instruction tuning for hierarchical cross-modal learning.
Result: The method outperforms the strongest baseline Semantic-CC by 1.39 points on the comprehensive S*m metric for change captioning task. Extensive experiments across all four tasks demonstrate superiority, with ablation studies validating critical components.
Conclusion: The proposed ChangeIMTI dataset and ChangeVG model effectively address limitations in remote sensing change understanding by providing comprehensive multi-task capabilities and hierarchical cross-modal learning through vision-guided architecture.
Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks including change captioning, binary change classification, change counting, and change localization. Building upon this new dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations further serve as the auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating the hierarchical cross-modal learning. We conduct extensive experiments across all four tasks to demonstrate the superiority of our approach. Remarkably, on the change captioning task, our method outperforms the strongest method, Semantic-CC, by 1.39 points on the comprehensive S*m metric, which integrates semantic similarity and descriptive accuracy to provide an overall evaluation of change captions. Moreover, we also perform a series of ablation studies to examine the critical components of our method. The source code and associated data for this work are publicly available on GitHub.
[257] Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss
Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou
Main category: cs.CV
TL;DR: A new unsupervised online 3D instance segmentation framework that improves upon UNIT by using synthetic point cloud sequence generation, flexible temporal sampling, and dynamic-weighting loss, achieving better performance on major datasets.
Details
Motivation: Existing unsupervised online 3D instance segmentation methods like UNIT have limitations: limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. There's a need for better methods that can maintain consistent object identities across LiDAR scans without annotated data.Method: 1) Synthetic point cloud sequence generation to enrich training distribution without manual labels or simulation engines; 2) Flexible temporal sampling strategy using both adjacent and non-adjacent frames to capture long-range dependencies and short-term variations; 3) Dynamic-weighting loss that emphasizes confident and informative samples for more robust representations.
Result: The method consistently outperforms UNIT and other unsupervised baselines on SemanticKITTI, nuScenes, and PandaSet datasets, achieving higher segmentation accuracy and more robust temporal associations.
Conclusion: The proposed framework effectively addresses limitations of existing unsupervised 3D instance segmentation methods through synthetic data generation, flexible temporal modeling, and adaptive loss weighting, demonstrating superior performance across multiple benchmark datasets.
Abstract: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.
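The exact weighting function is not given in the summary; one plausible minimal form of the dynamic-weighting loss, with the power exponent and the normalization step as our assumptions.

```python
import torch

def dynamic_weighted_loss(per_sample_loss, confidence, gamma=2.0):
    """Sketch of a confidence-based dynamic weighting (our reading).

    Samples with high pseudo-label confidence contribute more to the
    objective; this power form with a mean-normalizing step is
    illustrative only, not the paper's exact function.
    """
    w = confidence.clamp(0, 1) ** gamma
    w = w / (w.mean() + 1e-8)                  # keep the overall loss scale stable
    return (w * per_sample_loss).mean()
```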
[258] Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation
Alexander Goslin
Main category: cs.CV
TL;DR: Terrain Diffusion: A generative framework using diffusion models for infinite, seamless terrain generation with procedural noise-like properties but higher realism.
Details
Motivation: Procedural noise functions (like Perlin noise) have been standard for decades but are fundamentally limited in realism and large-scale coherence. The paper aims to bridge diffusion model fidelity with procedural noise's essential properties: seamless infinite extent, seed-consistency, and constant-time random access.Method: Introduces InfiniteDiffusion algorithm for infinite generation on unbounded domains, hierarchical stack of diffusion models to couple planetary context with local detail, compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors.
Result: The framework outpaces orbital velocity by 9 times on a consumer GPU, enabling realistic terrain generation at interactive rates while maintaining procedural noise properties.
Conclusion: Terrain Diffusion positions diffusion models as a practical, scalable foundation for the next generation of infinite virtual worlds, combining diffusion model fidelity with procedural noise’s essential properties.
Abstract: For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, a generative framework that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation that reformulates standard diffusion sampling for unbounded domains. While noise functions remain near-instant, our framework outpaces orbital velocity by 9 times on a consumer GPU, enabling realistic terrain generation at interactive rates. We integrate a hierarchical stack of diffusion models to couple planetary context with local detail, a compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and an open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors. Together, these components position diffusion models as a practical, scalable foundation for the next generation of infinite virtual worlds.
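InfiniteDiffusion itself is not reproduced here; the sketch below illustrates only the seed-consistency and constant-time random access properties the framework preserves, using a hash-seeded per-tile latent of our own construction.

```python
import hashlib
import numpy as np

def tile_latent(seed: int, tx: int, ty: int, shape=(64, 64)):
    """Seed-consistent, constant-time random access to an infinite latent field.

    The same (seed, tile) pair always yields the same latent, with no
    dependence on generation order — the procedural-noise property that
    Terrain Diffusion carries over to diffusion sampling. The sampler that
    turns such latents into seamless terrain is not reproduced here.
    """
    key = hashlib.sha256(f"{seed}:{tx}:{ty}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(key[:8], "little"))
    return rng.standard_normal(shape)

# Any tile can be queried directly, in any order, and is reproducible:
assert np.allclose(tile_latent(42, 10, -3), tile_latent(42, 10, -3))
```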
[259] Bridging The Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation
Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang
Main category: cs.CV
TL;DR: IUT-Plug: Neuro-symbolic structured state tracking mechanism using Image Understanding Trees to prevent multimodal context drift in VLMs during extended image-text interactions.
Details
Motivation: Existing Vision Language Models struggle with preserving logic, entity identity, and artistic style during extended interleaved image-text interactions due to "Multimodal Context Drift" - the decay or entanglement of implicit neural representations over long sequences.Method: Proposes IUT-Plug, a model-agnostic framework with Image Understanding Tree (IUT) as explicit persistent memory. It parses visual scenes into hierarchical symbolic structures (entities, attributes, relationships), performs incremental state updates to lock invariant properties while modifying changing elements, and guides generation through topological constraints.
Result: Evaluated on novel benchmark of 3,000 human-annotated samples. IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines.
Conclusion: Explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation, confirming the value of neuro-symbolic approaches over purely neural methods.
Abstract: Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as “Multimodal Context Drift”, which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.
[260] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
Main category: cs.CV
TL;DR: CoT-PL introduces visual chain-of-thought reasoning for open-vocabulary object detection, improving pseudo-label quality in crowded/occluded scenes by decomposing object understanding into three interpretable steps with contrastive background learning.
Details
Motivation: Current open-vocabulary detection methods rely too heavily on direct image-text matching, neglecting intermediate reasoning steps needed for complex scenes, leading to limited robustness in crowded or occluded contexts.Method: CoT-PL framework uses structured visual chain-of-thought reasoning in pseudo-labeling with three steps: region perception for unseen objects, category recognition via zero-shot reasoning, and background grounding. Includes contrastive background learning (CBL) that uses background cues as negatives for feature disentanglement.
Result: Achieves 103.4% and 168.4% relative improvements in novel-class pseudo-label quality over prior methods in crowded/occluded scenes. Gains +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting new state-of-the-art.
Conclusion: CoT-PL demonstrates that structured visual reasoning significantly improves open-vocabulary detection robustness, especially in challenging crowded/occluded scenarios, through interpretable decomposition and contrastive background learning.
Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available at https://github.com/hchoi256/cotpl.
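A minimal InfoNCE-style rendering of contrastive background learning, with the pre-computed background embeddings as negatives; the temperature and the exact positive/negative construction are our assumptions.

```python
import torch
import torch.nn.functional as F

def cbl_loss(obj_feat, text_feat, bg_feats, tau=0.07):
    """Sketch of CoT-PL's contrastive background learning idea (our form).

    Pulls an object region embedding toward its category text embedding
    while pushing it away from background embeddings, promoting the
    object/background feature disentanglement the paper describes.
    """
    obj = F.normalize(obj_feat, dim=-1)          # (B, D) region embeddings
    txt = F.normalize(text_feat, dim=-1)         # (B, D) category texts
    bg = F.normalize(bg_feats, dim=-1)           # (K, D) background cues
    pos = (obj * txt).sum(-1, keepdim=True) / tau          # (B, 1)
    neg = obj @ bg.T / tau                                 # (B, K)
    logits = torch.cat([pos, neg], dim=-1)
    target = torch.zeros(obj.size(0), dtype=torch.long)    # positive at index 0
    return F.cross_entropy(logits, target)
```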
[261] Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Xiaoqing Lan, Biqiao Xin, Bingshu Wang, Han Zhang, Rui Zhu, Laixian Zhang
Main category: cs.CV
TL;DR: Proposes a wavelet-based single-frame enhancement and Hungarian algorithm-based multi-frame trajectory completion method for GEO space object detection in optical imaging.
Details
Motivation: GEO space objects are difficult to detect in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference, requiring improved detection methods.Method: Uses wavelet transform for single-frame high-frequency feature enhancement and background noise suppression, then applies Hungarian algorithm for multi-frame trajectory completion with post-processing steps including temporal matching, interpolation, noise filtering, and trajectory refinement.
Result: Achieves 90.14% F1 score on the public SpotGEO dataset, demonstrating effectiveness of the proposed approach.
Conclusion: The proposed method effectively addresses GEO space object detection challenges through combined single-frame enhancement and multi-frame trajectory completion, achieving high performance on benchmark datasets.
Abstract: Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps, including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement, is designed into the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F1 score of 90.14%.
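The cross-frame matching step maps directly onto scipy's assignment solver; a minimal sketch, with the distance gate value as our assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev_pts, curr_pts, max_dist=5.0):
    """Cross-frame matching with the Hungarian algorithm.

    Builds a pairwise distance cost matrix between detections in two
    consecutive frames, solves the globally optimal assignment, and
    rejects pairs whose distance exceeds a gate.
    """
    cost = np.linalg.norm(prev_pts[:, None] - curr_pts[None, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)        # globally optimal pairing
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched_prev = set(range(len(prev_pts))) - {r for r, _ in matches}
    return matches, unmatched_prev                  # unmatched -> interpolation step
```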
[262] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Main category: cs.CV
TL;DR: VADTree: A training-free video anomaly detection method using hierarchical granularity-aware tree structure for flexible temporal sampling, leveraging pre-trained GEBD model for event boundary detection and VLMs/LLMs for anomaly reasoning.
Details
Motivation: Current training-free VAD methods use fixed-length temporal windows that struggle to capture anomalies with varying temporal spans. Supervised methods require substantial training data and lack clear anomaly explanations.Method: Proposes VADTree with Hierarchical Granularity-aware Tree (HGTree) structure. Uses pre-trained GEBD model to detect event boundaries, decomposes video into generic event nodes, performs adaptive coarse-fine hierarchical structuring with redundancy removal. Injects multi-dimensional priors into VLMs for node-wise anomaly perception, uses LLMs for anomaly reasoning, and integrates multi-granularity anomaly scores via inter-cluster node correlation.
Result: Achieves state-of-the-art performance in training-free settings on three challenging datasets while drastically reducing the number of sampled video segments.
Conclusion: VADTree effectively addresses the limitation of fixed-length temporal sampling in VAD by providing flexible hierarchical granularity-aware structure, enabling accurate anomaly detection across varying temporal spans without requiring training data.
Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
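A toy two-level decomposition from per-frame boundary confidences; the fixed thresholds and two-level depth are our simplifications of VADTree's adaptive coarse-fine structuring and redundancy removal.

```python
def build_hgtree_nodes(boundary_conf, coarse_thr=0.8, fine_thr=0.5):
    """Sketch of HGTree construction from GEBD confidences (our reading).

    High-confidence boundaries cut the video into coarse event nodes;
    within each coarse node, lower-confidence boundaries produce finer
    children, yielding a granularity-aware hierarchy.
    """
    def split(lo, hi, thr):
        cuts = [i for i in range(lo + 1, hi) if boundary_conf[i] >= thr]
        edges = [lo] + cuts + [hi]
        return list(zip(edges[:-1], edges[1:]))     # (start, end) segments

    n = len(boundary_conf)
    tree = []
    for lo, hi in split(0, n, coarse_thr):          # coarse event nodes
        children = split(lo, hi, fine_thr)          # finer sub-events
        tree.append({"span": (lo, hi), "children": children})
    return tree
```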
[263] Towards Generalisable Foundation Models for Brain MRI
Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander
Main category: cs.CV
TL;DR: BrainFound is a self-supervised foundation model for 3D brain MRI that extends DINO-v2 to handle volumetric data, supporting multimodal inputs and outperforming existing methods in label-scarce settings.
Details
Motivation: Foundation models are transforming medical imaging, but existing approaches often focus on 2D natural images or single-slice paradigms, lacking proper adaptation for 3D brain MRI data and multimodal clinical scenarios.Method: Extends DINO-v2 vision transformer to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices. Supports both single- and multimodal inputs (e.g., T1, T2, FLAIR) and enables various downstream tasks like disease detection and segmentation.
Result: Consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. Enhances diagnostic accuracy and reduces dependency on extensive expert annotations.
Conclusion: BrainFound provides a scalable and practical solution for 3D neuroimaging pipelines with significant potential for clinical deployment and research innovation, offering flexibility across varied imaging protocols and clinical scenarios.
Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.
[264] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
Main category: cs.CV
TL;DR: EgoMAN: A new dataset and model for 3D hand trajectory prediction that links semantic reasoning with motion generation through a trajectory-token interface.
Details
Motivation: Prior 3D hand trajectory prediction works are limited by datasets that separate motion from semantic supervision, and models that weakly connect reasoning with action. There's a need for better integration of semantic understanding and motion prediction.Method: 1) Created EgoMAN dataset with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. 2) Developed EgoMAN model - a reasoning-to-motion framework that connects vision-language reasoning with motion generation via trajectory-token interface, trained progressively to align reasoning with motion dynamics.
Result: The approach produces accurate and stage-aware trajectories with generalization across real-world scenes, addressing the limitations of previous methods.
Conclusion: The EgoMAN framework successfully bridges the gap between semantic reasoning and motion generation for 3D hand trajectory prediction, enabling more context-aware and accurate trajectory forecasting.
Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
[265] Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification
Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann
Main category: cs.CV
TL;DR: Proposes a NAS search space for hybrid CNN-ViT architectures optimized for tinyML deployment, achieving better accuracy and speed than ResNet models under size constraints.
Details
Motivation: Hybrid CNN-ViT architectures outperform pure CNN or ViT but are too large/computationally expensive for tinyML deployment. Need efficient hybrid models that fit tight size constraints.
Method: Introduces a new Neural Architecture Search (NAS) search space covering hybrid CNN and ViT blocks for local/global information, plus novel searchable Pooling blocks for efficient feature map reduction.
Result: On CIFAR10, the proposed search space produces hybrid CNN-ViT architectures with superior accuracy and inference speed compared to ResNet-based tinyML models under tight model size constraints.
Conclusion: The proposed NAS search space enables discovery of efficient hybrid CNN-ViT architectures suitable for tinyML deployment, balancing accuracy and computational efficiency.
Abstract: Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT) have outperformed pure CNN or ViT architectures. However, since these architectures require a large number of parameters and incur high computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as the novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.
[266] FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI
Hao Li, Zhenfeng Zhuang, Jingyu Lin, Yu Liu, Yifei Chen, Qiong Peng, Lequan Yu, Liansheng Wang
Main category: cs.CV
TL;DR: FDP framework improves brain MRI anomaly detection by leveraging frequency-domain analysis to suppress pathology while preserving anatomy, achieving 17.63% DICE score improvement with LDM.
Details
Motivation: Supervised anomaly detection for brain MRI is challenging due to anatomical diversity and scarce annotated data. Current unsupervised methods use artificial noise perturbations that lack biophysical fidelity and morphological complexity of real clinical lesions.
Method: Frequency-Decomposition Preprocessing (FDP) framework based on systematic frequency-domain analysis revealing that anomalies have unique frequency patterns distinguishable from normal anatomy, and low-frequency signals maintain consistent representations across healthy scans. FDP leverages frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation.
Result: FDP consistently improves anomaly detection performance when integrated with existing methods, achieving 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines.
Conclusion: FDP is the first UAD method to leverage frequency-domain reconstruction for pathology suppression and anatomical preservation, can seamlessly integrate with existing anomaly simulation techniques, and consistently enhances detection performance across diverse architectures while maintaining diagnostic fidelity.
Abstract: Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize artificially generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual maps. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework, the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines. The code is available at https://github.com/ls1rius/MRI_FDP.
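To make the frequency split concrete, here is a minimal sketch on a single 2D slice using a radial FFT mask; the function name and the `cutoff` fraction are illustrative assumptions, and the paper's actual decomposition may differ:
```python
import torch

def frequency_decompose(slice_2d: torch.Tensor, cutoff: float = 0.1):
    """Split a 2D slice into low- and high-frequency parts via an FFT mask."""
    h, w = slice_2d.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(slice_2d))
    # Normalized radial frequency grid; 0 at the center (DC component).
    fy = torch.linspace(-0.5, 0.5, h).view(-1, 1)
    fx = torch.linspace(-0.5, 0.5, w).view(1, -1)
    radius = torch.sqrt(fy ** 2 + fx ** 2)
    low_mask = (radius <= cutoff).to(spectrum.dtype)
    # Low band: the component the paper finds consistent across healthy scans.
    low = torch.fft.ifft2(torch.fft.ifftshift(spectrum * low_mask)).real
    return low, slice_2d - low  # (low-frequency, high-frequency) components
```
In a preprocessing pipeline of this kind, the stable low band would feed normal-anatomy reconstruction while the residual carries the anomaly-discriminative detail.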
[267] Inference-based GAN Video Generation
Jingbo Yang, Adrian G. Bors
Main category: cs.CV
TL;DR: Proposes a VAE-GAN hybrid model with Markov chain recall mechanism for generating long, coherent videos of hundreds/thousands of frames while maintaining temporal continuity.
Details
Motivation: Existing video generation models (GANs, VAEs, Diffusion Networks) struggle with generating long sequences beyond 16 frames, suffering from degraded quality and lack of meaningful scene successions when scaling temporally.
Method: 1) Proposes VAE-GAN hybrid structure with content and movement branches; 2) Extends with Markov chain framework where each state represents a short VAE-GAN generator; 3) Uses recall mechanism to sequentially connect generated sub-sequences while maintaining temporal dependencies.
Result: Enables generation of long videos composed of hundreds or thousands of frames with temporal continuity, consistency, and dynamics, overcoming limitations of existing models.
Conclusion: The proposed memory-efficient approach successfully generates meaningful long video sequences by leveraging Markov chain recall mechanism with VAE-GAN hybrid generators, addressing the temporal scaling challenge in video generation.
Abstract: Video generation has seen remarkable progress thanks to advancements in generative deep learning. However, generating long sequences remains a significant challenge. Generated videos should not only display coherent and continuous movement but also meaningful movement in successions of scenes. Models such as GANs, VAEs, and Diffusion Networks have been used for generating short video sequences, typically up to 16 frames. In this paper, we first propose a new type of video generator by enabling adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure. The proposed model, as in other video deep learning-based processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. Classical approaches often result in degraded video quality when attempting to increase the generated video length, especially for significantly long sequences. To overcome this limitation, our research study extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames ensuring their temporal continuity, consistency and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, where each state represents a short-length VAE-GAN video generator. This setup enables the sequential connection of generated video sub-sequences, maintaining temporal dependencies and resulting in meaningful long video sequences.
[268] ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection
Mohammad Romani
Main category: cs.CV
TL;DR: ForensicFlow: A multi-branch deepfake detection framework that fuses evidence from global visual inconsistencies, fine-grained texture anomalies, and spectral noise patterns using attention-based temporal pooling and adaptive fusion, achieving state-of-the-art performance on CelebDF(v2).
Details
Motivation: Modern deepfakes create subtle, domain-specific artifacts that single-branch networks often miss, requiring a more comprehensive approach that examines multiple forensic dimensions simultaneously to detect sophisticated forgeries.
Method: Three-branch architecture: 1) ConvNeXt-tiny for global visual inconsistencies, 2) Swin Transformer-tiny for fine-grained texture anomalies, 3) CNN with channel attention for spectral noise patterns. Uses attention-based temporal pooling to prioritize high-evidence frames and adaptive fusion to weight each branch according to forgery type. Trained on CelebDF(v2) with Focal Loss.
Result: Achieves AUC 0.9752, F1 0.9408, and accuracy 0.9208 on CelebDF(v2), outperforming single-stream detectors. Ablation studies confirm branch synergy, and Grad-CAM visualizations validate focus on genuine manipulation regions like facial boundaries.
Conclusion: The multi-domain fusion strategy establishes robustness against increasingly sophisticated forgeries by comprehensively analyzing multiple forensic dimensions, demonstrating that combining global, texture, and spectral evidence significantly improves deepfake detection performance.
Abstract: Modern deepfakes evade detection by leaving subtle, domain-specific artifacts that single-branch networks miss. ForensicFlow addresses this by fusing evidence across three forensic dimensions: global visual inconsistencies (via ConvNeXt-tiny), fine-grained texture anomalies (via Swin Transformer-tiny), and spectral noise patterns (via CNN with channel attention). Our attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive fusion weights each branch according to forgery type. Trained on CelebDF(v2) with Focal Loss, the model achieves AUC 0.9752, F1 0.9408, and accuracy 0.9208, outperforming single-stream detectors. Ablation studies confirm branch synergy, and Grad-CAM visualizations validate focus on genuine manipulation regions (e.g., facial boundaries). This multi-domain fusion strategy establishes robustness against increasingly sophisticated forgeries.
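Attention-based temporal pooling of this kind is a standard construction; below is a minimal PyTorch sketch, where the two-layer scoring head is an assumption rather than the paper's exact architecture:
```python
import torch
import torch.nn as nn

class AttentionTemporalPooling(nn.Module):
    """Pool per-frame features into one clip descriptor via learned weights."""
    def __init__(self, dim: int):
        super().__init__()
        # Small MLP that scores each frame's "evidence" (assumed form).
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, dim)
```
Frames with low evidence receive near-zero weights, so a handful of manipulated frames can dominate the clip-level descriptor.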
[269] BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction
Zhengsen Xu, Sibo Cheng, Lanying Wang, Hongjie He, Wentao Sun, Jonathan Li, Lincoln Linlin Xu
Main category: cs.CV
TL;DR: A 25-year daily wildfire dataset with 38 multimodal covariates covering 240M hectares in British Columbia, used to benchmark time-series forecasting models including CNN, linear, Transformer, and Mamba architectures.
Details
Motivation: Wildfire risk prediction is challenging due to complex interactions among fuel, weather, topography, and human factors. There's a scarcity of publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers.
Method: Created a comprehensive 25-year daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions with 38 covariates including active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Used this benchmark to evaluate diverse time-series forecasting models (CNN-based, linear-based, Transformer-based, and Mamba-based architectures) and investigated position embedding effectiveness and relative importance of different fire-driving factors.
Result: The dataset and corresponding code are publicly available at https://github.com/SynUW/mmFire, providing a benchmark for wildfire risk prediction research.
Conclusion: The paper addresses the critical gap in wildfire prediction benchmarks by providing a comprehensive multimodal dataset and evaluating state-of-the-art time-series forecasting models, enabling better understanding of fire-driving factors and advancing data-driven wildfire risk prediction.
Abstract: Wildfire risk prediction remains a critical yet challenging task due to the complex interactions among fuel conditions, meteorology, topography, and human activity. Despite growing interest in data-driven approaches, publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers remain scarce. To address this gap, we present a 25-year, daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions. The dataset includes 38 covariates, encompassing active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Using this benchmark, we evaluate a diverse set of time-series forecasting models, including CNN-based, linear-based, Transformer-based, and Mamba-based architectures. We also investigate the effectiveness of position embedding and the relative importance of different fire-driving factors. The dataset and the corresponding code can be found at https://github.com/SynUW/mmFire.
[270] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems
Khalfalla Awedat, Mohamed Abidalrekab, Gurcan Comert, Mustafa Ayad
Main category: cs.CV
TL;DR: SuperiorGAT is a graph attention framework that reconstructs missing elevation data in sparse LiDAR point clouds using beam-aware graphs and gated residual fusion, achieving better reconstruction than existing methods without increasing network complexity.
Details
Motivation: LiDAR perception in autonomous systems faces limitations from fixed vertical beam resolution and beam dropout due to environmental occlusions, creating sparse point clouds that compromise perception accuracy.
Method: Models LiDAR scans as beam-aware graphs, uses graph attention networks with gated residual fusion and feed-forward refinement to reconstruct missing elevation information without increasing network depth.
Result: Achieves lower reconstruction error and better geometric consistency than PointNet-based models and deeper GAT baselines across diverse KITTI environments, with qualitative X-Z projections showing preserved structural integrity and minimal vertical distortion.
Conclusion: Architectural refinement provides a computationally efficient way to improve LiDAR resolution without requiring additional sensor hardware, offering a practical solution for autonomous perception systems.
Abstract: LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model’s ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.
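As a rough illustration of the gated residual fusion the paper pairs with graph attention, here is a minimal sketch; the gating parameterization is an assumption, not the paper's exact design:
```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """Blend a node's input feature with its attention-aggregated update."""
    def __init__(self, dim: int):
        super().__init__()
        # Gate conditioned on both the input and the neighborhood message.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x, attn_out: (num_nodes, dim); the gate decides, per feature,
        # how much of the attention update to keep versus the residual input.
        g = torch.sigmoid(self.gate(torch.cat([x, attn_out], dim=-1)))
        return g * attn_out + (1.0 - g) * x
```
A gate like this lets the network fall back to the raw beam geometry where neighborhood evidence is noisy, without adding depth.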
[271] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
Main category: cs.CV
TL;DR: ∞-RoPE is a training-free framework that enables infinite-horizon video generation with fine-grained action control and cinematic scene transitions by addressing three core limitations of current autoregressive video diffusion models.
Details
Motivation: Current autoregressive video diffusion models suffer from three key limitations: (1) finite temporal horizon due to 3D-RoPE constraints, (2) slow prompt responsiveness for maintaining fine-grained action control during long rollouts, and (3) inability to create discontinuous cinematic transitions within single generations.
Method: ∞-RoPE introduces three interconnected components: Block-Relativistic RoPE (reformulates temporal encoding as moving local reference frames), KV Flush (renews KV cache by retaining only two latent frames for immediate prompt responsiveness), and RoPE Cut (introduces controlled discontinuities for multi-cut scene transitions).
Result: The framework enables continuous video generation beyond positional limits, maintains fine-grained action control without re-encoding, and allows cinematic scene transitions. Comprehensive experiments show ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
Conclusion: ∞-RoPE establishes a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion by addressing three core bottlenecks of current autoregressive video diffusion models through a unified inference-time framework.
Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model’s 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model’s maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
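A toy sketch of the block-relativistic re-indexing, under the reading that each new block is pinned to the top of the base horizon while earlier frames shift backward by the same amount; the function names and exact offset rule are assumptions:
```python
import torch

def block_relativistic_positions(t_global: torch.Tensor, block_start: int,
                                 block_len: int, max_horizon: int) -> torch.Tensor:
    # Pin the current block to [max_horizon - block_len, max_horizon);
    # earlier frames shift backward by the same offset, preserving relative
    # temporal geometry while keeping absolute positions bounded.
    offset = (max_horizon - block_len) - block_start
    return t_global + offset

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary-embedding angles applied to the re-indexed positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]
```
Because only relative positions enter the rotations, generation can continue indefinitely without ever exceeding the pretrained positional range.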
[272] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding
Wenyuan Huang, Zhao Wang, Zhou Wei, Ting Huang, Fang Zhao, Jian Yang, Zhenyu Zhang
Main category: cs.CV
TL;DR: OpenGround is a zero-shot framework for open-world 3D visual grounding that overcomes limitations of pre-defined object lookup tables through active cognition-based reasoning.
Details
Motivation: Existing 3D visual grounding methods rely on pre-defined Object Lookup Tables (OLTs) to query VLMs, which limits applications in scenarios with undefined or unforeseen targets. This restricts the ability to handle open-world scenarios.
Method: Proposes OpenGround with an Active Cognition-based Reasoning (ACR) module that progressively augments VLM cognitive scope through a cognitive task chain. It performs human-like perception of targets and actively reasons about contextually relevant objects via a dynamically updated OLT.
Result: Achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers 17.6% improvement on their new OpenTarget dataset containing 7000+ object-description pairs for open-world evaluation.
Conclusion: OpenGround enables 3D visual grounding in both pre-defined and open-world categories by overcoming fundamental limitations of static OLTs through active cognition-based reasoning, significantly advancing open-world 3D scene understanding.
Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at https://why-102.github.io/openground.io/.
[273] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo
Main category: cs.CV
TL;DR: TALO: A plug-and-play Thin Plate Spline-based framework for long-term temporal consistency in 3D vision foundation models, addressing spatial inconsistencies in online driving scenarios.
Details
Motivation: 3D vision foundation models lack temporal consistency in online settings like driving scenarios, where predictions over temporal windows need alignment. Existing methods have limitations in assumption validity, local alignment scope, and robustness to noisy geometry.
Method: Proposes a higher-DOF alignment framework using Thin Plate Spline with globally propagated control points to correct spatially varying inconsistencies. Uses point-agnostic submap registration design for robustness to noisy geometry predictions.
Result: Extensive experiments show consistent improvements in geometry coherence and lower trajectory errors across multiple datasets, backbone models, and camera setups (monocular/surround-view).
Conclusion: TALO provides a robust, general plug-and-play solution for temporal consistency in 3D vision foundation models, compatible with diverse models and camera configurations.
Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at https://github.com/Xian-Bei/TALO.
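Thin Plate Spline alignment reduces to solving one linear system for the warp coefficients. The sketch below fits and applies a 3D TPS with the kernel U(r) = r, a common 3D choice; the paper's kernel and its control-point propagation scheme are not specified here, so treat this as background math rather than the method itself:
```python
import numpy as np

def fit_tps_3d(ctrl: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Fit a 3D thin-plate-spline warp from (n, 3) control points to targets."""
    n = ctrl.shape[0]
    K = np.linalg.norm(ctrl[:, None, :] - ctrl[None, :, :], axis=-1)  # (n, n) kernel U(r) = r
    P = np.hstack([np.ones((n, 1)), ctrl])                            # (n, 4) affine basis
    A = np.zeros((n + 4, n + 4))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T          # block structure enforces the TPS side conditions
    b = np.zeros((n + 4, 3))
    b[:n] = target
    return np.linalg.solve(A, b)   # (n + 4, 3): RBF weights plus affine part

def apply_tps_3d(coeffs: np.ndarray, ctrl: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Warp arbitrary (m, 3) points with the fitted spline."""
    U = np.linalg.norm(pts[:, None, :] - ctrl[None, :, :], axis=-1)   # (m, n)
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])                  # (m, 4)
    return U @ coeffs[:ctrl.shape[0]] + P @ coeffs[ctrl.shape[0]:]
```
The appeal over a single rigid transform is visible in the signature: the RBF term lets the correction vary smoothly across space instead of being one global rotation and translation.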
[274] SoulX-LiveTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation
Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Ming Tao, Shunshun Yin, Siyuan Liu
Main category: cs.CV
TL;DR: 14B-parameter framework for real-time audio-driven avatar generation achieves sub-second latency and 32 FPS throughput using bidirectional attention distillation and self-correction mechanisms.
Details
Motivation: Existing approaches for real-time audio-driven avatar generation compromise visual fidelity by using strictly unidirectional attention or reducing model capacity, creating a conflict between computational load and latency constraints.
Method: Proposes SoulX-LiveTalk with: 1) Self-correcting Bidirectional Distillation that retains bidirectional attention within video chunks, 2) Multi-step Retrospective Self-Correction Mechanism for error recovery during infinite generation, and 3) full-stack inference acceleration with hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations.
Result: Achieves sub-second start-up latency (0.87s) and real-time throughput of 32 FPS, making it the first 14B-scale system to reach these performance levels for high-fidelity interactive digital human synthesis.
Conclusion: SoulX-LiveTalk sets a new standard for real-time audio-driven avatar generation by balancing computational efficiency with visual fidelity through innovative bidirectional attention preservation and self-correction mechanisms.
Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbf{SoulX-LiveTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism}, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a \textbf{sub-second start-up latency (0.87s)} while reaching a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.
[275] DiRe: Diversity-promoting Regularization for Dataset Condensation
Saumyaranjan Mohanty, Aravind Reddy, Konda Reddy Mopuri
Main category: cs.CV
TL;DR: Proposes Diversity Regularizer (DiRe) for dataset condensation to reduce redundancy and improve diversity in synthesized datasets, enhancing existing state-of-the-art methods.
Details
Motivation: Existing dataset condensation methods produce synthesized datasets with significant redundancy, creating a need to reduce redundancy and improve diversity in the condensed datasets.
Method: Introduces an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance metrics that can be applied off-the-shelf to various state-of-the-art condensation methods.
Result: Extensive experiments show that adding DiRe improves state-of-the-art condensation methods on benchmark datasets from CIFAR-10 to ImageNet-1K, enhancing both generalization and diversity metrics.
Conclusion: The proposed Diversity Regularizer effectively addresses redundancy issues in dataset condensation and can be easily integrated with existing methods to improve performance across various datasets.
Abstract: In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.
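Since the regularizer is described as a combination of cosine similarity and Euclidean distance, a plausible minimal form penalizes pairwise similarity and rewards pairwise spread; the weighting and exact combination below are assumptions, not the paper's loss:
```python
import torch
import torch.nn.functional as F

def diversity_regularizer(synth: torch.Tensor, alpha: float = 1.0,
                          beta: float = 1.0) -> torch.Tensor:
    """Penalize redundancy among (n, d) flattened synthetic samples of a class."""
    n = synth.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=synth.device)
    unit = F.normalize(synth, dim=-1)
    cos_term = (unit @ unit.t())[off_diag].mean()            # mean pairwise cosine similarity
    dist_term = torch.cdist(synth, synth)[off_diag].mean()   # mean pairwise Euclidean distance
    return alpha * cos_term - beta * dist_term               # lower value = more diverse set
```
Added to the condensation objective, a term like this pushes synthetic samples of the same class apart in both angle and magnitude.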
[276] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Shangxun Li, Youngjung Uh
Main category: cs.CV
TL;DR: Training-free approach improves subject consistency in text-to-image diffusion models by refining text embeddings to suppress semantic entanglement, outperforming existing baselines without fine-tuning.
Details
Motivation: Text-to-image diffusion models struggle with subject consistency across multiple outputs for visual storytelling. Existing approaches require computationally expensive fine-tuning or per-subject optimization, while training-free methods like 1Prompt1Story suffer from semantic leakage and text misalignment.
Method: Proposes a simple training-free approach that refines text embeddings from a geometric perspective to suppress unwanted semantics and address semantic entanglement, without requiring model fine-tuning or per-subject optimization.
Result: Extensive experiments show the approach significantly improves both subject consistency and text alignment over existing baselines, demonstrating effectiveness in maintaining subject coherence across generated images.
Conclusion: The geometric refinement of text embeddings provides an effective training-free solution to semantic entanglement in text-to-image generation, enabling better subject consistency for visual storytelling applications.
Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
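The canonical geometric way to suppress an unwanted semantic is an orthogonal projection. The toy sketch below removes one estimated leakage direction from the token embeddings; how that direction is estimated (here assumed given) is where the actual method would do the work:
```python
import torch

def suppress_direction(embeds: torch.Tensor, leak_dir: torch.Tensor) -> torch.Tensor:
    """Project (n, d) token embeddings onto the complement of one direction.

    leak_dir is a hypothetical vector estimated to carry the leaked semantics,
    e.g. a mean difference between entangled frame embeddings.
    """
    d = leak_dir / leak_dir.norm()
    return embeds - (embeds @ d)[..., None] * d  # remove the component along d
```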
[277] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature
Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke
Main category: cs.CV
TL;DR: RxnBench is a new benchmark for evaluating Multimodal Large Language Models on chemical reaction understanding from scientific PDFs, revealing significant gaps in models’ ability to comprehend chemical logic and integrate information across modalities.
Details
Motivation: Current MLLMs are being integrated into chemistry but their ability to understand the dense graphical language of chemical reactions in real scientific literature remains underexplored. There's a need to rigorously evaluate how well these models can comprehend chemical reactions from authentic PDF documents.
Method: Created RxnBench, a multi-tiered benchmark with two tasks: 1) Single-Figure QA (SF-QA) with 1,525 questions from 305 curated reaction schemes testing visual perception and mechanistic reasoning, and 2) Full-Document QA (FD-QA) with 108 articles requiring cross-modal integration of text, schemes, and tables.
Result: MLLMs show a critical capability gap - they excel at extracting explicit text but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning significantly outperform standard architectures, but none achieve 50% accuracy on FD-QA.
Conclusion: There’s an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists, as current MLLMs lack the deep chemical understanding required for authentic scientific literature comprehension.
Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
[278] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
Main category: cs.CV
TL;DR: DAVE is a specialized vision encoder for VLMs designed to address the lack of robust structural/spatial information in existing encoders for document understanding and web agent tasks, using self-supervised pretraining followed by supervised autoregressive training with novel merging and ensemble techniques.
Details
Motivation: Current vision-language models have a fundamental weakness: their vision encoders lack robust structural and spatial information essential for document understanding and web agent tasks, creating a gap that needs to be addressed.
Method: Two-stage training pipeline: 1) Self-supervised pretraining on unlabeled images, 2) Supervised autoregressive pretraining on limited high-quality data for parsing/localization. Uses novel model-merging scheme to combine encoders trained with different text decoders, and ensemble training to fuse features from generalist encoders (SigLIP2) with document/web-specific representations.
Result: Extensive experiments on document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness, establishing DAVE as a strong vision encoder for document and web applications.
Conclusion: DAVE successfully bridges the gap in vision encoders for VLMs, providing robust structural and spatial information essential for document understanding and web agent tasks through innovative training approaches.
Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
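The abstract does not spell out the merging scheme, so the sketch below shows the simplest weight-space variant: weighted parameter averaging of encoders that share an architecture. It is a stand-in for the paper's procedure, not a reproduction of it:
```python
import torch

def merge_encoders(state_dicts, weights=None):
    """Average the parameters of several same-architecture encoder checkpoints.

    state_dicts: list of PyTorch state dicts with identical keys/shapes.
    weights: optional per-checkpoint mixing weights (defaults to uniform).
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}
```
The merged dict can then be loaded with `model.load_state_dict(merged)`, giving one encoder intended to stay compatible with the decoders each checkpoint was trained against.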
[279] ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen
Main category: cs.CV
TL;DR: ProCache: A training-free dynamic feature caching framework for Diffusion Transformers that achieves 1.96-2.90x acceleration with minimal quality loss by using non-uniform caching intervals and selective computation.
Details
Motivation: Diffusion Transformers (DiTs) have state-of-the-art generative performance but high computational costs hinder real-time deployment. Existing feature caching methods use uniform intervals that don't match DiT's non-uniform temporal dynamics, and naive feature reuse with large intervals causes severe error accumulation.
Method: ProCache has two core components: (1) constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to DiT's temporal characteristics; (2) selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead.
Result: Extensive experiments on PixArt-alpha and DiT show ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
Conclusion: ProCache effectively addresses limitations of existing feature caching methods by aligning with DiT’s non-uniform temporal dynamics and mitigating error accumulation through selective computation, enabling efficient real-time deployment of Diffusion Transformers.
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
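A toy sketch of the underlying caching idea: block outputs are recomputed only at steps in a (possibly non-uniform) schedule and reused otherwise. This deliberately omits ProCache's search procedure and its selective computation over deep blocks and high-importance tokens:
```python
from typing import Callable, Iterable, Sequence, Set

import torch

def denoise_with_cache(blocks: Sequence[Callable[[torch.Tensor], torch.Tensor]],
                       x: torch.Tensor,
                       timesteps: Iterable[int],
                       recompute_steps: Set[int]) -> torch.Tensor:
    """Run a block-wise loop that reuses cached block outputs between refreshes."""
    cache = [None] * len(blocks)
    for t in timesteps:
        h = x
        for i, block in enumerate(blocks):
            if t in recompute_steps or cache[i] is None:
                cache[i] = block(h)   # full computation refreshes the cache
            h = cache[i]              # otherwise reuse the (stale) cached output
        x = h                         # stand-in for the sampler's update rule
    return x
```
A non-uniform `recompute_steps` set is exactly what the offline schedule search would produce: dense refreshes where features change fast, sparse ones where they are stable.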
[280] GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting
Tiantian Li, Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Jun Zhang, Yan Wang
Main category: cs.CV
TL;DR: GaussianImage++ improves 2D Gaussian Splatting for image representation/compression with distortion-driven densification, context-aware filters, and efficient quantization, outperforming previous methods with real-time decoding.
Details
Motivation: Implicit neural representations (INRs) require heavy training/memory, while existing 2D Gaussian Splatting methods need too many primitives for high quality. Need to exploit GS potential with limited primitives.
Method: Three key innovations: 1) Distortion-driven densification allocating primitives based on signal intensity; 2) Context-aware Gaussian filters optimizing primitives for different image content; 3) Attribute-separated learnable scalar quantizers with quantization-aware training for compression.
Result: Outperforms GaussianImage and INRs-based COIN in both representation and compression performance while maintaining real-time decoding and low memory usage.
Conclusion: GaussianImage++ successfully leverages limited Gaussian primitives to achieve superior image representation and compression, demonstrating the potential of optimized Gaussian Splatting approaches.
Abstract: Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (\textit{e.g.}, GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.
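One plausible reading of distortion-driven densification is to place new primitives where the current render errs most. A minimal sampling sketch, with the error-to-probability mapping as an assumption:
```python
import torch

def sample_new_centers(render: torch.Tensor, target: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k pixel coordinates with probability proportional to squared error.

    render, target: (C, H, W) images; returns (k, 2) row/column positions
    where new Gaussian primitives could be initialized.
    """
    err = (render - target).pow(2).mean(dim=0).flatten()       # (H*W,) error map
    idx = torch.multinomial(err / err.sum(), k, replacement=False)
    w = target.shape[-1]
    return torch.stack([idx // w, idx % w], dim=-1)
```
Run progressively, this concentrates the limited primitive budget on high-distortion regions instead of spreading it uniformly.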
[281] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models
Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le, Ambrose Ling, Zhuoran Wang, Md Amirul Islam, Zhixiang Chi, Yuanhao Yu
Main category: cs.CV
TL;DR: MIVA is a lightweight modular adapter system for diffusion models that enables precise motion control in image-to-video generation using minimal training data.
Details
Motivation: Diffusion models struggle with image animation due to video data scarcity causing memorization over prompt compliance, and poor generalization to novel motion patterns with limited training data.
Method: Proposes Modular Image-to-Video Adapter (MIVA) - lightweight sub-networks attachable to pre-trained DMs, each capturing a single motion pattern, trainable on ~10 samples with consumer GPUs, scalable via parallelization.
Result: MIVA enables more precise motion control while maintaining or surpassing generation quality of models trained on much larger datasets, without needing prompt engineering.
Conclusion: MIVA addresses key limitations of diffusion models for image animation through modular, data-efficient adapters that provide precise motion control with minimal training requirements.
Abstract: Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.
[282] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration
Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
Main category: cs.CV
TL;DR: SlideChain is a blockchain-based framework for verifying semantic extraction from educational slides using VLMs, addressing reproducibility and auditability issues in AI-generated instructional content.
Details
Motivation: VLMs are increasingly used for educational content but their semantic outputs are hard to verify, reproduce, and audit, especially in high-stakes STEM domains where inconsistencies across models and environments undermine reliability.
Method: Developed SlideChain framework using a curated dataset of 1,117 medical imaging lecture slides. Extracted concepts and relational triples from four state-of-the-art VLMs, created structured provenance records, and anchored cryptographic hashes on a local EVM-compatible blockchain for tamper-evident auditability.
Result: Revealed significant cross-model discrepancies with low concept overlap and near-zero agreement in relational triples. Demonstrated perfect tamper detection, deterministic reproducibility, and evaluated gas usage, throughput, and scalability under simulated deployment conditions.
Conclusion: SlideChain provides a practical, scalable solution for trustworthy multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
Abstract: Modern vision–language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
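The anchoring step reduces to hashing a deterministic serialization of each provenance record. A minimal sketch with an illustrative schema; the paper's actual fields, and the on-chain digest (EVM contracts typically use keccak256 rather than SHA-256), may differ:
```python
import hashlib
import json

def provenance_hash(record: dict) -> str:
    """Compute a tamper-evident digest for one slide's extraction record.

    Serializes the record deterministically (sorted keys, no whitespace),
    then hashes it; the hex digest is what would be anchored on-chain.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative record (field names are hypothetical, not the paper's schema).
rec = {"slide_id": 42, "model": "vlm-a", "concepts": ["CT", "dose"],
       "triples": [["CT", "measures", "attenuation"]]}
print(provenance_hash(rec))  # re-hash later and compare to detect tampering
```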
[283] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild
Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang
Main category: cs.CV
TL;DR: SyncAnyone: A two-stage diffusion framework for high-quality AI video dubbing that improves lip-sync accuracy while maintaining facial structure and background consistency.
Details
Motivation: Existing mask-based training methods for video dubbing disrupt spatiotemporal context, causing instability in facial structure, dynamic motions, and background consistency despite achieving lip-sync accuracy.
Method: Two-stage learning framework: Stage 1 trains diffusion-based video transformer for masked mouth inpainting with audio-driven lip movements; Stage 2 uses mask-free tuning with pseudo-paired training samples from synthesized lip-synced videos to address mask-induced artifacts.
Result: Achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.
Conclusion: SyncAnyone successfully overcomes limitations of mask-based methods by combining accurate motion modeling with high visual fidelity through a novel two-stage approach.
Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we build a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.
[284] Tracking by Predicting 3-D Gaussians Over Time
Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
Main category: cs.CV
TL;DR: Video-GMAE is a self-supervised method that represents videos as moving Gaussian splats, enabling emergent tracking capabilities and achieving state-of-the-art performance on video understanding tasks.
Details
Motivation: The paper aims to develop a self-supervised video representation learning approach that leverages the inherent 3D structure of dynamic scenes. The key insight is that 2D videos often represent consistent projections of 3D scenes, so representing videos as sets of Gaussians moving over time provides a reasonable inductive bias for learning meaningful representations.
Method: Video-GMAE encodes video sequences into sets of Gaussian splats that move over time. The method uses a masked autoencoder architecture where the network learns to reconstruct masked portions of the video by predicting the Gaussian representations. This representation naturally encourages the model to learn tracking behavior as Gaussians follow objects through the video sequence.
Result: The method demonstrates emergent tracking capabilities - mapping Gaussian trajectories onto the image plane yields zero-shot tracking performance comparable to state-of-the-art methods. With finetuning, Video-GMAE achieves 34.6% improvement on Kinetics and 13.1% improvement on Kubric datasets, surpassing existing self-supervised video approaches.
Conclusion: Representing videos as moving Gaussian splats provides an effective inductive bias for self-supervised video representation learning, enabling emergent tracking capabilities and achieving superior performance on video understanding tasks compared to existing methods.
Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
[285] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation
ZhenQi Chen, TsaiChing Ni, YuanFu Yang
Main category: cs.CV
TL;DR: CritiFusion improves text-to-image diffusion models by adding inference-time semantic critique and frequency-domain refinement without retraining, achieving better prompt alignment and visual quality.
Details
Motivation: Current text-to-image diffusion models have high visual fidelity but often fail to maintain semantic alignment with complex prompts, needing better prompt understanding and detail preservation.
Method: Two-stage approach: 1) CritiCore module uses vision-language and large language models to provide semantic feedback and enrich prompt context; 2) SpecFusion merges intermediate generation states in the spectral domain to preserve structural information and high-frequency details. Works as plug-in refinement without additional training.
Result: Significantly improves text-to-image correspondence and visual quality metrics on standard benchmarks. Matches state-of-the-art reward optimization approaches in human preference scores and aesthetic evaluations. Shows superior detail, realism, and prompt fidelity in qualitative results.
Conclusion: CritiFusion effectively addresses semantic alignment issues in text-to-image generation through semantic critique and spectral alignment, offering a practical plug-in solution compatible with existing diffusion models.
Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt’s intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
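The abstract leaves the exact spectral merge rule unspecified; one plausible minimal reading of SpecFusion is a radial low-pass/high-pass split between two intermediate states. In the sketch below, the cutoff value and the disc-shaped mask are my assumptions:

```python
import torch

def spectral_merge(coarse, detailed, cutoff=0.25):
    """Keep low frequencies (global structure) from `coarse` and
    high frequencies (fine detail) from `detailed`. Shapes: (C, H, W)."""
    Fc = torch.fft.fftshift(torch.fft.fft2(coarse), dim=(-2, -1))
    Fd = torch.fft.fftshift(torch.fft.fft2(detailed), dim=(-2, -1))
    _, H, W = coarse.shape
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
    low = (radius <= cutoff * min(H, W) / 2).to(Fc.dtype)  # centered disc mask
    merged = Fc * low + Fd * (1 - low)
    return torch.fft.ifft2(torch.fft.ifftshift(merged, dim=(-2, -1))).real
```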
[286] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin
Main category: cs.CV
TL;DR: Adaptive visual token pruning method for Large Multimodal Models that reduces up to 80% of visual tokens while maintaining performance in long context, multi-image settings.
Details
Motivation: Existing visual token pruning methods overlook scenarios with long context inputs containing multiple images, where the growing number of visual tokens greatly increases inference costs for Large Multimodal Models.
Method: Two-stage adaptive pruning: 1) Intra-image stage allocates content-aware token budgets and greedily selects most representative tokens per image; 2) Inter-image stage performs global diversity filtering and Pareto selection balancing diversity with text alignment.
Result: Extensive experiments show the approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.
Conclusion: The proposed adaptive pruning method effectively addresses the challenges of visual token pruning in long context, multi-image scenarios by decomposing redundancy into intra-image and inter-image components and dynamically allocating token budgets.
Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach can reduce up to 80% of visual tokens while maintaining performance in long context settings.
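The paper does not spell out the greedy intra-image selection rule; a common choice consistent with "most representative tokens" is farthest-point sampling under cosine distance, sketched here as an assumption (the salience-based seed is also mine):

```python
import torch

def greedy_select(tokens, budget):
    """tokens: (N, D) visual tokens for one image; pick `budget` diverse ones."""
    sims = torch.nn.functional.normalize(tokens, dim=-1)
    chosen = [int(tokens.norm(dim=-1).argmax())]       # seed: highest-norm token
    min_dist = 1 - sims @ sims[chosen[0]]              # cosine distance to chosen set
    for _ in range(budget - 1):
        nxt = int(min_dist.argmax())                   # farthest from current set
        chosen.append(nxt)
        min_dist = torch.minimum(min_dist, 1 - sims @ sims[nxt])
    return tokens[chosen]
```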
[287] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
Main category: cs.CV
TL;DR: ColaVLA is a vision-language-action framework for autonomous driving that transfers VLM reasoning to latent space and uses hierarchical parallel planning for efficient, safe trajectory generation.
Details
Motivation: Current VLM-based autonomous driving planners face challenges: mismatch between discrete text reasoning and continuous control, high latency from autoregressive decoding, and inefficient/non-causal planners limiting real-time deployment.
Method: Two main components: 1) Cognitive Latent Reasoner compresses scene understanding into decision-oriented meta-action embeddings via ego-adaptive selection with only two VLM forward passes; 2) Hierarchical Parallel Planner generates multi-scale, causality-consistent trajectories in a single forward pass.
Result: Achieves state-of-the-art performance on nuScenes benchmark in both open-loop and closed-loop settings with favorable efficiency and robustness.
Conclusion: ColaVLA preserves VLM generalization and interpretability while enabling efficient, accurate, and safe trajectory generation for autonomous driving through unified latent space reasoning and parallel planning.
Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
[288] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects
Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
Main category: cs.CV
TL;DR: PoseStreamer: A multi-modal 6DoF pose estimation framework using event cameras for high-speed moving objects, featuring temporal consistency, 2D tracking priors, and geometric refinement.
Details
Motivation: Standard RGB cameras struggle with motion blur in high-speed, low-light scenarios for 6DoF pose estimation. Event cameras offer high temporal resolution but current methods have suboptimal performance for fast-moving objects.
Method: Three core components: 1) Adaptive Pose Memory Queue for temporal consistency using historical orientation cues, 2) Object-centric 2D Tracker providing strong 2D priors to boost 3D center recall, 3) Ray Pose Filter for geometric refinement along camera rays. Also introduces MoCapCube6D dataset for benchmarking rapid motion.
Result: Achieves superior accuracy in high-speed moving scenarios and exhibits strong generalizability as a template-free framework for unseen moving objects.
Conclusion: PoseStreamer effectively addresses 6DoF pose estimation challenges in high-speed scenarios using event cameras, demonstrating robust performance and generalization capabilities for fast-moving objects.
Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object-motion scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
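The Adaptive Pose Memory Queue is described only at a high level. A minimal sketch of the data structure follows, with a normalized-mean quaternion prior standing in for whatever aggregation the paper actually uses; that averaging is a crude chordal approximation and is my assumption:

```python
from collections import deque
import numpy as np

class PoseMemoryQueue:
    """Fixed-length history of orientations used as a temporal-consistency prior.
    The normalized mean of quaternions is only a rough chordal average and
    assumes all quaternions lie in the same hemisphere."""
    def __init__(self, maxlen=8):
        self.buf = deque(maxlen=maxlen)

    def push(self, quat):
        self.buf.append(np.asarray(quat, dtype=float))

    def prior(self):
        if not self.buf:
            return None
        q = np.mean(self.buf, axis=0)
        return q / np.linalg.norm(q)
```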
[289] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu
Main category: cs.CV
TL;DR: YOLO-Master introduces instance-conditional adaptive computation for real-time object detection using Efficient Sparse Mixture-of-Experts to dynamically allocate resources based on scene complexity.
Details
Motivation: Current YOLO-like models use static dense computation that applies uniform processing to all inputs, causing computational redundancy on simple scenes and suboptimal performance on complex ones due to misallocated resources.
Method: Proposes YOLO-Master framework with Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources per input based on scene complexity. Uses lightweight dynamic routing network with diversity enhancing objective to encourage complementary expertise among experts and activate only relevant experts during inference.
Result: Achieves 42.4% AP with 1.62ms latency on MS COCO, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Gains are most significant on challenging dense scenes while maintaining efficiency on typical inputs and real-time speed.
Conclusion: YOLO-Master successfully addresses computational redundancy in RTOD through adaptive computation, achieving superior performance-efficiency trade-off, especially for complex scenes, while maintaining real-time inference capabilities.
Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources: over-allocating to trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
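A minimal sketch of the sparse-MoE routing idea: a lightweight router picks the top-k experts per input so only those run. Layer sizes, the pooling granularity, and the diversity objective are omitted or assumed here; this is not the paper's ES-MoE block:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim, n_experts=4, k=1):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # lightweight routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (B, dim) pooled features
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):                    # only the selected experts run
            for e in idx[:, j].unique():
                sel = idx[:, j] == e
                out[sel] += weights[sel, j:j + 1] * self.experts[int(e)](x[sel])
        return out
```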
[290] Visual Language Hypothesis
Xiu Li
Main category: cs.CV
TL;DR: Visual representation learning requires semantic abstraction through topological transformation, not just smooth deformation, with specific architectural demands for discrete semantic state formation.
Details
Motivation: To understand visual representation learning from structural and topological perspectives, starting from the hypothesis that visual understanding requires a semantic language where many perceptual observations map to few discrete semantic states.
Method: Theoretical analysis using fiber bundle structure: visual observation space organized as fiber bundle (nuisance variation in fibers, semantics in quotient base space). Derives consequences about semantic invariance requiring non-homeomorphic discriminative targets and architectural demands for topology change.
Result: Two key theoretical findings: 1) The semantic quotient X/G is not a submanifold and cannot be obtained through smooth deformation alone; it requires discriminative supervision. 2) Semantic abstraction demands a specific architecture supporting an “expand and snap” process for topology change.
Conclusion: Visual representation learning fundamentally requires topological transformation for semantic abstraction, aligning with empirical regularities in large-scale models and statistical learning theory principles. The framework provides interpretive topological lens rather than prescriptive guidance.
Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target (for example, supervision via labels, cross-instance identification, or multimodal alignment) that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
[291] DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: DriveLaW is a unified paradigm that integrates video generation (world modeling) and motion planning for autonomous driving, achieving state-of-the-art performance in both tasks through shared latent representations.
Details
Motivation: Current autonomous driving approaches treat world models and motion planning as decoupled processes, creating a gap between scenario prediction and trajectory planning. The authors aim to bridge this gap by creating inherent consistency between high-fidelity future generation and reliable trajectory planning.
Method: DriveLaW consists of two core components: DriveLaW-Video (a powerful world model that generates high-fidelity forecasting with expressive latent representations) and DriveLaW-Act (a diffusion planner that generates consistent trajectories from DriveLaW-Video’s latent representations). Both components are optimized using a three-stage progressive training strategy.
Result: DriveLaW achieves new state-of-the-art results: 1) Advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD; 2) Achieves a new record on the NAVSIM planning benchmark.
Conclusion: The unified paradigm of DriveLaW successfully bridges the gap between world modeling and motion planning in autonomous driving, demonstrating that direct injection of video generator latent representations into planners ensures inherent consistency and improves performance across both tasks.
Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
[292] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
Kang Du, Yirui Guan, Zeyu Wang
Main category: cs.CV
TL;DR: IDT is a feed-forward transformer framework for multi-view intrinsic image decomposition that produces view-consistent diffuse reflectance, diffuse shading, and specular shading without iterative sampling.
Details
Motivation: RGB images entangle material properties, illumination, and view-dependent effects, making intrinsic decomposition fundamental for visual understanding. While recent diffusion-based methods work for single-view decomposition, they struggle with multi-view settings, leading to severe view inconsistency.
Method: IDT uses transformer-based attention to jointly reason over multiple input images in a single forward pass. It adopts a physically grounded image formation model that explicitly decomposes images into three components: diffuse reflectance, diffuse shading, and specular shading, separating Lambertian and non-Lambertian light transport.
Result: Experiments on synthetic and real-world datasets show IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, better-isolated specular components, and substantially improves multi-view consistency compared to prior methods.
Conclusion: IDT provides an effective feed-forward framework for multi-view intrinsic decomposition that produces interpretable and controllable material/illumination separation with strong view consistency, advancing beyond single-view approaches.
Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
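The three predicted factors imply a standard per-pixel diffuse/specular split; in the notation below (symbol names are my labels, not necessarily the paper's, with ⊙ denoting element-wise product):

```latex
% R_d: diffuse reflectance (albedo), S_d: diffuse shading, S_s: specular shading
I \;=\; \underbrace{R_d \odot S_d}_{\text{Lambertian term}} \;+\; \underbrace{S_s}_{\text{non-Lambertian term}}
```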
cs.AI
[293] The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models
Rahul Baxi
Main category: cs.AI
TL;DR: The paper introduces DDFT (Drill-Down and Fabricate Test) to measure epistemic robustness - how well models maintain factual accuracy under semantic compression and adversarial fabrication, finding that robustness is orthogonal to model size/architecture and depends on verification mechanisms.
Details
Motivation: Current evaluations (MMLU, TruthfulQA) measure knowledge under ideal conditions but not robustness under stress. They can't distinguish models that lack knowledge from those whose verification mechanisms fail when information degrades or adversaries probe weaknesses.
Method: Introduces DDFT protocol measuring epistemic robustness through progressive semantic compression and adversarial fabrication. Proposes two-system cognitive model: Semantic System (generates text) and Epistemic Verifier (validates facts). Evaluated 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations).
Result: Epistemic robustness is orthogonal to conventional design paradigms: neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) predicts robustness. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007). Flagship models show brittleness despite scale, while smaller models can achieve robust performance.
Conclusion: Robustness emerges from training methodology and verification mechanisms distinct from current approaches. DDFT provides theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications, challenging assumptions about model size-reliability relationship.
Abstract: Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. Static benchmarks like MMLU and TruthfulQA cannot distinguish a model that lacks knowledge from one whose verification mechanisms collapse when information degrades or adversaries probe for weaknesses. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures epistemic robustness: a model’s ability to maintain factual accuracy under progressive semantic compression and adversarial fabrication. We propose a two-system cognitive model comprising a Semantic System that generates fluent text and an Epistemic Verifier that validates factual accuracy. Our findings, based on evaluating 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations), reveal that epistemic robustness is orthogonal to conventional design paradigms. Neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) significantly predicts robustness, suggesting it emerges from training methodology and verification mechanisms distinct from current approaches. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007), indicating this is the critical bottleneck. We find that flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance, challenging assumptions about the relationship between model size and reliability. The DDFT framework provides both theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications.
[294] CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution
Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, Gerbrand Ceder
Main category: cs.AI
TL;DR: CASCADE is a self-evolving LLM agent framework that transitions from tool use to skill acquisition, enabling agents to master complex scientific tools through continuous learning and self-reflection, achieving 93.3% success on materials science/chemistry tasks.
Details
Motivation: Current LLM agents rely on predefined tools or brittle tool generation, limiting their capability and adaptability to complex scientific tasks. There's a need for agents that can evolve and acquire skills rather than just use tools.
Method: CASCADE enables agents to master complex external tools through two meta-skills: 1) continuous learning via web search and code extraction, and 2) self-reflection via introspection and knowledge graph exploration. The framework includes human-agent collaboration and memory consolidation.
Result: On SciSkillBench (116 materials science and chemistry tasks), CASCADE achieves 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. Demonstrated real-world applications in computational analysis, autonomous lab experiments, and paper reproduction.
Conclusion: CASCADE represents a transition from “LLM + tool use” to “LLM + skill acquisition”, accumulating executable skills that can be shared across agents and scientists, moving toward scalable AI-assisted scientific research.
Abstract: Large language model (LLM) agents currently depend on predefined tools or brittle tool generation, constraining their capability and adaptability to complex scientific tasks. We introduce CASCADE, a self-evolving agentic framework representing an early instantiation of the transition from “LLM + tool use” to “LLM + skill acquisition”. CASCADE enables agents to master complex external tools and codify knowledge through two meta-skills: continuous learning via web search and code extraction, and self-reflection via introspection and knowledge graph exploration, among others. We evaluate CASCADE on SciSkillBench, a benchmark of 116 materials science and chemistry research tasks. CASCADE achieves a 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. We further demonstrate real-world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers. Along with human-agent collaboration and memory consolidation, CASCADE accumulates executable skills that can be shared across agents and scientists, moving toward scalable AI-assisted scientific research.
[295] SCP: Accelerating Discovery with a Global Web of Autonomous Scientific Agents
Yankai Jiang, Wenjie Lou, Lilong Wang, Zhenyu Tang, Shiyang Feng, Jiaxuan Lu, Haoran Sun, Yaning Pan, Shuang Gu, Haoyang Su, Feng Liu, Wangxu Wei, Pan Tan, Dongzhan Zhou, Fenghua Ling, Cheng Tan, Bo Zhang, Xiaosong Wang, Lei Bai, Bowen Zhou
Main category: cs.AI
TL;DR: SCP (Science Context Protocol) is an open-source standard enabling global autonomous scientific agents through unified resource integration and orchestrated experiment lifecycle management.
Details
Motivation: To accelerate scientific discovery by enabling seamless collaboration between AI agents and human researchers across disparate platforms and institutional boundaries, reducing integration overhead and enhancing reproducibility.
Method: Built on two pillars: (1) Unified Resource Integration - a universal specification for describing/invoking scientific resources (tools, models, datasets, instruments); (2) Orchestrated Experiment Lifecycle Management - a secure service architecture with centralized SCP Hub and federated SCP Servers managing registration, planning, execution, monitoring, and archival.
Result: Developed a scientific discovery platform with over 1,600 tool resources; facilitates secure, large-scale collaboration between heterogeneous AI systems and human researchers; significantly reduces integration overhead and enhances reproducibility.
Conclusion: SCP establishes essential infrastructure for scalable, multi-institution, agent-driven science by standardizing scientific context and tool orchestration at the protocol level.
Abstract: We introduce SCP: the Science Context Protocol, an open-source standard designed to accelerate discovery by enabling a global network of autonomous scientific agents. SCP is built on two foundational pillars: (1) Unified Resource Integration: At its core, SCP provides a universal specification for describing and invoking scientific resources, spanning software tools, models, datasets, and physical instruments. This protocol-level standardization enables AI agents and applications to discover, call, and compose capabilities seamlessly across disparate platforms and institutional boundaries. (2) Orchestrated Experiment Lifecycle Management: SCP complements the protocol with a secure service architecture, which comprises a centralized SCP Hub and federated SCP Servers. This architecture manages the complete experiment lifecycle (registration, planning, execution, monitoring, and archival), enforces fine-grained authentication and authorization, and orchestrates traceable, end-to-end workflows that bridge computational and physical laboratories. Based on SCP, we have constructed a scientific discovery platform that offers researchers and agents a large-scale ecosystem of more than 1,600 tool resources. Across diverse use cases, SCP facilitates secure, large-scale collaboration between heterogeneous AI systems and human researchers while significantly reducing integration overhead and enhancing reproducibility. By standardizing scientific context and tool orchestration at the protocol level, SCP establishes essential infrastructure for scalable, multi-institution, agent-driven science.
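For intuition, here is a hypothetical resource descriptor of the kind the Unified Resource Integration pillar implies. The real SCP schema is defined by the protocol spec, so every field name and the endpoint URL below are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class SCPResource:
    """Hypothetical descriptor; the actual schema lives in the SCP spec."""
    name: str
    kind: str                                    # "tool" | "model" | "dataset" | "instrument"
    endpoint: str                                # where an SCP Server exposes it
    inputs: dict = field(default_factory=dict)   # parameter name -> type/description
    outputs: dict = field(default_factory=dict)

xrd = SCPResource(name="xrd_analysis", kind="tool",
                  endpoint="https://hub.example/scp/xrd",
                  inputs={"pattern_file": "path"},
                  outputs={"phases": "list[str]"})
```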
[296] A Proof-of-Concept for Explainable Disease Diagnosis Using Large Language Models and Answer Set Programming
Ioanna Gemou, Evangelos Lamprou
Main category: cs.AI
TL;DR: McCoy framework combines LLMs and Answer Set Programming for automated disease diagnosis from medical literature and patient data.
Details
Motivation: Symbolic AI in healthcare requires significant manual effort to build knowledge bases, limiting adoption despite the importance of accurate disease prediction for timely intervention and treatment.
Method: McCoy uses LLMs to translate medical literature into ASP code, combines it with patient data, and processes it through an ASP solver to generate final diagnoses.
Result: Preliminary results show strong performance on small-scale disease diagnosis tasks, creating a robust and interpretable prediction framework.
Conclusion: The integration of LLMs with ASP overcomes the knowledge base construction barrier, yielding an interpretable diagnostic system that leverages both symbolic reasoning and language model capabilities.
Abstract: Accurate disease prediction is vital for timely intervention, effective treatment, and reducing medical complications. While symbolic AI has been applied in healthcare, its adoption remains limited due to the effort required for constructing high-quality knowledge bases. This work introduces McCoy, a framework that combines Large Language Models (LLMs) with Answer Set Programming (ASP) to overcome this barrier. McCoy orchestrates an LLM to translate medical literature into ASP code, combines it with patient data, and processes it using an ASP solver to arrive at the final diagnosis. This integration yields a robust, interpretable prediction framework that leverages the strengths of both paradigms. Preliminary results show McCoy has strong performance on small-scale disease diagnosis tasks.
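A toy end-to-end illustration of the pipeline's final stage, assuming the clingo Python bindings for ASP solving; the rules stand in for what the LLM would emit from medical literature and are purely illustrative, not McCoy's actual knowledge base:

```python
from clingo import Control  # pip install clingo

# Hypothetical LLM-generated rules plus patient facts.
PROGRAM = """
diagnosis(flu)     :- symptom(fever), symptom(cough), not symptom(rash).
diagnosis(measles) :- symptom(fever), symptom(rash).
symptom(fever). symptom(cough).
#show diagnosis/1.
"""

ctl = Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Answer set:", m))  # -> diagnosis(flu)
```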
[297] SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury
Main category: cs.AI
TL;DR: SPARK is a multi-agent LLM framework for personalized search using specialized persona-based agents that collaborate through structured communication to deliver emergent personalization.
Details
Motivation: Current search systems struggle with static profiles and monolithic pipelines that cannot capture users' evolving, multi-dimensional information needs. There's a need for systems that can model the complexity, fluidity, and context sensitivity of human information-seeking behavior.
Method: SPARK uses coordinated persona-based LLM agents with formalized persona spaces (role, expertise, task context, domain). A Persona Coordinator dynamically activates relevant specialized agents for queries. Each agent has independent retrieval-augmented generation with dedicated memory stores and context-aware reasoning. Agents collaborate through structured communication protocols including shared memory repositories, iterative debate, and relay-style knowledge transfer.
Result: The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution. It incorporates adaptive learning mechanisms for continuous persona refinement and provides insights for next-generation search systems.
Conclusion: SPARK demonstrates how emergent personalization properties can arise from distributed agent behaviors governed by minimal coordination rules, integrating fine-grained agent specialization with cooperative retrieval to capture complex human information-seeking behavior.
Abstract: Personalized search demands the ability to model users’ evolving, multi-dimensional information needs; a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.
[298] ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment
Natchaya Temyingyong, Daman Jain, Neeraj Kumarsahu, Prabhat Kumar, Rachata Phondi, Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn
Main category: cs.AI
TL;DR: ROAD is a novel framework for automatic prompt optimization that uses multi-agent debugging instead of labeled datasets, achieving significant performance improvements with minimal iterations.
Details
Motivation: Current APO methods require large labeled datasets which are unavailable during cold start of agent development. Real-world software engineering faces messy production logs and evolving failure modes without curated datasets.
Method: ROAD uses a specialized multi-agent architecture: Analyzer for root-cause analysis, Optimizer for pattern aggregation, and Coach for strategy integration. It converts unstructured failure logs into structured Decision Tree Protocols, treating optimization as dynamic debugging rather than stochastic search.
Result: ROAD achieved 5.6% increase in success rate (73.6% to 79.2%) and 3.8% increase in search accuracy within three automated iterations. On complex retail reasoning tasks, it improved agent performance by ~19% relative to baseline.
Conclusion: Mimicking human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.
Abstract: Automatic Prompt Optimization (APO) has emerged as a critical technique for enhancing Large Language Model (LLM) performance, yet current state-of-the-art methods typically rely on large, labeled gold-standard development sets to compute fitness scores for evolutionary or Reinforcement Learning (RL) approaches. In real-world software engineering, however, such curated datasets are rarely available during the initial cold start of agent development, where engineers instead face messy production logs and evolving failure modes. We present ROAD (Reflective Optimization via Automated Debugging), a novel framework that bypasses the need for refined datasets by treating optimization as a dynamic debugging investigation rather than a stochastic search. Unlike traditional mutation strategies, ROAD utilizes a specialized multi-agent architecture, comprising an Analyzer for root-cause analysis, an Optimizer for pattern aggregation, and a Coach for strategy integration, to convert unstructured failure logs into robust, structured Decision Tree Protocols. We evaluated ROAD across both a standardized academic benchmark and a live production Knowledge Management engine. Experimental results demonstrate that ROAD is highly sample-efficient, achieving a 5.6 percent increase in success rate (73.6 percent to 79.2 percent) and a 3.8 percent increase in search accuracy within just three automated iterations. Furthermore, on complex reasoning tasks in the retail domain, ROAD improved agent performance by approximately 19 percent relative to the baseline. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive RL training for deploying reliable LLM agents.
[299] LoongFlow: Directed Evolutionary Search via a Cognitive Plan-Execute-Summarize Paradigm
Chunhui Wan, Xunan Dai, Zhuo Wang, Minglei Li, Yanpeng Wang, Yinan Mao, Yu Lan, Zhiwen Xiao
Main category: cs.AI
TL;DR: LoongFlow is a self-evolving agent framework that integrates LLMs into evolutionary search through cognitive reasoning, achieving state-of-the-art solution quality with 60% better efficiency than existing methods.
Details
Motivation: Traditional evolutionary approaches for transitioning LLMs to self-improving agents lack structured reasoning, suffer from premature convergence, and inefficiently explore high-dimensional code spaces.
Method: Integrates LLMs into a cognitive “Plan-Execute-Summarize” (PES) paradigm, and uses a hybrid evolutionary memory system combining Multi-Island models with MAP-Elites and adaptive Boltzmann selection to balance the exploration-exploitation trade-off.
Result: Outperforms leading baselines (OpenEvolve, ShinkaEvolve) by up to 60% in evolutionary efficiency while discovering superior solutions on AlphaEvolve benchmark and Kaggle competitions.
Conclusion: LoongFlow represents a substantial advancement in autonomous scientific discovery, enabling expert-level solution generation with reduced computational overhead through cognitive evolutionary search.
Abstract: The transition from static Large Language Models (LLMs) to self-improving agents is hindered by the lack of structured reasoning in traditional evolutionary approaches. Existing methods often struggle with premature convergence and inefficient exploration in high-dimensional code spaces. To address these challenges, we introduce LoongFlow, a self-evolving agent framework that achieves state-of-the-art solution quality with significantly reduced computational costs. Unlike “blind” mutation operators, LoongFlow integrates LLMs into a cognitive “Plan-Execute-Summarize” (PES) paradigm, effectively mapping the evolutionary search to a reasoning-heavy process. To sustain long-term architectural coherence, we incorporate a hybrid evolutionary memory system. By synergizing Multi-Island models with MAP-Elites and adaptive Boltzmann selection, this system theoretically balances the exploration-exploitation trade-off, maintaining diverse behavioral niches to prevent optimization stagnation. We instantiate LoongFlow with a General Agent for algorithmic discovery and an ML Agent for pipeline optimization. Extensive evaluations on the AlphaEvolve benchmark and Kaggle competitions demonstrate that LoongFlow outperforms leading baselines (e.g., OpenEvolve, ShinkaEvolve) by up to 60% in evolutionary efficiency while discovering superior solutions. LoongFlow marks a substantial step forward in autonomous scientific discovery, enabling the generation of expert-level solutions with reduced computational overhead.
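The adaptive Boltzmann selection over a MAP-Elites-style archive can be sketched directly; the archive layout and temperature handling below are assumptions:

```python
import numpy as np

def boltzmann_parent(archive, temperature=1.0, rng=None):
    """archive: dict mapping behavior-descriptor -> (candidate, fitness).
    Softmax sampling over elite fitness; higher temperature explores more."""
    rng = rng or np.random.default_rng()
    keys = list(archive)
    fit = np.array([archive[k][1] for k in keys], dtype=float)
    z = (fit - fit.max()) / max(temperature, 1e-8)  # stabilized softmax
    p = np.exp(z)
    p /= p.sum()
    return archive[keys[rng.choice(len(keys), p=p)]][0]
```

Keeping one elite per behavioral niche (the MAP-Elites part) is what preserves diversity; the temperature then trades off exploiting high-fitness niches against exploring the rest.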
[300] CogRec: A Cognitive Recommender Agent Fusing Large Language Models and Soar for Explainable Recommendation
Jiaxin Hu, Tao Wang, Bingsan Yang, Hongrun Wang
Main category: cs.AI
TL;DR: CogRec is a cognitive recommender agent that combines LLMs with the Soar cognitive architecture to address LLM limitations (black-box nature, hallucination, limited online learning) while leveraging Soar’s structured reasoning and LLM’s knowledge initialization capabilities.
Details
Motivation: LLMs have strong preference understanding for recommendations but suffer from black-box characteristics, knowledge hallucination, and limited online learning capacity, compromising trustworthiness and adaptability. Meanwhile, cognitive architectures like Soar offer structured, interpretable reasoning but have laborious knowledge acquisition.
Method: CogRec synergizes LLMs with Soar: uses Soar as symbolic reasoning engine, LLM for knowledge initialization to populate working memory with production rules. Operates on Perception-Cognition-Action cycle, dynamically queries LLM at impasses, transforms LLM solutions into new symbolic production rules via Soar’s chunking mechanism for online learning.
Result: Extensive evaluations on three public datasets show CogRec demonstrates significant advantages in recommendation accuracy, explainability, and efficacy in addressing the long-tail problem.
Conclusion: The proposed CogRec agent successfully combines LLMs’ knowledge capabilities with Soar’s structured reasoning to create a trustworthy, adaptable recommender system with interpretable rationales and continuous online learning.
Abstract: Large Language Models (LLMs) have demonstrated a remarkable capacity in understanding user preferences for recommendation systems. However, they are constrained by several critical challenges, including their inherent “Black-Box” characteristics, susceptibility to knowledge hallucination, and limited online learning capacity. These factors compromise their trustworthiness and adaptability. Conversely, cognitive architectures such as Soar offer structured and interpretable reasoning processes, yet their knowledge acquisition is notoriously laborious. To address these complementary challenges, we propose a novel cognitive recommender agent called CogRec which synergizes the strengths of LLMs with the Soar cognitive architecture. CogRec leverages Soar as its core symbolic reasoning engine and an LLM for knowledge initialization to populate its working memory with production rules. The agent operates on a Perception-Cognition-Action (PCA) cycle. Upon encountering an impasse, it dynamically queries the LLM to obtain a reasoned solution. This solution is subsequently transformed into a new symbolic production rule via Soar’s chunking mechanism, thereby enabling robust online learning. This learning paradigm allows the agent to continuously evolve its knowledge base and furnish highly interpretable rationales for its recommendations. Extensive evaluations conducted on three public datasets show that CogRec delivers significant advantages in recommendation accuracy, explainability, and efficacy in addressing the long-tail problem.
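A toy version of the impasse-driven chunking loop, with a dictionary standing in for Soar's production memory; the real Soar integration is far richer, so treat this only as a sketch of the control flow:

```python
class ChunkingLoop:
    """Minimal Perception-Cognition-Action loop: fire a cached production rule
    if one matches, otherwise query the LLM and chunk its answer into a rule."""
    def __init__(self, llm_fn):
        self.rules = {}           # condition -> recommendation (chunked knowledge)
        self.llm_fn = llm_fn      # stand-in for the LLM call at an impasse

    def act(self, state):
        if state in self.rules:                # cognition: symbolic rule fires
            return self.rules[state]
        recommendation = self.llm_fn(state)    # impasse: fall back to the LLM
        self.rules[state] = recommendation     # chunking = online learning
        return recommendation
```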
[301] One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms
Zijian Zhao, Sen Li
Main category: cs.AI
TL;DR: Proposes two novel MARL methods (GRPO and OSPO) for AV order dispatch that bypass value function estimation by leveraging fleet homogeneity, achieving better performance than conventional approaches.
Details
Motivation: Conventional MARL approaches for ride-sharing order dispatch rely on accurate value function estimation, which becomes problematic in large-scale, uncertain environments with many vehicles, passengers, and orders.
Method: Two methods: 1) GRPO adapts Group Relative Policy Optimization to order dispatch by replacing PPO baseline with group average reward-to-go; 2) OSPO trains optimal policy using only one-step group rewards under homogeneous fleet assumption.
Result: Both GRPO and OSPO achieve promising performance across all scenarios, optimizing pickup times and number of served orders using simple MLP networks. OSPO outperforms GRPO in all scenarios by eliminating bias from bounded time horizon.
Conclusion: The proposed methods effectively address limitations of conventional MARL in large-scale ride-sharing systems by bypassing value function estimation and leveraging fleet homogeneity, with OSPO showing superior performance.
Abstract: Order dispatch is a critical task in ride-sharing systems with Autonomous Vehicles (AVs), directly influencing efficiency and profits. Recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to this problem by decomposing the large state and action spaces among individual agents, effectively addressing the Curse of Dimensionality (CoD) in the transportation market, which is caused by the substantial number of vehicles, passengers, and orders. However, conventional MARL-based approaches heavily rely on accurate estimation of the value function, which becomes problematic in large-scale, highly uncertain environments. To address this issue, we propose two novel methods that bypass value function estimation, leveraging the homogeneous property of AV fleets. First, we draw an analogy between AV fleets and groups in Group Relative Policy Optimization (GRPO), adapting it to the order dispatch task. By replacing the Proximal Policy Optimization (PPO) baseline with the group average reward-to-go, GRPO eliminates critic estimation errors and reduces training bias. Inspired by this baseline replacement, we further propose One-Step Policy Optimization (OSPO), demonstrating that the optimal policy can be trained using only one-step group rewards under a homogeneous fleet. Experiments on a real-world ride-hailing dataset show that both GRPO and OSPO achieve promising performance across all scenarios, efficiently optimizing pickup times and the number of served orders using simple Multilayer Perceptron (MLP) networks. Furthermore, OSPO outperforms GRPO in all scenarios, attributed to its elimination of bias caused by the bounded time horizon of GRPO. Our code, trained models, and processed data are provided at https://github.com/RS2002/OSPO .
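The core of both methods is a critic-free advantage. A minimal sketch under the homogeneous-fleet assumption (function names are mine; the normalization in the first function follows the usual GRPO recipe):

```python
import numpy as np

def group_relative_advantage(returns):
    """returns: (G,) reward-to-go for each vehicle in a homogeneous fleet.
    The group average replaces the learned critic as the baseline."""
    r = np.asarray(returns, dtype=float)
    adv = r - r.mean()
    return adv / (r.std() + 1e-8)      # normalized, GRPO-style

def ospo_advantage(one_step_rewards):
    """OSPO: under fleet homogeneity, immediate one-step group rewards suffice."""
    r = np.asarray(one_step_rewards, dtype=float)
    return r - r.mean()
```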
[302] Graph-Based Exploration for ARC-AGI-3 Interactive Reasoning Tasks
Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley
Main category: cs.AI
TL;DR: A training-free graph-based approach for interactive reasoning in ARC-AGI-3 benchmark, using visual frame processing and systematic state-space exploration with graph representations, outperforming state-of-the-art LLMs.
Details
Motivation: Current state-of-the-art LLMs fail to reliably solve interactive reasoning tasks in ARC-AGI-3 benchmark, which requires hypothesis formation, testing, and tracking discovered mechanics in game-like environments with increasing complexity.
Method: Combines vision-based frame processing with systematic state-space exploration using graph-structured representations. Segments visual frames into components, prioritizes actions based on visual salience, maintains directed graph of explored states and transitions, and tracks visited states to prioritize shortest paths to untested state-action pairs.
Result: Solves median of 30 out of 52 levels across six games on ARC-AGI-3 Preview Challenge, ranking 3rd on private leaderboard, substantially outperforming frontier LLM-based agents.
Conclusion: Explicit graph-structured exploration without learning serves as strong baseline for interactive reasoning, demonstrating importance of systematic state tracking and action prioritization in sparse-feedback environments where LLMs fail to capture task dynamics.
Abstract: We present a training-free graph-based approach for solving interactive reasoning tasks in the ARC-AGI-3 benchmark. ARC-AGI-3 comprises game-like tasks where agents must infer task mechanics through limited interactions, and adapt to increasing complexity as levels progress. Success requires forming hypotheses, testing them, and tracking discovered mechanics. The benchmark has revealed that state-of-the-art LLMs are currently incapable of reliably solving these tasks. Our method combines vision-based frame processing with systematic state-space exploration using graph-structured representations. It segments visual frames into meaningful components, prioritizes actions based on visual salience, and maintains a directed graph of explored states and transitions. By tracking visited states and tested actions, the agent prioritizes actions that provide the shortest path to untested state-action pairs. On the ARC-AGI-3 Preview Challenge, this structured exploration strategy solves a median of 30 out of 52 levels across six games and ranks 3rd on the private leaderboard, substantially outperforming frontier LLM-based agents. These results demonstrate that explicit graph-structured exploration, even without learning, can serve as a strong baseline for interactive reasoning and underscore the importance of systematic state tracking and action prioritization in sparse-feedback environments where current LLMs fail to capture task dynamics. The code is open source and available at https://github.com/dolphin-in-a-coma/arc-agi-3-just-explore.
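The exploration rule, a shortest path over the explored state graph to the nearest untested state-action pair, can be sketched compactly with BFS; the graph encoding below is an assumption:

```python
from collections import deque

def shortest_path_to_untested(graph, tested, start):
    """graph: {state: {action: next_state}}; tested: set of (state, action).
    Returns the shortest action sequence ending in an untested pair."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        for action, nxt in graph.get(state, {}).items():
            if (state, action) not in tested:
                return path + [action]            # first untested pair found
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None                                   # everything reachable is tested
```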
[303] Deep Reinforcement Learning for Solving the Fleet Size and Mix Vehicle Routing Problem
Pengfu Wan, Jiawei Chen, Gangyan Xu
Main category: cs.AI
TL;DR: A deep reinforcement learning approach called FRIPN solves Fleet Size and Mix Vehicle Routing Problem efficiently by integrating fleet composition and routing decisions in a unified policy network.
Details
Motivation: FSMVRP is complex due to simultaneous fleet composition and routing decisions, especially challenging in large-scale, time-constrained real-world applications like vehicle rental and on-demand logistics.
Method: Formulate FSMVRP as a Markov Decision Process, and develop the FRIPN policy network with specialized input embeddings (including a remaining graph embedding) to integrate fleet composition and routing decisions.
Result: Method generates near-optimal solutions within seconds, shows computational efficiency and scalability advantages in large-scale, time-constrained scenarios on random and benchmark datasets.
Conclusion: The DRL-based approach has practical application potential and provides inspiration for extending DRL techniques to other VRP variants.
Abstract: The Fleet Size and Mix Vehicle Routing Problem (FSMVRP) is a prominent variant of the Vehicle Routing Problem (VRP), extensively studied in operations research and computational science. FSMVRP requires simultaneous decisions on fleet composition and routing, making it highly applicable to real-world scenarios such as short-term vehicle rental and on-demand logistics. However, these requirements also increase the complexity of FSMVRP, posing significant challenges, particularly in large-scale and time-constrained environments. In this paper, we propose a deep reinforcement learning (DRL)-based approach for solving FSMVRP, capable of generating near-optimal solutions within a few seconds. Specifically, we formulate the problem as a Markov Decision Process (MDP) and develop a novel policy network, termed FRIPN, that seamlessly integrates fleet composition and routing decisions. Our method incorporates specialized input embeddings designed for distinct decision objectives, including a remaining graph embedding to facilitate effective vehicle employment decisions. Comprehensive experiments are conducted on both randomly generated instances and benchmark datasets. The experimental results demonstrate that our method exhibits notable advantages in terms of computational efficiency and scalability, particularly in large-scale and time-constrained scenarios. These strengths highlight the potential of our approach for practical applications and provide valuable inspiration for extending DRL-based techniques to other variants of VRP.
[304] Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework
Varun Kumar, George Em Karniadakis
Main category: cs.AI
TL;DR: A multi-agent AI framework for engineering design that uses specialized AI agents with structured knowledge graphs to collaboratively generate and refine designs, demonstrated through aerodynamic optimization of NACA airfoils.
Details
Motivation: Traditional engineering design processes are resource-intensive, inefficient, and require expertise from multiple domains, leading to complex collaborations and iterative refinements that need improvement.
Method: A three-agent AI framework: 1) Graph Ontologist uses LLM to build domain-specific knowledge graphs from literature; 2) Systems Engineer formulates technical requirements from human input; 3) Design Engineer uses knowledge graphs and computational tools to propose designs. They engage in iterative feedback loops with qualitative and quantitative review.
Result: Demonstrated successful application to aerodynamic optimization of 4-digit NACA airfoils, showing how collaborative AI agents can enhance design processes through structured knowledge representations and iterative refinement.
Conclusion: Collaborative AI agents equipped with structured knowledge representations can significantly enhance efficiency, consistency, and quality in engineering design processes, providing a framework for multi-domain design optimization.
Abstract: The engineering design process often demands expertise from multiple domains, leading to complex collaborations and iterative refinements. Traditional methods can be resource-intensive and prone to inefficiencies. To address this, we formalize the engineering design process through a multi-agent AI framework that integrates structured design and review loops. The framework introduces specialized knowledge-driven agents that collaborate to generate and refine design candidates. As an exemplar, we demonstrate its application to the aerodynamic optimization of 4-digit NACA airfoils. The framework consists of three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems Engineer. The Graph Ontologist employs a Large Language Model (LLM) to construct two domain-specific knowledge graphs from airfoil design literature. The Systems Engineer, informed by a human manager, formulates technical requirements that guide design generation and evaluation. The Design Engineer leverages the design knowledge graph and computational tools to propose candidate airfoils meeting these requirements. The Systems Engineer reviews the design and provides feedback, both qualitative and quantitative, using its own knowledge graph, forming an iterative feedback loop until a design is validated by the manager. The final design is then optimized to maximize performance metrics such as the lift-to-drag ratio. Overall, this work demonstrates how collaborative AI agents equipped with structured knowledge representations can enhance efficiency, consistency, and quality in the engineering design process.
[305] Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
Lijun Zhang, Lin Li, Wei Wei, Yajie Qi, Huizhong Song, Jun Wang, Yaodong Yang, Jiye Liang
Main category: cs.AI
TL;DR: RSA is a risk-aware alignment method that uses nested risk measures to control safety risks during LLM fine-tuning, addressing both excessive model shift and rare catastrophic behaviors.
Details
Motivation: Existing safety alignment methods (like Safe RLHF and SACPO) operate under risk-neutral paradigms that fail to adequately address risks from policy deviations and lack robustness against rare but catastrophic harmful behaviors.
Method: Risk-aware Stepwise Alignment (RSA) formulates safety alignment as token-level risk-aware constrained policy optimization using nested risk measures, solved through stepwise alignment with token-level policy updates derived from these risk measures.
Result: Experimental results show RSA achieves high helpfulness while ensuring strong safety and significantly suppresses tail risks (low-probability yet high-impact unsafe responses).
Conclusion: RSA provides a novel risk-aware approach to LLM safety alignment that explicitly addresses both excessive model shift and rare catastrophic behaviors through nested risk measures, with theoretical guarantees on policy optimality.
Abstract: When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, typically operate under a risk-neutral paradigm that is insufficient to address the risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into the policy optimization process by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure that yields token-level policy updates derived from the nested risk measures. This design offers two key benefits: (1) it mitigates risks induced by excessive model shift away from a reference policy, and (2) it explicitly suppresses low-probability yet high-impact harmful behaviors. Moreover, we provide theoretical analysis on policy optimality under mild assumptions. Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety and significantly suppresses tail risks, namely low-probability yet high-impact unsafe responses.
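The abstract does not spell out the specific nested risk measure, but the intuition for why a tail-sensitive measure beats a risk-neutral average is easy to see with CVaR, one common choice; the cost distribution below is synthetic.

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of costs."""
    costs = np.sort(costs)[::-1]                  # descending: worst first
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[:k].mean()

rng = np.random.default_rng(0)
# Per-response safety costs: mostly harmless, with rare catastrophic outliers.
costs = np.concatenate([rng.normal(0.1, 0.05, 990), rng.normal(5.0, 0.5, 10)])

print(f"risk-neutral mean cost: {costs.mean():.3f}")     # barely registers the tail
print(f"CVaR(alpha=0.01):       {cvar(costs, 0.01):.3f}")  # dominated by the tail
```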
[306] Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
Seohui Bae, Jeonghye Kim, Youngchul Sung, Woohyung Lim
Main category: cs.AI
TL;DR: A test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement without gradient updates or additional training for LLMs operating under partial observability.
Details
Motivation: LLM agents operating under partial observability need better ways to align with latent world states without requiring gradient-based updates or additional training, which can be computationally expensive.
Method: The agent maintains an external structured belief over environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over belief space using a lightweight LLM-based surrogate. World alignment is assessed through a novel reward quantifying consistency between posterior belief and ground-truth environment configuration.
Result: The method outperforms inference-time scaling baselines (prompt-augmented or retrieval-enhanced LLMs) in aligning with latent world states with significantly lower integration overhead.
Conclusion: Posterior-guided belief refinement enables effective test-time adaptation for LLM agents under partial observability without requiring gradient updates or additional training, achieving better world state alignment with lower computational overhead.
Abstract: In this paper, we propose a test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement without relying on gradient-based updates or additional training for an LLM agent operating under partial observability. Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between posterior belief and ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines, such as prompt-augmented or retrieval-enhanced LLMs, in aligning with latent world states with significantly lower integration overhead.
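A toy version of the belief-update-and-information-gain loop clarifies the action-selection criterion; the three candidate states and the observation model are invented, and the paper replaces these hand-coded likelihoods with an LLM surrogate.

```python
import math

states = ["A", "B", "C"]
belief = {s: 1 / 3 for s in states}
obs_model = {  # action -> state -> P(obs == "yes" | state, action)
    "probe_left":  {"A": 0.9, "B": 0.1, "C": 0.5},
    "probe_right": {"A": 0.5, "B": 0.5, "C": 0.5},  # uninformative probe
}

def entropy(b):
    return -sum(p * math.log(p) for p in b.values() if p > 0)

def posterior(b, action, obs):
    like = {s: obs_model[action][s] if obs == "yes" else 1 - obs_model[action][s]
            for s in states}
    z = sum(b[s] * like[s] for s in states)
    return {s: b[s] * like[s] / z for s in states}

def expected_info_gain(b, action):
    gain = 0.0
    for obs in ("yes", "no"):
        p_obs = sum(b[s] * (obs_model[action][s] if obs == "yes"
                            else 1 - obs_model[action][s]) for s in states)
        gain += p_obs * (entropy(b) - entropy(posterior(b, action, obs)))
    return gain

best = max(obs_model, key=lambda a: expected_info_gain(belief, a))
print(best, {a: round(expected_info_gain(belief, a), 4) for a in obs_model})
```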
[307] What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun
Main category: cs.AI
TL;DR: The paper proposes JEPA-WMs, a family of world models that plan in learned representation space rather than input space, and conducts a comprehensive study to optimize their components for better planning efficiency in physical tasks.
Details
Motivation: Current AI agents struggle with generalization to unseen physical tasks and environments. While world models trained from trajectories help, planning in input space is inefficient. The motivation is to investigate representation-space planning methods that abstract irrelevant details for more efficient planning.
Method: The authors characterize existing methods as JEPA-WMs and conduct a comprehensive study of key components: model architecture, training objectives, and planning algorithms. They experiment with both simulated environments and real-world robotic data to find optimal approaches within this family.
Result: The proposed model outperforms two established baselines (DINO-WM and V-JEPA-2-AC) in both navigation and manipulation tasks. The study identifies optimal technical choices for representation-space planning.
Conclusion: Planning in learned representation space (JEPA-WMs) offers more efficient planning for physical tasks by abstracting irrelevant details. The comprehensive study provides guidance on optimal technical choices for this approach, with code and data made available for reproducibility.
Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.
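The core idea of representation-space planning can be sketched with a generic cross-entropy-method planner over an assumed linear latent model; this is not the paper's encoder or dynamics, only the shape of the optimization the JEPA-WM family performs.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.95]])  # toy linear latent dynamics: z' = A z + u

def rollout(z0, actions):
    z = z0.copy()
    for u in actions:
        z = A @ z + u
    return z

def cem_plan(z0, z_goal, horizon=5, iters=5, pop=256, elite=32):
    """Cross-entropy method over action sequences, scored purely in latent space."""
    mu, sigma = np.zeros((horizon, 2)), np.ones((horizon, 2))
    for _ in range(iters):
        cand = mu + sigma * rng.standard_normal((pop, horizon, 2))
        scores = np.array([-np.linalg.norm(rollout(z0, c) - z_goal) for c in cand])
        elites = cand[np.argsort(scores)[-elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu  # execute the first action, then replan (MPC-style)

start, goal = np.array([1.0, -1.0]), np.zeros(2)
plan = cem_plan(start, goal)
print("latent distance after executing plan:",
      round(float(np.linalg.norm(rollout(start, plan) - goal)), 4))
```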
[308] Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments
Zhiwei Wei, Yuxing Liu, Hua Liao, Wenjia Xu
Main category: cs.AI
TL;DR: Interactive evaluation framework for FM agents’ spatial understanding in symbolic map environments, showing exploration affects experience acquisition, memory representation is crucial for spatial consolidation, and reasoning schemes shape knowledge usage.
Details
Motivation: Most existing evaluations of spatial ability in foundation models rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. There's a need to understand how FM agents explore, remember, and reason in map environments.
Method: Proposed interactive evaluation framework where agents incrementally explore partially observable grid-based maps (roads, intersections, POIs) receiving only local observations. Evaluated using six spatial tasks while systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models.
Result: Exploration primarily affects experience acquisition but has limited impact on final reasoning accuracy. Memory representation plays central role in consolidating spatial experience, with structured memories (sequential and graph-based) substantially improving performance on structure-intensive tasks like path planning. Reasoning schemes shape how stored knowledge is used, with advanced prompts supporting more effective multi-step inference. Spatial reasoning performance saturates beyond certain capability thresholds.
Conclusion: Improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone. The framework reveals distinct functional roles of exploration, memory, and reasoning components in FM agents’ spatial understanding.
Abstract: Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding. In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated using six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has a limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience, with structured memories, particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.
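The "local observations only" setup is easy to picture with a tiny symbolic grid; the map, symbols, and window radius below are invented stand-ins for the framework's road/intersection/POI maps.

```python
# Minimal sketch of local observation on a symbolic grid map.
GRID = [
    "#####",
    "#..P#",
    "#.#.#",
    "#A..#",
    "#####",
]  # '#': wall, '.': road, 'P': POI, 'A': agent

def local_observation(grid, pos, radius=1):
    """Return the (2r+1)x(2r+1) window around the agent; everything else is hidden."""
    r, c = pos
    return ["".join(grid[i][j] if 0 <= i < len(grid) and 0 <= j < len(grid[0])
                    else "?" for j in range(c - radius, c + radius + 1))
            for i in range(r - radius, r + radius + 1)]

for row in local_observation(GRID, (3, 1)):
    print(row)  # the agent must integrate such windows into memory over many steps
```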
[309] Evaluating the Reasoning Abilities of LLMs on Underrepresented Mathematics Competition Problems
Samuel Golladay, Majid Bani-Yaghoub
Main category: cs.AI
TL;DR: LLMs struggle with underrepresented math competition problems, especially Geometry, with DeepSeek-V3 performing best but making computational errors, while GPT-4o-mini and Gemini show different error patterns.
Details
Motivation: Most LLM math reasoning studies use the same benchmark datasets, limiting generalizability and failing to capture diverse mathematical challenges. This study aims to analyze LLM performance on underrepresented mathematics competition problems to better understand their limitations.
Method: Tested three leading LLMs (GPT-4o-mini, Gemini-2.0-Flash, DeepSeek-V3) on Missouri Collegiate Mathematics Competition problems in Calculus, Analytic Geometry, and Discrete Mathematics. Compared LLM responses to known correct solutions and analyzed reasoning patterns to identify error types across problem domains.
Result: DeepSeek-V3 performed best across all three categories (Calculus, Analytic Geometry, Discrete Mathematics). All three LLMs showed notably weak performance in Geometry. DeepSeek-V3 errors were mainly computational/logical, GPT-4o-mini had logical/approach errors, and Gemini struggled with incomplete reasoning and rushed conclusions.
Conclusion: Evaluating LLMs on underrepresented mathematics competition datasets reveals distinct error patterns and highlights ongoing challenges in structured reasoning, particularly in Geometry. This approach provides deeper insights beyond standard benchmarks.
Abstract: Understanding the limitations of Large Language Models, or LLMs, in mathematical reasoning has been the focus of several recent studies. However, the majority of these studies use the same datasets for benchmarking, which limits the generalizability of their findings and may not fully capture the diverse challenges present in mathematical tasks. The purpose of the present study is to analyze the performance of LLMs on underrepresented mathematics competition problems. We prompted three leading LLMs, namely GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3, with the Missouri Collegiate Mathematics Competition problems in the areas of Calculus, Analytic Geometry, and Discrete Mathematics. The LLMs' responses were then compared to the known correct solutions in order to determine the accuracy of the LLM for each problem domain. We also analyzed the LLMs' reasoning to explore patterns in errors across problem types and models. DeepSeek-V3 had the best performance in all three categories of Calculus, Analytic Geometry, and Discrete Mathematics, both in reasoning and correct final answers. All three LLMs exhibited notably weak performance in Geometry. The majority of errors made by DeepSeek-V3 were attributed to computational and logical mistakes, whereas GPT-4o-mini frequently exhibited logical and approach-related errors. Gemini, on the other hand, tended to struggle with incomplete reasoning and drawing rushed conclusions. In conclusion, evaluating LLMs on underrepresented mathematics competition datasets can provide deeper insights into their distinct error patterns and highlight ongoing challenges in structured reasoning, particularly within the domain of Geometry.
[310] From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning
Amir Tahmasbi, Sadegh Majidi, Kazem Taram, Aniket Bera
Main category: cs.AI
TL;DR: A two-stage approach for spatial reasoning in LLMs: first fine-tune on basic spatial transformations, then train lightweight adapters for multi-step planning in puzzle environments.
Details
Motivation: LLMs have strong general language capabilities but struggle with spatial transformations and multi-step planning in structured environments, which limits applications in navigation and planning.
Method: Two-stage approach: 1) Supervised fine-tuning on elementary spatial transformations (rotation, translation, scaling) to build basic spatial physics understanding; 2) Freeze physics-aware model and train lightweight LoRA adapters within GRPO framework to learn policies for composing building blocks in multi-step planning.
Result: Method consistently outperforms baselines (generic backbone, physics-aware model, end-to-end RL models) in both Dynamic and Static environments. Also converges faster with more stable training compared to end-to-end RL from scratch.
Conclusion: The proposed decomposition approach effectively enhances spatial reasoning in LLMs, with attention pattern analysis confirming meaningful improvements in spatial understanding.
Abstract: Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step planning in puzzle-based environments, in a closed-loop manner. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training compared to end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.
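The second stage freezes the physics-aware model and trains only LoRA adapters; the snippet below shows just the freeze-plus-low-rank-update mechanics on a single linear layer, as a hand-rolled stand-in for a LoRA library with an invented rank and layer size.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + B (A x)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # stage-1 "physics" weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter params: {trainable}")  # only A and B; GRPO would update these
```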
[311] MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
Wenrui Liu, Zixiang Liu, Elsie Dai, Wenhan Yu, Lei Yu, Tong Yang
Main category: cs.AI
TL;DR: MCPAgentBench: A benchmark for evaluating LLM agents’ tool-use capabilities using real-world MCP definitions, simulated tools, and dynamic sandbox environments with distractor tools.
Details
Motivation: Current MCP evaluation sets have limitations: they rely on external MCP services and lack difficulty awareness. There's a need for better benchmarks to evaluate LLM agents' tool-use capabilities as they increasingly serve as autonomous agents using external tools via MCP.
Method: Propose MCPAgentBench benchmark with: 1) Dataset containing authentic tasks and simulated MCP tools, 2) Dynamic sandbox environment presenting agents with candidate tool lists containing distractors, 3) Comprehensive metrics measuring both task completion rates and execution efficiency.
Result: Experiments on various latest mainstream LLMs reveal significant performance differences in handling complex, multi-step tool invocations. All code is open-source on GitHub.
Conclusion: MCPAgentBench addresses limitations of current MCP evaluation methods and provides a robust framework for assessing LLM agents’ tool selection, discrimination, and execution capabilities in realistic scenarios.
Abstract: Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real-world MCP definitions designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on the latest mainstream Large Language Models reveal significant performance differences in handling complex, multi-step tool invocations. All code is open-source on GitHub.
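A toy sandbox turn shows the distractor setup and the two metric families; the tool names, the keyword-matching "agent", and the efficiency formula are all invented for illustration.

```python
import random

random.seed(0)
gold_tools = ["weather_lookup", "unit_convert"]
distractors = ["stock_quote", "image_resize", "dns_query", "calendar_add"]

def make_candidate_list(k_distractors=3):
    """Present the gold tools mixed with distractors, in shuffled order."""
    cands = gold_tools + random.sample(distractors, k_distractors)
    random.shuffle(cands)
    return cands

def keyword_agent(task, candidates):
    """Stand-in for an LLM agent: pick tools whose name matches a task keyword."""
    return [t for t in candidates if any(w in t for w in task.split("_"))]

task = "weather_unit"   # needs weather_lookup, then unit_convert
picked = keyword_agent(task, make_candidate_list())
completed = set(gold_tools) <= set(picked)          # task-completion metric
efficiency = len(gold_tools) / max(1, len(picked))  # penalize spurious tool calls
print(picked, "| completed:", completed, "| efficiency:", round(efficiency, 2))
```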
[312] Recursive Language Models
Alex L. Zhang, Tim Kraska, Omar Khattab
Main category: cs.AI
TL;DR: RLMs enable LLMs to process arbitrarily long prompts via recursive self-calling on prompt snippets, handling inputs 100x beyond context windows with better quality than base models.
Details
Motivation: LLMs have limited context windows that restrict their ability to process arbitrarily long prompts, creating a need for inference-time scaling solutions.
Method: Propose Recursive Language Models (RLMs) - an inference strategy where LLMs treat long prompts as external environments, programmatically examine/decompose them, and recursively call themselves on prompt snippets.
Result: RLMs handle inputs up to 100x beyond model context windows, outperform base LLMs and common long-context scaffolds across four diverse tasks, with comparable or cheaper cost per query.
Conclusion: RLMs provide an effective general inference strategy for processing arbitrarily long prompts, offering superior quality and scalability over existing approaches.
Abstract: We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
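The recursive-call pattern can be sketched in a few lines; the stub `llm`, the fixed halving split, and the sum-based merge are placeholders for the model-driven decomposition the paper describes.

```python
def llm(prompt: str) -> str:
    """Stub LLM: 'answers' by counting a needle in the context after '---'."""
    return str(prompt.split("---")[-1].count("needle"))

def rlm(query: str, chunks: list, window: int = 4) -> str:
    if len(chunks) <= window:                      # fits the context: answer directly
        return llm(query + "\n---\n" + " ".join(chunks))
    mid = len(chunks) // 2                         # too long: decompose and recurse
    parts = [rlm(query, chunks[:mid], window), rlm(query, chunks[mid:], window)]
    # A real RLM would prompt the model itself to merge partial answers;
    # for this counting stub, merging is just a sum.
    return str(sum(int(p) for p in parts))

doc = (["filler"] * 5 + ["needle"]) * 40           # 240 chunks, far past the "window"
print(rlm("How many needles are in the document?", doc))  # -> 40
```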
[313] Reinforcement Learning-Augmented LLM Agents for Collaborative Decision Making and Performance Optimization
Dong Qiu, Duo Xu, Limengxi Yue
Main category: cs.AI
TL;DR: A reinforcement learning framework for LLMs that enables multi-agent collaboration through Dec-POMDP formulation and CTDE, achieving significant improvements in collaborative writing and coding tasks.
Details
Motivation: LLMs perform well individually but lack collaborative awareness and struggle to optimize global performance in multi-agent settings, limiting their effectiveness in complex collaborative workflows.
Method: Formulates cooperation as Dec-POMDP with CTDE, introduces Group Relative Policy Optimization (GRPO) for joint policy optimization with global signals during training, and uses simplified joint reward balancing task quality, speed, and coordination cost.
Result: Achieves 3x increase in task processing speed over single-agent baselines, 98.7% structural/style consistency in writing, and 74.6% test pass rate in coding, consistently outperforming strong multi-agent LLM baselines.
Conclusion: The framework provides a practical path toward reliable collaboration in complex workflows by enabling effective multi-agent LLM cooperation through reinforcement learning and decentralized execution.
Abstract: Large Language Models (LLMs) perform well in language tasks but often lack collaborative awareness and struggle to optimize global performance in multi-agent settings. We present a reinforcement learning-augmented LLM agent framework that formulates cooperation as a decentralized partially observable Markov decision process (Dec-POMDP) and adopts centralized training with decentralized execution (CTDE). We introduce Group Relative Policy Optimization (GRPO) to jointly optimize agent policies with access to global signals during training, together with a simplified joint reward that balances task quality, speed, and coordination cost. On collaborative writing and coding benchmarks, our framework delivers a 3x increase in task processing speed over single-agent baselines, 98.7% structural/style consistency in writing, and a 74.6% test pass rate in coding. The approach consistently outperforms strong multi-agent LLM baselines and provides a practical path toward reliable collaboration in complex workflows.
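The GRPO ingredient is easy to illustrate: advantages are computed by standardizing rewards within a group of rollouts for the same prompt, with no learned value baseline. The joint-reward weights below are invented, not the paper's values.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One group of 6 rollouts; a joint reward balances quality, speed, and
# coordination cost, with illustrative weights.
quality    = np.array([0.9, 0.7, 0.8, 0.6, 0.95, 0.5])
speed      = np.array([0.5, 0.9, 0.6, 0.8, 0.40, 0.7])
coord_cost = np.array([0.1, 0.3, 0.2, 0.1, 0.20, 0.4])
joint = 0.6 * quality + 0.3 * speed - 0.1 * coord_cost
print(group_relative_advantages(joint).round(2))
```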
[314] Group Deliberation Oriented Multi-Agent Conversational Model for Complex Reasoning
Zheyu Shi, Dong Qiu, Shanlong Yu
Main category: cs.AI
TL;DR: A multi-agent conversational model with three-level role division (generation, verification, integration) improves complex reasoning accuracy and consistency through self-game mechanisms, retrieval enhancement, and optimized collaborative training.
Details
Motivation: To address limitations of single large language models in complex reasoning tasks by leveraging group deliberation dynamics and multi-agent collaboration.
Method: Three-level role architecture: opinion generation agent, evidence verification agent (with retrieval), and consistency arbitration agent. Includes self-game mechanism for multi-path reasoning, retrieval enhancement module, composite reward function (factual consistency + logical coherence), and improved proximal policy optimization for collaborative training.
Result: Improves multi-hop reasoning accuracy by 16.8% on HotpotQA, 14.3% on 2WikiMultihopQA, and 19.2% on MeetingBank, while improving consistency by 21.5%. Achieves higher reasoning efficiency than mainstream multi-agent approaches.
Conclusion: The proposed group deliberation oriented multi-agent conversational model provides an effective and stable solution for complex reasoning tasks, outperforming existing approaches in both accuracy and consistency.
Abstract: This paper proposes a group deliberation oriented multi-agent conversational model to address the limitations of single large language models in complex reasoning tasks. The model adopts a three-level role division architecture consisting of generation, verification, and integration. An opinion generation agent produces diverse reasoning perspectives, an evidence verification agent retrieves external knowledge and quantifies factual support, and a consistency arbitration agent integrates logically coherent conclusions. A self-game mechanism is introduced to expand multi-path reasoning trajectories, while a retrieval enhancement module dynamically supplements external knowledge. A composite reward function combining factual consistency and logical coherence is designed, and an improved proximal policy optimization strategy is applied for collaborative training. Experimental results show that the proposed model improves multi-hop reasoning accuracy by 16.8 percent on HotpotQA, 14.3 percent on 2WikiMultihopQA, and 19.2 percent on MeetingBank, while improving consistency by 21.5 percent. The model achieves higher reasoning efficiency than mainstream multi-agent approaches, providing an effective and stable solution for complex reasoning tasks.
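A tiny sketch of how such a composite reward could be wired; the weights and candidate scores are invented, and in the paper this signal feeds an improved PPO training loop.

```python
def composite_reward(factual_consistency, logical_coherence, w_fact=0.6, w_logic=0.4):
    """Weighted blend of the two reward terms; weights are illustrative."""
    return w_fact * factual_consistency + w_logic * logical_coherence

# Scores for two candidate conclusions, e.g. from the evidence-verification
# agent (factual) and the consistency-arbitration agent (logical).
candidates = {"conclusion_1": (0.9, 0.6), "conclusion_2": (0.7, 0.95)}
for name, (fact, logic) in candidates.items():
    print(name, round(composite_reward(fact, logic), 3))
```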
[315] Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun
Main category: cs.AI
TL;DR: Youtu-Agent is a modular LLM agent framework that automates agent generation and enables continuous evolution through workflow/meta-agent modes and hybrid policy optimization (in-context practice + RL training).
Details
Motivation: Address high configuration costs and static capabilities in existing LLM agent frameworks, which require extensive manual effort for tool integration/prompt engineering and struggle to adapt to dynamic environments without expensive fine-tuning.
Method: 1) Structured configuration system decoupling execution environments, toolkits, and context management; 2) Two generation paradigms: Workflow mode for standard tasks and Meta-Agent mode for complex requirements (auto-generates tool code, prompts, configurations); 3) Hybrid policy optimization: Agent Practice module for in-context optimization without parameter updates, and Agent RL module for scalable reinforcement learning.
Result: State-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) with open-weight models; 81% tool synthesis success rate; Practice module improves AIME 2024/2025 performance by +2.7%/+5.4%; Agent RL achieves 40% speedup with steady improvement, enhancing coding/reasoning by 35% and searching by 21% on Maths and QA benchmarks.
Conclusion: Youtu-Agent successfully addresses configuration costs and static capabilities through automated generation and continuous evolution, demonstrating superior performance across multiple benchmarks and enabling practical deployment of adaptable LLM agents.
Abstract: Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.
[316] Multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis under unseen working conditions
Pengcheng Xia, Yixiang Huang, Chengjin Qin, Chengliang Liu
Main category: cs.AI
TL;DR: A multi-modal fusion model combines dual disentanglement (modality-level and domain-level) with cross-domain mixed fusion to diagnose faults under unseen working conditions without target-domain samples.
Details
Motivation: Existing fault diagnosis methods suffer performance decline under unseen working conditions, domain adaptation needs target domain samples, and single-modal approaches overlook complementary multi-modal information.
Method: Dual disentanglement framework to separate modality-invariant/specific features and domain-invariant/specific representations; cross-domain mixed fusion for diversity augmentation; triple-modal fusion for adaptive multi-modal integration.
Result: Outperforms advanced methods on induction motor fault diagnosis under both unseen constant and time-varying working conditions; ablation studies verify component effectiveness
Conclusion: Proposed multi-modal cross-domain approach effectively addresses generalization challenges in fault diagnosis by leveraging complementary multi-modal information and domain adaptation without target samples
Abstract: Intelligent fault diagnosis has become an indispensable technique for ensuring machinery reliability. However, existing methods suffer significant performance decline in real-world scenarios where models are tested under unseen working conditions, while domain adaptation approaches are limited to their reliance on target domain samples. Moreover, most existing studies rely on single-modal sensing signals, overlooking the complementary nature of multi-modal information for improving model generalization. To address these limitations, this paper proposes a multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis. A dual disentanglement framework is developed to decouple modality-invariant and modality-specific features, as well as domain-invariant and domain-specific representations, enabling both comprehensive multi-modal representation learning and robust domain generalization. A cross-domain mixed fusion strategy is designed to randomly mix modality information across domains for modality and domain diversity augmentation. Furthermore, a triple-modal fusion mechanism is introduced to adaptively integrate multi-modal heterogeneous information. Extensive experiments are conducted on induction motor fault diagnosis under both unseen constant and time-varying working conditions. The results demonstrate that the proposed method consistently outperforms advanced methods and comprehensive ablation studies further verify the effectiveness of each proposed component and multi-modal fusion. The code is available at: https://github.com/xiapc1996/MMDG.
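The cross-domain mixing idea resembles mixup applied across source domains; a minimal sketch, with invented feature dimensions and a beta-distributed mixing coefficient as one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_domain_mix(feats_a, feats_b, alpha=0.4):
    """Randomly mix modality features from two seen domains (mixup-style) to
    augment modality and domain diversity; a stand-in for the paper's strategy."""
    lam = rng.beta(alpha, alpha)
    return lam * feats_a + (1 - lam) * feats_b

# Toy 8-dim signal features from two seen working conditions.
domain1 = rng.standard_normal((16, 8))
domain2 = rng.standard_normal((16, 8))
augmented = cross_domain_mix(domain1, domain2)
print(augmented.shape)  # trained on these, the model sees "in-between" domains
```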
[317] BatteryAgent: Synergizing Physics-Informed Interpretation with LLM Reasoning for Intelligent Battery Fault Diagnosis
Songqi Zhou, Ruixue Liu, Boman Su, Jiazhou Wang, Yixing Wang, Benben Jiang
Main category: cs.AI
TL;DR: BatteryAgent: A hierarchical framework integrating physical knowledge features with LLMs for interpretable fault diagnosis of lithium-ion batteries, moving beyond black-box binary classification to provide root cause analysis and maintenance recommendations.
Details
Motivation: Existing deep learning methods for battery fault diagnosis have two key limitations: 1) They are "black-box" models lacking interpretability, and 2) They use binary classification paradigms that cannot provide root cause analysis or maintenance recommendations. There's a need for more transparent, comprehensive diagnostic systems.
Method: Proposes BatteryAgent with three core modules: 1) Physical Perception Layer using 10 mechanism-based features from electrochemical principles; 2) Detection and Attribution Layer using Gradient Boosting Decision Trees and SHAP for feature contribution quantification; 3) Reasoning and Diagnosis Layer using LLM as agent core to create “numerical-semantic” bridge combining SHAP attributions with mechanism knowledge base.
Result: Achieves AUROC of 0.986, significantly outperforming state-of-the-art methods. Effectively corrects misclassifications on hard boundary samples. Extends traditional binary detection to multi-type interpretable diagnosis.
Conclusion: BatteryAgent offers a paradigm shift from “passive detection” to “intelligent diagnosis” for battery safety management, providing comprehensive reports with fault types, root cause analysis, and maintenance suggestions through interpretable, physics-informed AI.
Abstract: Fault diagnosis of lithium-ion batteries is critical for system safety. While existing deep learning methods exhibit superior detection accuracy, their “black-box” nature hinders interpretability. Furthermore, restricted by binary classification paradigms, they struggle to provide root cause analysis and maintenance recommendations. To address these limitations, this paper proposes BatteryAgent, a hierarchical framework that integrates physical knowledge features with the reasoning capabilities of Large Language Models (LLMs). The framework comprises three core modules: (1) A Physical Perception Layer that utilizes 10 mechanism-based features derived from electrochemical principles, balancing dimensionality reduction with physical fidelity; (2) A Detection and Attribution Layer that employs Gradient Boosting Decision Trees and SHAP to quantify feature contributions; and (3) A Reasoning and Diagnosis Layer that leverages an LLM as the agent core. This layer constructs a “numerical-semantic” bridge, combining SHAP attributions with a mechanism knowledge base to generate comprehensive reports containing fault types, root cause analysis, and maintenance suggestions. Experimental results demonstrate that BatteryAgent effectively corrects misclassifications on hard boundary samples, achieving an AUROC of 0.986, which significantly outperforms current state-of-the-art methods. Moreover, the framework extends traditional binary detection to multi-type interpretable diagnosis, offering a new paradigm shift from “passive detection” to “intelligent diagnosis” for battery safety management.
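A compressed sketch of the detection-and-attribution step feeding the reasoning layer; the three feature names and the prompt template are invented stand-ins for the paper's ten mechanism-based features (requires scikit-learn and the shap package).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
import shap  # pip install shap

rng = np.random.default_rng(0)
features = ["voltage_drop", "temp_rise", "ic_peak_shift"]  # illustrative names
X = rng.standard_normal((200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)              # toy fault label

model = GradientBoostingClassifier().fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X[:1])[0]       # attribution, one sample

# "Numerical-semantic" bridge: turn attributions into an LLM prompt.
ranked = sorted(zip(features, sv), key=lambda t: -abs(t[1]))
prompt = ("Probable fault detected. Feature contributions (SHAP): "
          + ", ".join(f"{n}={v:+.3f}" for n, v in ranked)
          + ". Using the mechanism knowledge base, infer the fault type, root"
            " cause, and maintenance actions.")
print(prompt)  # this prompt would go to the LLM agent core
```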
[318] Explaining Why Things Go Where They Go: Interpretable Constructs of Human Organizational Preferences
Emmanuel Fashae, Michael Burke, Leimin Tian, Lingheng Meng, Pamela Carreno-Medrano
Main category: cs.AI
TL;DR: The paper introduces four interpretable constructs for object arrangement preferences and validates them through a questionnaire study, then integrates them into an MCTS planner for robotic rearrangement.
Details
Motivation: Current robotic systems rely on latent preference models from human demonstrations that lack interpretability, limiting insight into the factors guiding human decisions about object arrangement.
Method: 1) Developed four interpretable constructs: spatial practicality, habitual convenience, semantic coherence, and commonsense appropriateness. 2) Designed and validated a self-report questionnaire through a 63-participant online study. 3) Integrated these constructs into a Monte Carlo Tree Search (MCTS) planner for robotic rearrangement.
Result: The study confirmed psychological distinctiveness of the four constructs and their explanatory power across kitchen and living room scenarios. The MCTS planner guided by participant-derived preferences generated reasonable arrangements that closely aligned with human-generated arrangements.
Conclusion: The work contributes a compact, interpretable formulation of object arrangement preferences and demonstrates how it can be operationalized for robot planning, moving beyond black-box latent models to transparent preference modeling.
Abstract: Robotic systems for household object rearrangement often rely on latent preference models inferred from human demonstrations. While effective at prediction, these models offer limited insight into the interpretable factors that guide human decisions. We introduce an explicit formulation of object arrangement preferences along four interpretable constructs: spatial practicality (putting items where they naturally fit best in the space), habitual convenience (making frequently used items easy to reach), semantic coherence (placing items together if they are used for the same task or are contextually related), and commonsense appropriateness (putting things where people would usually expect to find them). To capture these constructs, we designed and validated a self-report questionnaire through a 63-participant online study. Results confirm the psychological distinctiveness of these constructs and their explanatory power across two scenarios (kitchen and living room). We demonstrate the utility of these constructs by integrating them into a Monte Carlo Tree Search (MCTS) planner and show that when guided by participant-derived preferences, our planner can generate reasonable arrangements that closely align with those generated by participants. This work contributes a compact, interpretable formulation of object arrangement preferences and a demonstration of how it can be operationalized for robot planning.
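How participant-derived construct weights might score candidate placements inside a planner is easy to sketch; all weights and per-construct scores below are invented, and the scoring function stands in for the value signal an MCTS rollout would use.

```python
constructs = ("spatial_practicality", "habitual_convenience",
              "semantic_coherence", "commonsense_appropriateness")
user_weights = {"spatial_practicality": 0.2, "habitual_convenience": 0.4,
                "semantic_coherence": 0.1, "commonsense_appropriateness": 0.3}

candidates = {  # placement -> per-construct scores in [0, 1]
    "mug_on_counter": {"spatial_practicality": 0.8, "habitual_convenience": 0.9,
                       "semantic_coherence": 0.5, "commonsense_appropriateness": 0.6},
    "mug_in_cabinet": {"spatial_practicality": 0.7, "habitual_convenience": 0.4,
                       "semantic_coherence": 0.7, "commonsense_appropriateness": 0.9},
}

def score(placement):
    """Preference-weighted value of a placement (MCTS rollout value stand-in)."""
    return sum(user_weights[c] * candidates[placement][c] for c in constructs)

print(max(candidates, key=score), {p: round(score(p), 2) for p in candidates})
```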
[319] GenZ: Foundational models as latent variable generators within traditional statistical models
Marko Jojic, Nebojsa Jojic
Main category: cs.AI
TL;DR: GenZ bridges LLMs and statistical models by discovering interpretable semantic features through an iterative contrastive process, outperforming pure LLM baselines on house price prediction and collaborative filtering tasks.
Details
Motivation: Large language models have broad domain knowledge but fail to capture dataset-specific patterns crucial for prediction tasks. There's a need to combine LLMs' semantic understanding with statistical modeling's ability to learn from data-specific patterns.
Method: Uses an iterative process contrasting groups of items identified via statistical modeling errors to discover semantic feature descriptions. Formulated as generalized EM algorithm jointly optimizing semantic feature descriptors and statistical model parameters. Prompts frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets.
Result: House price prediction: 12% median relative error vs 38% for GPT-5 baseline. Netflix movie embeddings: 0.59 cosine similarity from semantic descriptions alone, matching performance requiring ~4000 user ratings in traditional collaborative filtering.
Conclusion: GenZ successfully bridges foundational models and statistical modeling, discovering dataset-specific semantic features that outperform pure LLM approaches and reveal patterns diverging from the model’s domain knowledge alone.
Abstract: We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model’s domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on discovered features, treating these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using discovered semantic features from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM’s general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions – matching the performance that would require approximately 4000 user ratings through traditional collaborative filtering. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from the model’s domain knowledge alone.
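A toy version of the alternation makes the structure concrete: noisy binary "LLM judgments" feed a regression, and poorly-fit items get re-judged. The `llm_judge` stub, the majority-vote refinement, and all data are invented; the real method refines the feature *descriptions* via group contrasts before re-prompting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Latent binary semantic features z predict price y.
true_z = rng.integers(0, 2, size=(300, 2))
y = 100 + 50 * true_z[:, 0] + 30 * true_z[:, 1] + rng.normal(0, 5, 300)

def llm_judge(item, feature, flip=0.1):
    """Noisy binary judgment about a discovered feature (10% label noise)."""
    z = true_z[item, feature]
    return z if rng.random() > flip else 1 - z

Z = np.array([[llm_judge(i, j) for j in range(2)] for i in range(300)])
for step in range(3):
    # M-step: refit the statistical model on the current judgments.
    Xd = np.column_stack([np.ones(300), Z])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = np.abs(y - Xd @ beta)
    print(f"step {step}: beta={beta.round(1)}, mean |resid|={resid.mean():.2f}")
    # E-step analogue: re-judge the worst-fit items with majority voting
    # (a stand-in for contrasting groups and refining feature descriptions).
    for i in np.argsort(resid)[-60:]:
        for j in range(2):
            Z[i, j] = int(sum(llm_judge(i, j) for _ in range(3)) >= 2)
```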
[320] A study on constraint extraction and exception exclusion in care worker scheduling
Koki Suenaga, Tomohiro Furuta, Satoshi Ono
Main category: cs.AI
TL;DR: A method using constraint templates to extract facility-specific scheduling constraints from manager interviews for long-term care facilities, with mechanisms to exclude exceptional constraints.
Details
Motivation: Long-term care facilities have varying conditions requiring facility-specific scheduling constraints, but existing automated scheduling technologies don't adequately capture these variations through manager interviews.
Method: Uses constraint templates to extract combinations of scheduling components (shift patterns, staff combinations) by adjusting parameters like number of days, staff members, and extraction focus (patterns or frequency). Includes mechanisms to exclude exceptional constraints.
Result: The method successfully created schedules satisfying all hard constraints and reduced soft constraint violations by avoiding extraction of exceptional constraints.
Conclusion: The proposed constraint template approach effectively extracts facility-specific scheduling constraints from manager interviews while handling exceptional cases, enabling better automated schedule generation for long-term care facilities.
Abstract: Technologies for automatically generating work schedules have been extensively studied; however, in long-term care facilities, the conditions vary between facilities, making it essential to interview the managers who create shift schedules to design facility-specific constraint conditions. The proposed method utilizes constraint templates to extract combinations of various components, such as shift patterns for consecutive days or staff combinations. The templates can extract a variety of constraints by changing the number of days and the number of staff members to focus on and changing the extraction focus to patterns or frequency. In addition, unlike existing constraint extraction techniques, this study incorporates mechanisms to exclude exceptional constraints. The extracted constraints can be employed by a constraint programming solver to create care worker schedules. Experiments demonstrated that our proposed method successfully created schedules that satisfied all hard constraints and reduced the number of violations for soft constraints by circumventing the extraction of exceptional constraints.
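A toy frequency-support version of template extraction with exception exclusion illustrates the idea; the shift encoding, window length, and support threshold are invented, and the real templates also range over staff combinations.

```python
# Past schedules per staff member: D=day shift, N=night shift, O=off.
history = ["DDNOO", "DDNOO", "DDNOO", "DNNOO", "DDNOO"]

def extract_patterns(schedules, k=3, min_support=0.8):
    """Keep k-day patterns seen in most schedules; drop rare exceptions."""
    patterns = {s[i:i + k] for s in schedules for i in range(len(s) - k + 1)}
    support = {p: sum(p in s for s in schedules) / len(schedules) for p in patterns}
    return {p: v for p, v in support.items() if v >= min_support}

for pattern, supp in sorted(extract_patterns(history).items()):
    print(f"keep constraint: pattern '{pattern}' (support {supp:.0%})")
# 'DNN' appears in only 1/5 schedules -> excluded as an exceptional constraint;
# the kept patterns would become constraints for a constraint programming solver.
```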
[321] Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng
Main category: cs.AI
TL;DR: ALE is an end-to-end ecosystem for developing agentic LLMs with three components: ROLL for weight optimization, ROCK for trajectory generation, and iFlow CLI for context engineering. They release ROME, an open-source agent trained on 1M+ trajectories using novel data composition and IPA algorithm.
Details
Motivation: The open-source community lacks a principled, end-to-end ecosystem for developing agentic LLMs that can operate in real-world environments over multiple turns, taking actions, observing outcomes, and iteratively refining artifacts.
Method: ALE infrastructure with three components: 1) ROLL for post-training weight optimization, 2) ROCK as a sandbox environment manager for trajectory generation, and 3) iFlow CLI for efficient context engineering. Includes data composition protocols for complex behaviors and Interaction-based Policy Alignment (IPA) algorithm that assigns credit over semantic interaction chunks rather than individual tokens.
Result: Released ROME (open-source agent grounded by ALE) trained on over one million trajectories. Demonstrates strong performance on benchmarks like SWE-bench Verified and Terminal Bench. Introduced Terminal Bench Pro with improved scale and contamination control.
Conclusion: ALE provides an effective foundational infrastructure for agent LLM development, with ROME proving the ecosystem’s capabilities through strong benchmark performance and the introduction of improved evaluation methodologies.
Abstract: Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.
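One illustrative reading of IPA's chunk-level credit assignment: compute a normalized advantage per semantic interaction chunk and broadcast it to every token in that chunk, instead of assigning per-token credit. The chunk boundaries, rewards, and normalization below are invented for illustration.

```python
import numpy as np

def chunk_level_advantages(n_tokens, chunk_bounds, chunk_rewards):
    """One advantage per interaction chunk, shared by all tokens in the chunk."""
    adv = np.zeros(n_tokens)
    r = np.asarray(chunk_rewards, dtype=float)
    r = (r - r.mean()) / (r.std() + 1e-8)   # normalize across chunks
    for (start, end), a in zip(chunk_bounds, r):
        adv[start:end] = a                   # every token in a chunk shares credit
    return adv

# 12 tokens grouped into 3 interaction chunks (e.g. think / act / observe).
bounds = [(0, 5), (5, 9), (9, 12)]
print(chunk_level_advantages(12, bounds, chunk_rewards=[0.2, 1.0, -0.5]).round(2))
```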
[322] Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing
Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, Jakub Winter
Main category: cs.AI
TL;DR: Semi-automated data annotation pipeline for driving scenarios in Polish conditions using human-in-the-loop AI to reduce annotation costs and time.
Details
Motivation: Manual annotation of large-scale multimodal driving datasets is expensive and time-consuming, especially for Polish driving conditions which require specialized datasets.
Method: Human-in-the-loop approach combining AI with human expertise, using 3D object detection for initial annotations, iterative model retraining, data anonymization, and domain adaptation techniques.
Result: Substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities, accelerating dataset preparation for the DARTS project.
Conclusion: The developed pipeline effectively reduces annotation costs and duration while maintaining quality, supporting autonomous vehicle research in Poland through standardized large-scale datasets.
Abstract: This report presents the design and implementation of a semi-automated data annotation pipeline developed within the DARTS project, whose goal is to create a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. Manual annotation of such heterogeneous data is both costly and time-consuming. To address this challenge, the proposed solution adopts a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce annotation cost and duration. The system automatically generates initial annotations, enables iterative model retraining, and incorporates data anonymization and domain adaptation techniques. At its core, the tool relies on 3D object detection algorithms to produce preliminary annotations. Overall, the developed tools and methodology result in substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities. The solution directly supports the DARTS project by accelerating the preparation of large annotated dataset in the project’s standardized format, strengthening the technological base for autonomous vehicle research in Poland.
[323] Iterative Deployment Improves Planning Skills in LLMs
Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal
Main category: cs.AI
TL;DR: Iterative deployment of LLMs fine-tuned on user-curated data from previous deployments leads to significant model property changes, particularly improved planning skills and emergent generalization to longer plans, effectively implementing implicit reinforcement learning.
Details
Motivation: To investigate how iterative deployment of LLMs with user-curated fine-tuning data affects model properties over time, and to understand the underlying mechanisms driving these changes.
Method: Iterative deployment of LLMs where each model is fine-tuned on data carefully curated by users from previous model deployments, tested across various planning domains, with theoretical analysis connecting this process to reinforcement learning.
Result: Substantial improvements in planning skills with emergent generalization to much longer plans than initial models, showing that iterative deployment effectively implements RL training with an implicit reward function.
Conclusion: Iterative deployment creates an implicit RL training loop with important AI safety implications due to undefined reward functions, and offers an alternative training regime to explicit RL through data curation rather than explicit rewards.
Abstract: We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models’ deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
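A toy outer loop shows why curation acts like an implicit reward signal: each generation, users keep only the best plans, and the next "model" is fit to the kept data, so skill ratchets upward. Everything here (the skill scalar, the plan-length stand-in, the keep rule) is invented to illustrate the mechanism, not the paper's setup.

```python
import random

random.seed(0)

def model_sample_plan(skill, max_len=50):
    # Higher skill -> longer coherent plans before a mistake (plan length
    # stands in for the plan itself).
    return min(max_len, int(random.expovariate(1 / (skill + 1))) + 1)

skill = 1.0
for generation in range(5):
    plans = [model_sample_plan(skill) for _ in range(1000)]
    curated = [p for p in plans if p >= skill + 1]   # users keep only good plans
    if curated:                                       # "fine-tune" on curated data
        skill = sum(curated) / len(curated)
    print(f"gen {generation}: mean plan length after curation {skill:.1f}")
```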
[324] AMAP Agentic Planning Technical Report
Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Tucheng Lin, Xin Li, Ning Guo
Main category: cs.AI
TL;DR: STAgent is a specialized LLM agent for spatio-temporal tasks like POI discovery and itinerary planning, featuring tool interaction capabilities while maintaining general performance.
Details
Motivation: To create an agentic LLM that can handle complex spatio-temporal reasoning tasks requiring exploration, verification, and refinement through tool interactions, while preserving general capabilities.
Method: Three key contributions: (1) stable tool environment with 10+ domain-specific tools for asynchronous rollout/training, (2) hierarchical data curation framework filtering high-quality queries at 1:10,000 ratio, (3) cascaded training recipe with seed SFT stage, second SFT on high-certainty queries, and RL stage on low-certainty data.
Result: STAgent shows promising performance on TravelBench while maintaining general capabilities across various benchmarks, demonstrating effectiveness of the agentic approach.
Conclusion: STAgent successfully combines specialized spatio-temporal reasoning with preserved general capabilities through innovative tool environment, data curation, and training methodology.
Abstract: We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries with a filter ratio of 1:10,000, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.
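A small sketch of the certainty-based routing in the cascaded recipe; using the seed model's sample-agreement rate as the certainty proxy and the 0.75 threshold are assumptions made for illustration.

```python
def certainty(answers):
    """Fraction of sampled answers agreeing with the majority answer."""
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

queries = {
    "q1": ["A", "A", "A", "A"],   # high certainty -> second SFT stage
    "q2": ["A", "B", "A", "C"],   # low certainty  -> RL stage
}
sft_pool, rl_pool = [], []
for q, samples in queries.items():
    (sft_pool if certainty(samples) >= 0.75 else rl_pool).append(q)
print("SFT:", sft_pool, "| RL:", rl_pool)
```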
[325] Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings
Tianzhi He, Farrokh Jazizadeh
Main category: cs.AI
TL;DR: LLM-based BEMS AI agent framework for smart building energy management using natural language interaction, achieving 86% accuracy in device control but 49% in cost estimation.
Details
Motivation: Existing building energy management systems have limitations in providing context-aware insights and natural language interaction. The study aims to leverage LLMs' autonomous data analytics capabilities to create intelligent agents that can understand user queries and manage energy systems more effectively.Method: Proposed a three-module framework: perception (sensing), central control (brain using LLMs), and action (actuation/user interaction). Evaluated prototype using 120 user queries across four real-world residential energy datasets, measuring latency, functionality, capability, accuracy, and cost-effectiveness. Used ANOVA tests to demonstrate generalizability.
Result: Promising performance with response accuracy: device control (86%), memory-related tasks (97%), scheduling/automation (74%), energy analysis (77%). Cost estimation tasks showed lower accuracy (49%). Demonstrated trade-off between response accuracy and computational efficiency.
Conclusion: The study formalizes assessment of LLM-based BEMS AI agents and identifies future research directions. While showing strong performance in many areas, complex tasks like cost estimation need improvement, highlighting the accuracy-efficiency trade-off in LLM-based energy management systems.
Abstract: This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype’s performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.
[326] Are Biological Systems More Intelligent Than Artificial Intelligence?
Michael Timothy Bennett
Main category: cs.AI
TL;DR: The paper argues biological systems are more intelligent than AI due to their ability to delegate adaptation across abstraction layers, unlike AI’s static stack architecture.
Details
Motivation: To understand why biological self-organizing systems appear more intelligent than artificial intelligence systems, and to provide a mathematical framework for comparing adaptability across different systems.Method: Develops Stack Theory - modeling systems as stacks of abstraction layers, compares delegation of agentic control across computational, biological, military, governmental and economic systems. Proves “The Law of the Stack” theorem showing adaptability in higher layers requires sufficient adaptability in lower layers.
Result: Biological stacks are more intelligent because they delegate adaptation throughout the system, while contemporary AI has static stacks where lower layers remain fixed during deployment. Cancer-like failures occur in non-biological systems when delegation is inadequate. Mission command military doctrine provides a model for robust system design.
Conclusion: To build more robust and intelligent systems, we should design them to delegate control like biological systems, using mission command principles. Hybrid agent design involves pruning low-level policy spaces while preserving collective identity through appropriate boundary conditions.
Abstract: Are biological self-organising systems ‘more intelligent’ than artificial intelligence (AI)? If so, why? I explore this through a mathematical lens which frames intelligence in terms of adaptability. I model systems as stacks of abstraction layers (“Stack Theory”) and compare them by how they delegate agentic control down their stacks, illustrating with examples of computational, biological, human military, governmental and economic systems. Contemporary AI rests on a static, human-engineered stack in which lower layers are static during deployment. Put provocatively, static stacks resemble inflexible bureaucracies, adapting only top-down. Biological stacks are ‘more intelligent’ because they delegate adaptation. Formally, I prove a theorem (“The Law of the Stack”) showing adaptability in higher layers requires sufficient adaptability in lower layers. Generalising bio-electric explanations of cancer as isolation from collective informational structures, I explore how cancer-like failures occur in non-biological systems when delegation is inadequate. This helps explain how to build more robust systems, by delegating control like the military doctrine of mission command. It also provides a design perspective on hybrid agents (e.g. organoids, systems involving both humans and AI): hybrid creation is a boundary-condition design problem in which human-imposed constraints prune low-level policy spaces to yield desired collective behaviour while preserving collective identity.
[327] Neurosymbolic Association Rule Mining from Tabular Data
Erkan Karabulut, Paul Groth, Victoria Degeler
Main category: cs.AI
TL;DR: Aerial+ is a neurosymbolic ARM method that uses autoencoders to create neural representations, extracts rules from these representations, and produces concise, high-quality rule sets with full data coverage.
Details
Motivation: High-dimensional datasets in Association Rule Mining often lead to rule explosion, increasing execution time and harming downstream task performance. Managing this rule explosion is a central challenge in ARM research.Method: Aerial+ uses an under-complete autoencoder to create neural representations capturing feature associations, then extracts rules from these neural representations by exploiting the model’s reconstruction mechanism.
Result: Extensive evaluations on five datasets against seven baselines show Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable ML models, it significantly reduces execution time while maintaining or improving accuracy.
Conclusion: Aerial+ effectively addresses the rule explosion problem in ARM through a neurosymbolic approach, producing compact yet comprehensive rule sets that improve both efficiency and performance in downstream applications.
Abstract: Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model’s reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.
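A minimal sketch of the extraction idea, assuming a trained under-complete autoencoder over one-hot encoded tabular features: clamp a candidate antecedent at the input, leave other positions uninformative, and read consequents off the reconstruction. The architecture sizes, the 0.5 fill value, and the 0.8 threshold are illustrative assumptions, not Aerial+'s exact settings.

```python
import torch
import torch.nn as nn

class TabularAE(nn.Module):
    """Under-complete autoencoder over one-hot encoded tabular rows."""
    def __init__(self, n_inputs, n_hidden):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Tanh())
        self.dec = nn.Sequential(nn.Linear(n_hidden, n_inputs), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

def extract_rules(model, antecedent_idx, n_inputs, threshold=0.8):
    # Clamp the antecedent's one-hot positions; leave the rest uninformative.
    x = torch.full((1, n_inputs), 0.5)
    x[0, antecedent_idx] = 1.0
    recon = model(x).squeeze(0)
    # Feature values reconstructed with high probability become consequents.
    return [i for i in range(n_inputs)
            if recon[i] > threshold and i not in antecedent_idx]
```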
[328] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim
Main category: cs.AI
TL;DR: The paper proposes a reinforcement learning framework to teach AI agents contextual integrity - determining appropriate information disclosure based on context - using only ~700 synthetic examples, with improvements transferring to real-world privacy benchmarks.
Details
Motivation: As autonomous agents make decisions for users, ensuring contextual integrity (determining appropriate information to share in specific contexts) becomes crucial. Current AI systems need better reasoning about context to avoid inappropriate information disclosure.Method: Two approaches: 1) Prompting LLMs to explicitly reason about contextual integrity, and 2) Developing a reinforcement learning framework that instills contextual reasoning. Uses a small synthetic dataset (~700 examples) with diverse contexts and information disclosure norms.
Result: The method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Improvements transfer from synthetic data to established CI benchmarks like PrivacyLens (human-annotated privacy leakage evaluation).
Conclusion: The proposed RL framework effectively teaches AI agents contextual integrity reasoning with minimal training data, demonstrating transferability to real-world privacy scenarios and providing a scalable approach to improving information disclosure decisions in autonomous agents.
Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) – what is the appropriate information to share while carrying out a certain task – becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only ~700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls. Our code is available at: https://github.com/EricGLan/CI-RL
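The paper's exact reward is not spelled out here, but the kind of objective such an RL framework optimizes can be sketched as a task-success term minus a penalty for disclosures the context's norms do not permit; the field names and penalty weight below are illustrative assumptions.

```python
def ci_reward(response, task_completed, allowed_fields, sensitive_fields,
              leak_penalty=1.0):
    """Reward task completion; penalize norm-violating disclosures."""
    reward = 1.0 if task_completed else 0.0
    # Each sensitive field revealed outside the context's disclosure norm
    # counts as one contextual-integrity violation.
    leaked = [f for f in sensitive_fields
              if f in response and f not in allowed_fields]
    return reward - leak_penalty * len(leaked)
```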
[329] RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models
Jianhao Chen, Mayi Xu, Haoyang Chen, Xiaohu Li, Xiangyu Zhang, Jianjie Huang, Zheng Wang, Xiaochun Cao, Tieyun Qian
Main category: cs.AI
TL;DR: The paper introduces Reasoning-Activated Jailbreak (RAJ) attacks that exploit LRMs’ reasoning chains to bypass safety protocols, and proposes a Principle-Guided Alignment (PGA) framework with a 3,989-sample dataset to enhance safety while preserving reasoning capabilities.
Details
Motivation: Large Reasoning Models have a unique safety vulnerability where their internal reasoning chains can generate harmful content even when final outputs appear safe. Current safety measures overlook this risk, creating a need for specialized defenses against reasoning-based attacks.Method: 1) Propose RAJ attacks via concretization - refining malicious prompts to trigger harmful reasoning chains; 2) Develop PGA framework to construct safety alignment datasets by transforming harmful reasoning traces into safe, constructive responses; 3) Create PGA dataset with 3,989 verified samples.
Result: Fine-tuning LRMs with PGA dataset achieves up to 29.5% improvement in defense success rates across multiple jailbreak benchmarks. The approach effectively defends against reasoning-based attacks while preserving and even enhancing general reasoning capabilities.
Conclusion: The work provides a scalable pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance. PGA framework offers an effective solution to reasoning-based vulnerabilities in LRMs.
Abstract: Large Reasoning Models (LRMs) face a distinct safety vulnerability: their internal reasoning chains may generate harmful content even when the final output appears benign. To address this overlooked risk, we first propose a novel attack paradigm, Reasoning-Activated Jailbreak (RAJ) via Concretization, which demonstrates that refining malicious prompts to be more specific can trigger step-by-step logical reasoning that overrides the model’s safety protocols. To systematically mitigate this vulnerability, we further develop a scalable framework for constructing high-quality safety alignment datasets. This framework first leverages the RAJ attack to elicit challenging harmful reasoning chains from LRMs, then transforms these high-risk traces into safe, constructive, and educational responses through a tailored Principle-Guided Alignment (PGA) mechanism. Then, we introduce the PGA dataset, a verified alignment dataset containing 3,989 samples using our proposed method. Extensive experiments show that fine-tuning LRMs with PGA dataset significantly enhances model safety, achieving up to a 29.5% improvement in defense success rates across multiple jailbreak benchmarks. Critically, our approach not only defends against sophisticated reasoning-based attacks but also preserves, even enhances, the model’s general reasoning capabilities. This work provides a scalable and effective pathway for safety alignment in reasoning-intensive AI systems, addressing the core trade-off between safety and functional performance.
[330] Plan Verification for LLM-Based Embodied Task Completion Agents
Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
Main category: cs.AI
TL;DR: LLM-based iterative verification framework improves noisy embodied AI task plans by having a Judge LLM critique actions and a Planner LLM apply revisions, achieving high precision/recall while preserving error-recovery patterns.
Details
Motivation: LLM-based task plans and human demonstrations for embodied AI often contain noisy actions, redundant navigation, and logical errors that reduce policy quality, necessitating a method to clean and refine these trajectories.Method: Proposes an iterative verification framework where a Judge LLM critiques action sequences and a Planner LLM applies revisions, using natural language prompting for broad generalization across error types including irrelevant actions, contradictions, and missing steps.
Result: Achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout) on TEACh dataset. The refinement converges quickly (96.5% sequences require ≤3 iterations) while improving temporal efficiency and spatial organization.
Conclusion: Establishes plan verification as a reliable LLM capability for spatial planning and action refinement, providing a scalable path to higher-quality training data for imitation learning in embodied AI while preserving human error-recovery patterns for robust corrective behavior.
Abstract: Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
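A minimal sketch of the Judge/Planner loop, with `judge` and `planner` standing in for prompted LLM calls; the empty-critique stopping rule and the default of three iterations echo the reported finding that 96.5% of sequences converge within three rounds.

```python
def refine_plan(actions, judge, planner, max_iters=3):
    """Iteratively critique and revise an action sequence."""
    for _ in range(max_iters):
        critique = judge(actions)             # natural-language list of issues
        if not critique:                      # no remaining issues: converged
            break
        actions = planner(actions, critique)  # apply the suggested revisions
    return actions
```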
[331] TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM
Peng Kuang, Xiangxiang Wang, Wentao Liu, Jian Dong, Kaidi Xu
Main category: cs.AI
TL;DR: TIM-PRM is a novel agentic framework that transforms verification from passive classification to active, tool-augmented investigation to address visual hallucinations and logical inconsistencies in multimodal LLMs.
Details
Motivation: Current Process Reward Models (PRMs) for multimodal mathematical reasoning suffer from sycophancy - they blindly validate flawed hypotheses rather than grounding them in visual reality, and standard outcome-based supervision fails to mitigate visual hallucinations and logical inconsistencies.Method: TIM-PRM introduces an agentic framework that explicitly plans verification strategies and uses Independent Question Asking to query evidence via external tools, decoupling verification from reasoning context to eliminate confirmation bias. The method is instantiated by curating a high-quality dataset of tool-integrated verification trajectories.
Result: The 8B parameter TIM-PRM model surpasses existing open-source multimodal PRMs on VisualProcessBench, significantly outperforming much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.
Conclusion: TIM-PRM effectively addresses the limitations of current PRMs by transforming verification into an active, tool-augmented investigation that eliminates confirmation bias and provides more reliable verification for multimodal mathematical reasoning.
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive performances in mathematical reasoning, yet they remain vulnerable to visual hallucinations and logical inconsistencies that standard outcome-based supervision fails to mitigate. While Process Reward Models (PRMs) promise step-by-step verification, current approaches typically operate as scalar scorers or generative critics that suffer from sycophancy, blindly validating the flawed hypotheses rather than grounding them in visual reality. To bridge this gap, we introduce TIM-PRM (Tool-Integrated Multimodal PRM), a novel agentic framework that transforms verification from a passive classification task into an active, tool-augmented investigation. TIM-PRM is trained to explicitly plan verification strategies and utilizes a mechanism of Independent Question Asking to query evidence via external tools, effectively decoupling verification from the reasoning context to eliminate confirmation bias. We instantiate this method by curating a high-quality dataset of tool-integrated verification trajectories. Extensive experiments on VisualProcessBench demonstrate that our 8B parameter model surpasses existing open-source multimodal PRMs, significantly outperforming much larger models like Qwen2.5-72B and InternVL-78B, while offering interpretable insights into the verification process.
[332] On measuring grounding and generalizing grounding problems
Daniel Quigley, Eric Maynard
Main category: cs.AI
TL;DR: The paper reframes the symbol grounding problem as an audit across multiple desiderata (authenticity, preservation, faithfulness, robustness, compositionality) rather than a binary judgment, and applies this framework to analyze different grounding modes and case studies.
Details
Motivation: To move beyond the binary framing of the symbol grounding problem and provide a more nuanced, multi-dimensional framework for evaluating how symbols/tokens can genuinely be about things in the world rather than just being manipulated shapes in a calculus.Method: Develops an audit framework with six desiderata: authenticity (internal mechanisms acquired through learning/evolution), preservation (atomic meanings intact), faithfulness (correlational and etiological), robustness (graceful degradation), and compositionality (systematic building from parts). Applies this to four grounding modes and three case studies.
Result: Model-theoretic semantics achieves exact composition but lacks etiological warrant; LLMs show correlational fit and local robustness for linguistic tasks but lack selection-for-success on world tasks without grounded interaction; human language meets all desiderata through evolutionary/developmental acquisition.
Conclusion: The framework operationalizes philosophical inquiry about representation, providing a common language and technical framework for systematic investigation of grounding and meaning across philosophy of science, computer science, linguistics, and mathematics.
Abstract: The symbol grounding problem asks how tokens like cat can be about cats, as opposed to mere shapes manipulated in a calculus. We recast grounding from a binary judgment into an audit across desiderata, each indexed by an evaluation tuple (context, meaning type, threat model, reference distribution): authenticity (mechanisms reside inside the agent and, for strong claims, were acquired through learning or evolution); preservation (atomic meanings remain intact); faithfulness, both correlational (realized meanings match intended ones) and etiological (internal mechanisms causally contribute to success); robustness (graceful degradation under declared perturbations); compositionality (the whole is built systematically from the parts). We apply this framework to four grounding modes (symbolic; referential; vectorial; relational) and three case studies: model-theoretic semantics achieves exact composition but lacks etiological warrant; large language models show correlational fit and local robustness for linguistic tasks, yet lack selection-for-success on world tasks without grounded interaction; human language meets the desiderata under strong authenticity through evolutionary and developmental acquisition. By operationalizing a philosophical inquiry about representation, we equip philosophers of science, computer scientists, linguists, and mathematicians with a common language and technical framework for systematic investigation of grounding and meaning.
[333] DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization
Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin
Main category: cs.AI
TL;DR: DaGRPO improves GRPO for LLM reasoning by addressing training instability through sequence-level gradient rectification and off-policy data augmentation, achieving SOTA on math and OOD benchmarks.
Details
Motivation: GRPO enables post-training reasoning in LLMs but suffers from training instability and poor sample efficiency due to lack of distinctiveness in on-policy rollouts - homogeneous samples cause gradient conflicts for routine queries, while scarce positive samples hinder optimization for hard queries.Method: DaGRPO introduces two mechanisms: 1) Sequence-level Gradient Rectification uses fine-grained scoring to dynamically mask low-distinctiveness sample pairs, eliminating gradient conflicts; 2) Off-policy Data Augmentation introduces high-quality anchors to recover training signals for challenging tasks.
Result: Extensive experiments on 9 mathematical reasoning and OOD benchmarks show DaGRPO significantly outperforms SFT, GRPO, and hybrid baselines, achieving +4.7% average accuracy gain on math benchmarks and effectively mitigating gradient explosion while accelerating long-chain reasoning emergence.
Conclusion: DaGRPO successfully addresses GRPO’s training instability by tackling distinctiveness issues, establishing new SOTA for post-training reasoning in LLMs and providing a robust framework for eliciting long-horizon reasoning capabilities.
Abstract: The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness, thereby eradicating gradient conflicts at the source; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. Extensive experiments across 9 mathematical reasoning and out-of-distribution (OOD) generalization benchmarks demonstrate that DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving new state-of-the-art performance (e.g., a +4.7% average accuracy gain on math benchmarks). Furthermore, in-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
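A hedged sketch of sequence-level gradient rectification layered on GRPO-style group advantages: rollouts scored as insufficiently distinctive are masked so near-duplicate samples stop pulling the gradient in conflicting directions. The `distinctiveness` scorer and the 0.2 threshold are illustrative stand-ins for the paper's fine-grained scoring.

```python
import numpy as np

def rectified_advantages(rewards, rollouts, distinctiveness, tau=0.2):
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO group-relative advantage.
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Mask rollouts whose distinctiveness within the group is too low;
    # masked samples contribute no gradient.
    mask = np.array([distinctiveness(x, rollouts) >= tau for x in rollouts])
    return adv * mask
```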
[334] HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition
Wang Lu, Yao Zhu, Jindong Wang
Main category: cs.AI
TL;DR: HAROOD is a comprehensive benchmark for human activity recognition in out-of-distribution settings, evaluating 16 methods across 4 OOD scenarios using 6 datasets with CNN and Transformer architectures.
Details
Motivation: Existing OOD algorithms for HAR are only tested in limited scenarios (like cross-device or cross-position), lacking comprehensive insights on their effectiveness across different distribution shifts. There's a need to understand whether OOD is necessary for HAR and which algorithms perform best.Method: Proposed HAROOD benchmark with 4 defined OOD scenarios (cross-person, cross-position, cross-dataset, cross-time), built testbed covering 6 datasets, 16 comparative methods implemented with CNN-based and Transformer-based architectures, and two model selection protocols.
Result: Extensive experiments revealed that no single method consistently outperforms others across all scenarios, highlighting substantial opportunity for advancement in OOD-based HAR research.
Conclusion: HAROOD provides a modular, extensible benchmark to facilitate OOD-based HAR research, with findings showing current limitations and opportunities for improvement in handling distribution shifts across different scenarios.
Abstract: Sensor-based human activity recognition (HAR) mines activity patterns from the time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary to HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.
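For illustration, the first of the four scenarios (cross-person) reduces to a subject-disjoint split, sketched below under assumed field names for the sensor records:

```python
def cross_person_split(records, test_subjects):
    """Hold out all data from unseen subjects (cross-person OOD scenario)."""
    train = [r for r in records if r["subject"] not in test_subjects]
    test = [r for r in records if r["subject"] in test_subjects]
    return train, test
```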
[335] A Geometric Theory of Cognition
Laha Ale
Main category: cs.AI
TL;DR: A unified geometric framework for cognition where diverse cognitive processes emerge from Riemannian gradient flow of a cognitive potential on a learned metric manifold.
Details
Motivation: Current cognitive science explains different cognitive capacities (perception, memory, reasoning, etc.) through distinct computational theories, lacking a unified mathematical framework. The authors aim to provide a single geometric principle that can explain diverse cognitive phenomena.Method: Represent cognitive state as a point on a differentiable manifold with learned Riemannian metric encoding representational constraints, costs, and structural relations. Define a scalar cognitive potential combining predictive accuracy, structural parsimony, task utility, and normative requirements. Cognition unfolds as Riemannian gradient flow of this potential.
Result: Dual-process effects (intuitive vs. deliberative reasoning) emerge naturally from metric-induced anisotropies creating time-scale separations and geometric phase transitions. Analytical conditions for different cognitive regimes are derived and demonstrated through simulations of canonical cognitive tasks.
Conclusion: The framework establishes a geometric foundation for cognition and provides guiding principles for developing more general and human-like artificial intelligence systems by unifying diverse cognitive phenomena under a single mathematical principle.
Abstract: Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects–rapid intuitive responses and slower deliberative reasoning–emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.
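In the notation one would expect for such a framework (assumed here, not taken verbatim from the paper), the core dynamical law can be written as a Riemannian gradient flow of the cognitive potential:

```latex
% x: cognitive state on the manifold; G(x): learned Riemannian metric;
% Phi: scalar cognitive potential combining the four terms named above.
\begin{equation}
  \dot{x} = -\,G(x)^{-1}\,\nabla \Phi(x),
  \qquad
  \Phi = \Phi_{\text{pred}} + \Phi_{\text{parsimony}}
       + \Phi_{\text{utility}} + \Phi_{\text{norm}}
\end{equation}
```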
[336] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
Main category: cs.AI
TL;DR: SciEvalKit is a unified benchmarking toolkit for evaluating AI models across scientific disciplines, focusing on core scientific competencies like multimodal reasoning, symbolic reasoning, and knowledge understanding.
Details
Motivation: There's a need for specialized evaluation platforms for scientific AI models that go beyond general-purpose benchmarks, addressing authentic scientific challenges across multiple disciplines.Method: The toolkit builds expert-grade scientific benchmarks from real-world, domain-specific datasets, featuring a flexible evaluation pipeline that supports batch evaluation, custom model/dataset integration, and ensures transparent, reproducible results.
Result: SciEvalKit provides a standardized yet customizable infrastructure covering six major scientific domains (physics, chemistry, astronomy, materials science, etc.) with support for seven core scientific competencies.
Conclusion: The open-source toolkit bridges capability-based evaluation with disciplinary diversity, offering a foundation for benchmarking next-generation scientific foundation models and fostering community-driven AI4Science progress.
Abstract: We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
[337] Multimodal Fact-Checking: An Agent-based Approach
Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli
Main category: cs.AI
TL;DR: The paper introduces RW-Post, a high-quality explainable dataset for real-world multimodal fact-checking, and AgentFact, an agent-based framework that improves accuracy and interpretability by emulating human verification workflows.
Details
Motivation: Existing multimodal fact-checking systems have limitations in reasoning and evidence utilization due to lack of comprehensive datasets with annotated reasoning processes and verifiable evidence for real-world misinformation.Method: 1) Created RW-Post dataset by aligning real-world multimodal claims with original social media posts and extracting detailed reasoning/evidence from human fact-checking articles using LLM-assisted pipeline. 2) Proposed AgentFact framework with five specialized agents for strategy planning, evidence retrieval, visual analysis, reasoning, and explanation generation, using iterative workflow of evidence searching and filtering.
Result: Extensive experiments show that the combination of RW-Post dataset and AgentFact framework substantially improves both accuracy and interpretability of multimodal fact-checking compared to existing approaches.
Conclusion: The synergy between high-quality explainable datasets (RW-Post) and agent-based verification frameworks (AgentFact) addresses key bottlenecks in multimodal misinformation detection, enabling more accurate and interpretable fact-checking systems.
Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
[338] InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization
Yu Li, Tian Lan, Zhengling Qi
Main category: cs.AI
TL;DR: InSPO addresses DPO limitations by enabling LLMs to condition on alternative responses during preference optimization, yielding globally optimal policies invariant to modeling choices and improving alignment through self-reflection.
Details
Motivation: DPO and variants have limitations: optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), and treating responses in isolation fails to leverage comparative information in pairwise data, leaving self-reflection capacity untapped.Method: Proposes Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy that conditions on both context and alternative responses, serving as plug-and-play enhancement without architectural changes or inference overhead.
Result: InSPO demonstrates consistent improvements in win rates and length-controlled metrics, proving superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices.
Conclusion: Unlocking self-reflection yields more robust, human-aligned LLMs, with InSPO providing a theoretically sound and practically effective enhancement to existing preference optimization methods.
Abstract: Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior reflecting parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model’s capacity for intrinsic self-reflection untapped. To address it, we propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioning on both context and alternative responses. We prove this formulation superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.
[339] CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang
Main category: cs.AI
TL;DR: CubeBench is a new benchmark using Rubik’s Cube to test LLM agents’ spatial reasoning, mental simulation, and active exploration abilities, revealing severe limitations in long-horizon planning with 0% success rates.
Details
Motivation: LLM agents struggle with physical-world deployment due to poor spatial mental models. The paper identifies three core cognitive challenges: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation.Method: Introduces CubeBench, a generative benchmark centered on Rubik’s Cube with a three-tiered diagnostic framework: 1) foundational state tracking with full symbolic information, 2) intermediate tasks, and 3) active exploration with only partial visual data. Also proposes a diagnostic framework using external solver tools to isolate cognitive bottlenecks.
Result: Experiments on leading LLMs show critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing fundamental failure in long-term planning. The benchmark reveals specific failure modes in spatial reasoning and mental simulation.
Conclusion: CubeBench exposes fundamental gaps in LLM agents’ spatial cognitive abilities needed for physical-world deployment. The analysis of failure modes provides key insights to guide development of more physically-grounded intelligent agents.
Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
[340] Physics-Informed Neural Networks for Device and Circuit Modeling: A Case Study of NeuroSPICE
Chien-Ting Tung, Chenming Hu
Main category: cs.AI
TL;DR: NeuroSPICE is a PINN-based circuit simulator that solves DAEs using neural networks instead of traditional numerical solvers, offering flexibility for emerging devices and design optimization.
Details
Motivation: To overcome limitations of conventional SPICE's time-discretized numerical solvers and enable simulation of emerging nonlinear devices like ferroelectric memories, while providing surrogate models for design optimization.Method: Uses physics-informed neural networks (PINNs) to solve circuit differential-algebraic equations by minimizing equation residuals through backpropagation, modeling waveforms with analytical equations in time domain with exact temporal derivatives.
Result: PINNs don’t outperform SPICE in speed or accuracy during training, but offer unique advantages: surrogate models for design optimization, inverse problem solving, and flexibility for simulating emerging nonlinear devices.
Conclusion: NeuroSPICE provides a flexible PINN-based alternative to conventional SPICE with specialized advantages for design optimization, inverse problems, and emerging device simulation, despite not matching SPICE’s raw speed/accuracy.
Abstract: We present NeuroSPICE, a physics-informed neural network (PINN) framework for device and circuit simulation. Unlike conventional SPICE, which relies on time-discretized numerical solvers, NeuroSPICE leverages PINNs to solve circuit differential-algebraic equations (DAEs) by minimizing the residual of the equations through backpropagation. It models device and circuit waveforms using analytical equations in time domain with exact temporal derivatives. While PINNs do not outperform SPICE in speed or accuracy during training, they offer unique advantages such as surrogate models for design optimization and inverse problems. NeuroSPICE’s flexibility enables the simulation of emerging devices, including highly nonlinear systems such as ferroelectric memories.
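A minimal sketch of the PINN recipe in the simplest setting, an RC low-pass circuit with a step source, where the network represents V(t) and autograd supplies the exact temporal derivative; the architecture, sampling, and loss weights are illustrative assumptions, not NeuroSPICE's configuration.

```python
import torch
import torch.nn as nn

# Circuit equation: R*C*dV/dt + V = Vs, with V(0) = 0.
R, C, Vs = 1e3, 1e-6, 1.0
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    # Random collocation times spanning five RC time constants.
    t = torch.rand(256, 1, requires_grad=True) * 5 * R * C
    v = net(t)
    # Exact temporal derivative via autograd (no time discretization).
    dv = torch.autograd.grad(v.sum(), t, create_graph=True)[0]
    residual = R * C * dv + v - Vs        # equation residual to minimize
    ic = net(torch.zeros(1, 1))           # initial condition V(0) = 0
    loss = (residual ** 2).mean() + (ic ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```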
cs.SD
[341] Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack
Roee Ziv, Raz Lapid, Moshe Sipper
Main category: cs.SD
TL;DR: Universal targeted latent space attack manipulates audio encoder representations to control downstream language model outputs, revealing security vulnerabilities in audio-language models.
Details
Motivation: Audio-language models introduce new security vulnerabilities, and existing attacks focus on waveform-level or input-specific approaches. The authors aim to explore a previously underexplored attack surface at the encoder level of multimodal systems.Method: Proposes a universal targeted latent space attack that learns a universal perturbation at the encoder level. This perturbation generalizes across inputs and speakers, manipulates audio latent representations, and induces attacker-specified outputs in downstream language generation without requiring access to the language model.
Result: Experiments on Qwen2-Audio-7B-Instruct demonstrate consistently high attack success rates with minimal perceptual distortion, showing the effectiveness of the universal perturbation approach.
Conclusion: The attack reveals a critical security vulnerability at the encoder level of multimodal audio-language systems, highlighting the need for better security measures in audio-language models.
Abstract: Audio-language models combine audio encoders with large language models to enable multimodal reasoning, but they also introduce new security vulnerabilities. We propose a universal targeted latent space attack, an encoder-level adversarial attack that manipulates audio latent representations to induce attacker-specified outputs in downstream language generation. Unlike prior waveform-level or input-specific attacks, our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model. Experiments on Qwen2-Audio-7B-Instruct demonstrate consistently high attack success rates with minimal perceptual distortion, revealing a critical and previously underexplored attack surface at the encoder level of multimodal systems.
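A hedged sketch of the universal targeted objective: one shared waveform perturbation is optimized so the encoder's latents for any input move toward an attacker-chosen target, under an L-infinity budget that keeps distortion small. The encoder interface, audio shape, and epsilon are illustrative assumptions.

```python
import torch

def learn_universal_delta(encoder, loader, target_latent,
                          eps=2e-3, steps=1000, lr=1e-3):
    delta = torch.zeros(1, 16000, requires_grad=True)  # assumed 1 s @ 16 kHz
    opt = torch.optim.Adam([delta], lr=lr)
    for _, (wav, _) in zip(range(steps), loader):
        z = encoder(wav + delta)                   # latents of perturbed audio
        loss = (z - target_latent).pow(2).mean()   # pull toward target latent
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                # enforce perceptual budget
    return delta.detach()
```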
[342] AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives
Yanxi Chen, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Xin Li, Peijie Qiu, Hao Wang, Xuanzhao Dong, Yujian Xiong, Anderson Schneider, Yuriy Nevmyvaka, Yalin Wang
Main category: cs.SD
TL;DR: AHA framework reduces hallucinations in Large Audio-Language Models by creating counterfactual training data and diagnostic evaluation, improving Qwen2.5-Omni’s performance on both specialized and public benchmarks.
Details
Motivation: Large Audio-Language Models suffer from hallucinations where they generate text not grounded in audio input, requiring systematic analysis and solution for improved reliability.Method: Introduces AHA framework with counterfactual hard negative mining to create preference datasets that force models to distinguish acoustic evidence from fabrications, plus AHA-Eval diagnostic benchmark for temporal reasoning.
Result: Qwen-Audio-AHA achieves 13.7% improvement on AHA-Eval, 1.3% on MMAU-Test, and 1.6% on MMAR, outperforming SOTA methods with benefits generalizing beyond diagnostic sets.
Conclusion: The AHA framework effectively addresses audio hallucinations through systematic taxonomy and counterfactual training, leading to improved model reliability and performance across multiple benchmarks.
Abstract: Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods.
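An illustrative sketch of counterfactual hard-negative construction following the four-way taxonomy above: each grounded caption is minimally edited along one failure mode to yield a plausible but acoustically false rejection. The editing helpers are hypothetical; the paper's actual mining pipeline is not reproduced here.

```python
# Hypothetical single-edit perturbations, one per failure mode.
FAILURE_MODES = {
    "event_omission":    lambda c: drop_one_event(c),
    "false_identity":    lambda c: swap_event_label(c),
    "temporal_relation": lambda c: swap_event_order(c),
    "quant_temporal":    lambda c: perturb_event_count(c),
}

def build_preference_pairs(audio_id, caption):
    # One (chosen, rejected) pair per failure mode; the rejected text
    # differs from the grounded caption by a single counterfactual edit.
    return [{"audio": audio_id, "chosen": caption, "rejected": edit(caption)}
            for edit in FAILURE_MODES.values()]
```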
[343] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
Tianxin Xie, Wentao Lei, Guanjie Huang, Pengfei Zhang, Kai Jiang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu
Main category: cs.SD
TL;DR: PhyAVBench is a new benchmark for evaluating text-to-audio-video models’ understanding of audio physics, featuring 1,000 paired prompts with controlled physical variables across 6 physics dimensions and 4 scenarios.
Details
Motivation: Existing T2AV models lack physical plausibility in generated sounds due to limited understanding of physical principles. Current benchmarks focus mainly on audio-video synchronization rather than physical grounding.Method: Created PhyAVBench with 1,000 groups of paired text prompts controlling physical variables that implicitly induce sound variations. Covers 6 audio physics dimensions, 4 daily scenarios, and 50 fine-grained test points. Each prompt grounded by at least 20 newly recorded real-world videos to prevent data leakage.
Result: Developed a comprehensive benchmark enabling fine-grained assessment of models’ sensitivity to acoustic condition changes through Audio-Physics Sensitivity Test (APST) paradigm.
Conclusion: Only models with genuine understanding of audio physics can generate physically consistent audio-visual content. PhyAVBench aims to stimulate progress in this underexplored domain.
Abstract: Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models’ sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models’ understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.
[344] Environmental Sound Deepfake Detection Challenge: An Overview
Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang
Main category: cs.SD
TL;DR: The paper introduces EnvSDD, the first large-scale dataset for environmental sound deepfake detection, and presents the ESDD Challenge results from ICASSP 2026.
Details
Motivation: Audio generation models can create realistic soundscapes but raise concerns about misuse for deceptive content. Existing environmental sound deepfake detection datasets are limited in scale and diversity.Method: Created EnvSDD, a large-scale curated dataset for environmental sound deepfake detection, and launched the ESDD Challenge as an ICASSP 2026 Grand Challenge.
Result: The paper presents the ESDD Challenge overview and detailed analysis of challenge results, establishing a benchmark for environmental sound deepfake detection.
Conclusion: EnvSDD addresses the gap in environmental sound deepfake detection datasets, and the ESDD Challenge provides valuable insights and benchmarks for the research community.
Abstract: Recent progress in audio generation models has made it possible to create highly realistic and immersive soundscapes, which are now widely used in film and virtual-reality-related applications. However, these audio generators also raise concerns about potential misuse, such as producing deceptive audio for fabricated videos or spreading misleading information. Therefore, it is essential to develop effective methods for detecting fake environmental sounds. Existing datasets for environmental sound deepfake detection (ESDD) remain limited in both scale and the diversity of sound categories they cover. To address this gap, we introduced EnvSDD, the first large-scale curated dataset designed for ESDD. Based on EnvSDD, we launched the ESDD Challenge, recognized as one of the ICASSP 2026 Grand Challenges. This paper presents an overview of the ESDD Challenge, including a detailed analysis of the challenge results.
[345] Structuring Concept Space with the Musical Circle of Fifths by Utilizing Music Grammar Based Activations
Tofara Moyo, Panashe Chiurunge
Main category: cs.SD
TL;DR: Neural coding framework uses harmonic toroidal codes to implement cognitive operations via dynamical activity on manifolds derived from music theory structures.
Details
Motivation: To develop a neural coding framework that bridges cognitive operations with mathematical structures from music theory, potentially offering new insights into how the brain processes complex information.
Method: Proposes harmonic toroidal codes that implement cognitive operations through dynamical activity on manifolds derived from music-theoretic structures.
Result: A novel neural coding framework that connects cognitive operations with music theory structures via toroidal manifolds.
Conclusion: The harmonic toroidal codes framework provides a promising approach to understanding cognitive operations through mathematical structures from music theory.
Abstract: We propose a neural coding framework, harmonic toroidal codes, in which abstract cognitive operations are implemented through dynamical activity on manifolds derived from music-theoretic structures.
[346] AI-Driven Acoustic Voice Biomarker-Based Hierarchical Classification of Benign Laryngeal Voice Disorders from Sustained Vowels
Mohsen Annabestani, Samira Aghadoost, Anais Rameau, Olivier Elemento, Gloria Chia-Yi Chiang
Main category: cs.SD
TL;DR: Hierarchical ML framework for classifying 8 benign voice disorders from acoustic features of sustained vowels, outperforming flat classifiers and pre-trained models.
Details
Motivation: Benign laryngeal voice disorders affect 20% of people and serve as non-invasive indicators of broader physiological dysfunction, requiring automated classification tools for early screening and diagnostic triage.
Method: Three-stage hierarchical framework: 1) Binary screening (pathological vs non-pathological) using CNN mel-spectrogram features + 21 acoustic biomarkers; 2) Stratification into Healthy, Functional/Psychogenic, Structural/Inflammatory groups with cubic SVM; 3) Fine-grained classification incorporating probabilistic outputs from prior stages.
Result: Outperformed flat multi-class classifiers and pre-trained models (META HuBERT, Google HeAR) on 15,132 recordings from 1,261 speakers, improving discrimination of structural/inflammatory vs functional disorders.
Conclusion: Combining deep spectral representations with interpretable acoustic features enhances clinical transparency and alignment, demonstrating potential for scalable, non-invasive voice biomarkers in early screening, diagnostic triage, and longitudinal monitoring.
Abstract: Benign laryngeal voice disorders affect nearly one in five individuals and often manifest as dysphonia, while also serving as non-invasive indicators of broader physiological dysfunction. We introduce a clinically inspired hierarchical machine learning framework for automated classification of eight benign voice disorders alongside healthy controls, using acoustic features extracted from short, sustained vowel phonations. Experiments utilized 15,132 recordings from 1,261 speakers in the Saarbruecken Voice Database, covering vowels /a/, /i/, and /u/ at neutral, high, low, and gliding pitches. Mirroring clinical triage workflows, the framework operates in three sequential stages: Stage 1 performs binary screening of pathological versus non-pathological voices by integrating convolutional neural network-derived mel-spectrogram features with 21 interpretable acoustic biomarkers; Stage 2 stratifies voices into Healthy, Functional or Psychogenic, and Structural or Inflammatory groups using a cubic support vector machine; Stage 3 achieves fine-grained classification by incorporating probabilistic outputs from prior stages, improving discrimination of structural and inflammatory disorders relative to functional conditions. The proposed system consistently outperformed flat multi-class classifiers and pre-trained self-supervised models, including META HuBERT and Google HeAR, whose generic objectives are not optimized for sustained clinical phonation. By combining deep spectral representations with interpretable acoustic features, the framework enhances transparency and clinical alignment. These results highlight the potential of quantitative voice biomarkers as scalable, non-invasive tools for early screening, diagnostic triage, and longitudinal monitoring of vocal health.
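To make the three-stage triage cascade above concrete, here is a minimal sketch with scikit-learn stand-ins. The toy features, class groupings, and the replacement of the paper's CNN spectrogram branch by a generic feature vector are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a three-stage hierarchical classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 21))            # stand-in for 21 acoustic biomarkers
y_fine = rng.integers(0, 9, size=300)     # 8 disorders + healthy (class 0)

y_path = (y_fine != 0).astype(int)                      # Stage 1: pathological?
group_map = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2}
y_group = np.array([group_map[c] for c in y_fine])      # Stage 2: 3 broad groups

stage1 = LogisticRegression(max_iter=1000).fit(X, y_path)
stage2 = SVC(kernel="poly", degree=3, probability=True).fit(X, y_group)  # cubic SVM

# Stage 3 consumes the features plus the probabilistic outputs of stages 1-2,
# mirroring the use of upstream posteriors as extra evidence.
X_stage3 = np.hstack([X, stage1.predict_proba(X), stage2.predict_proba(X)])
stage3 = SVC(kernel="poly", degree=3, probability=True).fit(X_stage3, y_fine)
print(stage3.predict(X_stage3[:5]))
```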
[347] AudioFab: Building A General and Intelligent Audio Factory through Tool Learning
Cheng Zhu, Jing Han, Qianshuai Xue, Kehan Wang, Huan Zhao, Zixing Zhang
Main category: cs.SD
TL;DR: AudioFab is an open-source agent framework that provides a unified, modular platform for audio AI tools, solving dependency conflicts and enabling efficient tool collaboration through intelligent selection and few-shot learning.
Details
Motivation: Current audio AI tools are fragmented with complex configurations and inefficient collaboration, lacking a unified framework to unlock their full potential. Existing frameworks suffer from dependency conflicts and poor tool integration.
Method: Introduces a modular design that resolves dependency conflicts, simplifies tool integration and extension. Uses intelligent tool selection and few-shot learning for efficient tool collaboration. Provides a user-friendly natural language interface for non-experts.
Result: Created AudioFab as an open-source framework that establishes an open and intelligent audio-processing ecosystem. The framework offers improved efficiency and accuracy in complex audio tasks compared to existing solutions.
Conclusion: AudioFab provides a stable, extensible platform for future audio and multimodal AI research and development, addressing current fragmentation in the audio AI domain with its modular, user-friendly approach.
Abstract: Currently, artificial intelligence is profoundly transforming the audio domain; however, numerous advanced algorithms and tools remain fragmented, lacking a unified and efficient framework to unlock their full potential. Existing audio agent frameworks often suffer from complex environment configurations and inefficient tool collaboration. To address these limitations, we introduce AudioFab, an open-source agent framework aimed at establishing an open and intelligent audio-processing ecosystem. Compared to existing solutions, AudioFab’s modular design resolves dependency conflicts, simplifying tool integration and extension. It also optimizes tool learning through intelligent selection and few-shot learning, improving efficiency and accuracy in complex audio tasks. Furthermore, AudioFab provides a user-friendly natural language interface tailored for non-expert users. As a foundational framework, AudioFab’s core contribution lies in offering a stable and extensible platform for future research and development in audio and multimodal AI. The code is available at https://github.com/SmileHnu/AudioFab.
[348] Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
Euiyeon Kim, Yong-Hoon Choi
Main category: cs.SD
TL;DR: A new music source separation model using Mamba2 state space model for vocal isolation achieves state-of-the-art performance with 11.03 dB cSDR.
Details
Motivation: Transformer-based approaches often fail to capture intermittently occurring vocals in music source separation, creating a need for better long-range temporal dependency modeling.
Method: Combines Mamba2 (a recent state space model) with band-splitting strategy and dual-path architecture to efficiently handle long input sequences and capture long-range temporal dependencies.
Result: Outperforms recent state-of-the-art models with 11.03 dB cSDR (best reported to date), substantial gains in uSDR, and stable performance across varying input lengths and vocal patterns.
Conclusion: Mamba-based models are effective for high-resolution audio processing and open new directions for broader audio research applications.
Abstract: We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
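The band-split + dual-path pattern is easy to sketch: the spectrogram is cut into sub-bands, and alternating blocks model the time axis and the band axis. In this sketch GRUs stand in for the paper's Mamba2 blocks, which is an assumption made purely to keep the example self-contained.

```python
# Hypothetical dual-path block over (bands, time); GRUs replace Mamba2 layers.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.time_rnn = nn.GRU(dim, dim, batch_first=True)
        self.band_rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                    # x: (batch, bands, time, dim)
        b, k, t, d = x.shape
        h, _ = self.time_rnn(x.reshape(b * k, t, d))      # intra-band, over time
        x = x + h.reshape(b, k, t, d)
        h, _ = self.band_rnn(x.transpose(1, 2).reshape(b * t, k, d))  # across bands
        return x + h.reshape(b, t, k, d).transpose(1, 2)

bands = torch.randn(2, 8, 100, 64)           # 8 sub-bands, 100 frames, dim 64
print(DualPathBlock(64)(bands).shape)        # torch.Size([2, 8, 100, 64])
```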
[349] SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models
Yuan-Kuei Wu, Yang Liu, Yiteng Huang, Zhaojun Yang, Haibin Wu, Ruizhe Huang, Yi-Te Hsu, Shuyu Kong, Ming Sun, Florian Metze, Li Wan
Main category: cs.SD
TL;DR: First test-time adaptation framework for generative spoken language models that updates a small parameter subset during inference using only incoming utterances, improving robustness to acoustic variability without degrading core task performance.
Details
Motivation: Spoken Language Models degrade under real-world acoustic shifts (noise, reverberation, microphone variation), and existing offline domain adaptation solutions are post-hoc, data-intensive, and slow.
Method: Test-time adaptation framework that updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. It stabilizes token distributions and improves robustness to acoustic variability.
Result: Consistent gains under diverse corruptions across automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench. Adaptation is compute- and memory-efficient as it touches only a small fraction of weights.
Conclusion: Enhances robustness and adaptability of generative SLMs for real-world speech-driven applications, supporting deployment on resource-constrained platforms through efficient test-time adaptation.
Abstract: Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but performance degrades under acoustic shift - real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-intensive, and slow. We introduce the first test-time adaptation (TTA) framework for generative SLMs that process interleaved audio-text prompts. Our method updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. This stabilizes token distributions and improves robustness to acoustic variability without degrading core task accuracy. Evaluated on automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench, our approach yields consistent gains under diverse corruptions. Because adaptation touches only a small fraction of weights, it is both compute- and memory-efficient, supporting deployment on resource-constrained platforms. This work enhances the robustness and adaptability of generative SLMs for real-world speech-driven applications.
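The summary says only that a small parameter subset is updated from the incoming utterance. One common instantiation of this idea is Tent-style entropy minimization over normalization parameters; both the objective and the choice of LayerNorm weights below are assumptions, not the paper's exact recipe.

```python
# Hypothetical TTA step: minimize predictive entropy on one utterance,
# updating only LayerNorm parameters (Tent-style assumption).
import torch
import torch.nn as nn

def tta_step(model, inputs, lr=1e-4):
    norm_params = [p for m in model.modules() if isinstance(m, nn.LayerNorm)
                   for p in m.parameters()]
    opt = torch.optim.SGD(norm_params, lr=lr)
    probs = model(inputs).softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    opt.zero_grad(); entropy.backward(); opt.step()   # adapt on this input only
    return entropy.item()

model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 10))
print(tta_step(model, torch.randn(4, 16)))
```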
[350] STSR: High-Fidelity Speech Super-Resolution via Spectral-Transient Context Modeling
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Main category: cs.SD
TL;DR: STSR is a unified MDCT-domain framework for speech super-resolution that balances harmonic coherence and transient sharpness using spectral-contextual attention and sparse-aware regularization, achieving real-time performance with superior fidelity.
Details
Motivation: Current speech SR methods face trade-offs: diffusion models offer high fidelity but are computationally expensive, while efficient time-domain methods lack explicit frequency representations needed for long-range spectral dependencies and harmonic alignment.
Method: STSR operates in the MDCT domain with a Spectral-Contextual Attention mechanism using hierarchical windowing to aggregate non-local spectral context for harmonic reconstruction up to 48kHz, plus sparse-aware regularization to preserve transient components.
Result: STSR consistently outperforms state-of-the-art baselines in both perceptual fidelity and zero-shot generalization, providing robust real-time performance for high-quality speech restoration.
Conclusion: STSR offers a practical, real-time solution for speech super-resolution that effectively reconciles global harmonic coherence with local transient sharpness, overcoming limitations of both diffusion models and time-domain approaches.
Abstract: Speech super-resolution (SR) reconstructs high-fidelity wideband speech from low-resolution inputs, a task that necessitates reconciling global harmonic coherence with local transient sharpness. While diffusion-based generative models yield impressive fidelity, their practical deployment is often stymied by prohibitive computational demands. Conversely, efficient time-domain architectures lack the explicit frequency representations essential for capturing long-range spectral dependencies and ensuring precise harmonic alignment. We introduce STSR, a unified end-to-end framework formulated in the MDCT domain to circumvent these limitations. STSR employs a Spectral-Contextual Attention mechanism that harnesses hierarchical windowing to adaptively aggregate non-local spectral context, enabling consistent harmonic reconstruction up to 48 kHz. Concurrently, a sparse-aware regularization strategy is employed to mitigate the suppression of transient components inherent in compressed spectral representations. STSR consistently outperforms state-of-the-art baselines in both perceptual fidelity and zero-shot generalization, providing a robust, real-time paradigm for high-quality speech restoration.
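To make the "MDCT domain" concrete, here is a direct (O(N^2)) MDCT of one windowed frame, using a sine window and the textbook definition; this is a reference sketch, not the paper's code.

```python
# Textbook MDCT: a frame of length 2N maps to N real coefficients.
import numpy as np

def mdct(frame):
    two_n = len(frame); n = two_n // 2
    win = np.sin(np.pi / two_n * (np.arange(two_n) + 0.5))   # sine window
    ns, ks = np.arange(two_n)[None, :], np.arange(n)[:, None]
    basis = np.cos(np.pi / n * (ns + 0.5 + n / 2) * (ks + 0.5))
    return basis @ (win * frame)

x = np.random.randn(512)
print(mdct(x).shape)                     # (256,)
```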
[351] Audio Super-Resolution with Latent Bridge Models
Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu
Main category: cs.SD
TL;DR: Latent bridge models for audio super-resolution achieve state-of-the-art quality for any-to-48kHz upsampling and set first record for any-to-192kHz audio SR.
Details
Motivation: Previous audio SR methods suffer from sub-optimal quality due to uninformative generation priors. The authors aim to develop a system that fully exploits instructive prior information from LR waveforms for high-quality upsampling.
Method: Compress audio into continuous latent space, use latent bridge models for latent-to-latent generation matching LR-to-HR process. Introduce frequency-aware LBMs that take prior and target frequency as input for any-to-any upsampling. Design cascaded LBMs with prior augmentation strategies for upsampling beyond 48kHz.
Result: Achieves state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals. Sets first record for any-to-192kHz audio SR. Validated on VCTK, ESC-50, Song-Describer datasets and internal testsets.
Conclusion: Latent bridge models with frequency-aware design and cascaded architecture enable high-quality audio super-resolution, unlocking upsampling beyond 48kHz and providing flexibility for audio post-production.
Abstract: Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
[352] Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation
Shuyang Liu, Yuan Jin, Rui Lin, Shizhe Chen, Junyu Dai, Tao Jiang
Main category: cs.SD
TL;DR: HEAR is a music aesthetic evaluation framework that uses multi-scale representations, hierarchical augmentation, and hybrid training to outperform baselines on the ICASSP 2026 SongEval benchmark.
Details
Motivation: Music aesthetic evaluation is challenging due to multidimensional perception and limited labeled data, requiring robust frameworks for accurate scoring and top-tier song identification.
Method: Combines: (1) multi-source multi-scale representations for segment- and track-level features, (2) hierarchical augmentation to prevent overfitting, and (3) hybrid training with regression and ranking losses.
Result: HEAR consistently outperforms baselines across all metrics on both tracks of the ICASSP 2026 SongEval benchmark.
Conclusion: HEAR provides an effective framework for music aesthetic evaluation with superior performance, and the code/model weights are publicly available.
Abstract: Evaluating song aesthetics is challenging due to the multidimensional nature of musical perception and the scarcity of labeled data. We propose HEAR, a robust music aesthetic evaluation framework that combines: (1) a multi-source multi-scale representations module to obtain complementary segment- and track-level features, (2) a hierarchical augmentation strategy to mitigate overfitting, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-tier song identification. Experiments demonstrate that HEAR consistently outperforms the baseline across all metrics on both tracks of the ICASSP 2026 SongEval benchmark. The code and trained model weights are available at https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.
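A hybrid regression-plus-ranking objective of the kind described above can be sketched as MSE for calibrated scores plus a pairwise margin term so that top-tier songs stay correctly ordered. The 0.5 weight and 0.1 margin below are illustrative assumptions, not the paper's settings.

```python
# Hypothetical hybrid loss: pointwise MSE + pairwise margin ranking.
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, margin=0.1, rank_weight=0.5):
    mse = F.mse_loss(pred, target)
    # Every pair (i, j) with target_i > target_j should keep pred_i > pred_j.
    diff_t = target[:, None] - target[None, :]
    diff_p = pred[:, None] - pred[None, :]
    mask = (diff_t > 0).float()
    rank = (F.relu(margin - diff_p) * mask).sum() / mask.sum().clamp_min(1)
    return mse + rank_weight * rank

pred = torch.tensor([3.1, 4.0, 2.2], requires_grad=True)
target = torch.tensor([3.0, 4.5, 2.0])
print(hybrid_loss(pred, target))
```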
[353] AUDRON: A Deep Learning Framework with Fused Acoustic Signatures for Drone Type Recognition
Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee, Deepanjali Mishra
Main category: cs.SD
TL;DR: AUDRON is a hybrid deep learning framework that uses acoustic sensing to detect drones by analyzing their distinctive propeller sounds, achieving over 97% accuracy in classification tasks.
Details
Motivation: As drones become more prevalent across various domains, their misuse raises safety and security concerns. Current detection methods like vision or radar can be expensive or intrusive, while acoustic sensing offers a low-cost, non-intrusive alternative since drone propellers generate unique sound patterns.
Method: AUDRON combines multiple feature representations: Mel-Frequency Cepstral Coefficients (MFCC) and Short-Time Fourier Transform (STFT) spectrograms processed through convolutional neural networks (CNNs) for spatial feature extraction, recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before final classification.
Result: The framework achieves 98.51% accuracy in binary classification and 97.11% accuracy in multiclass classification, effectively differentiating drone acoustic signatures from background noise while maintaining generalizability across varying conditions.
Conclusion: Combining multiple feature representations with deep learning provides reliable acoustic drone detection. AUDRON demonstrates potential for deployment in security and surveillance applications where visual or radar sensing may be limited, offering an effective low-cost alternative for drone detection.
Abstract: Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework’s potential for deployment in security and surveillance applications where visual or radar sensing may be limited.
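Feature-level fusion of the kind AUDRON uses can be sketched with standard librosa calls: MFCC and STFT-magnitude branches are summarized and concatenated before a classifier head. Mean pooling here is a deliberate simplification of the paper's CNN/recurrent branches.

```python
# Sketch of MFCC + STFT feature-level fusion (mean pooling replaces the
# paper's learned branches; any mono clip works in place of the example file).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, frames)
stft = np.abs(librosa.stft(y, n_fft=1024))            # (513, frames)

fused = np.concatenate([mfcc.mean(axis=1), stft.mean(axis=1)])  # (526,)
print(fused.shape)
```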
[354] Distilled HuBERT for Mobile Speech Emotion Recognition: A Cross-Corpus Validation Study
Saifelden M. Ismail
Main category: cs.SD
TL;DR: Mobile-efficient Speech Emotion Recognition using distilled and quantized DistilHuBERT achieves 92% parameter reduction vs Wav2Vec 2.0 while maintaining 91% of baseline accuracy, enabling practical deployment on mobile devices.
Details
Motivation: State-of-the-art transformer architectures for Speech Emotion Recognition have high computational demands that constrain deployment on mobile applications, creating a need for efficient models that can run on resource-constrained devices.
Method: Uses DistilHuBERT (distilled and 8-bit quantized transformer) with rigorous 5-fold Leave-One-Session-Out cross-validation on IEMOCAP for speaker independence, augmented with cross-corpus training on CREMA-D for generalization enhancement.
Result: Achieves 61.4% Unweighted Accuracy with only 23 MB model footprint (91% of baseline accuracy). Cross-corpus training improves Weighted Accuracy by 1.2%, Macro F1 by 1.4%, and reduces variance by 32%. On RAVDESS, model clusters predictions by arousal level due to theatrical nature of acted emotions.
Conclusion: Demonstrates Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on mobile devices. Cross-corpus training enhances generalization, though theatrical emotion datasets present challenges due to acoustic saturation and arousal-based clustering.
Abstract: Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of the Unweighted Accuracy of a full-scale baseline. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than by specific emotion categories - happiness predictions systematically bleed into anger predictions, and sadness predictions bleed into neutral predictions, due to acoustic saturation when actors prioritize clarity over subtlety. Despite this theatricality effect reducing overall RAVDESS accuracy to 46.64%, the model maintains robust arousal detection with 99% recall for anger, 55% recall for neutral, and 27% recall for sadness. These findings demonstrate a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
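The 8-bit quantization step of such a pipeline is typically a one-liner with PyTorch dynamic quantization over the linear layers. The snippet below applies it to the public DistilHuBERT checkpoint on the Hugging Face hub; whether the paper used this exact mechanism is an assumption.

```python
# Dynamic int8 quantization of DistilHuBERT's linear layers (sketch).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("ntu-spml/distilhubert")
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)   # weights stored as int8

mb = sum(p.numel() for p in model.parameters()) * 4 / 1e6
print(f"fp32 parameter size before quantization: ~{mb:.0f} MB")
```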
cs.LG
[355] Zero-Trust Agentic Federated Learning for Secure IIoT Defense Systems
Samaresh Kumar Singh, Joyjit Roy, Martin So
Main category: cs.LG
TL;DR: ZTA-FL is a zero-trust federated learning framework for IIoT security that combines TPM-based attestation, SHAP-weighted aggregation, and on-device adversarial training to defend against Byzantine attacks while maintaining privacy and reducing communication overhead.
Details
Motivation: Recent attacks on critical infrastructure (2021 Oldsmar water treatment, 2023 Danish energy sector) reveal urgent security gaps in Industrial IoT deployments. Existing federated learning frameworks for intrusion detection are vulnerable to Byzantine poisoning attacks and lack robust agent authentication.
Method: Zero-Trust Agentic Federated Learning (ZTA-FL) combines three key components: (1) TPM-based cryptographic attestation with extremely low false acceptance rate, (2) novel SHAP-weighted aggregation algorithm for explainable Byzantine detection under non-IID conditions with theoretical guarantees, and (3) privacy-preserving on-device adversarial training.
Result: Comprehensive experiments across three IDS benchmarks show ZTA-FL achieves 97.8% detection accuracy, 93.2% accuracy under 30% Byzantine attacks (outperforming FLAME by 3.1%, p<0.01), 89.3% adversarial robustness, and reduces communication overhead by 34%.
Conclusion: ZTA-FL provides a robust defense-in-depth framework for IIoT security that addresses critical vulnerabilities in existing federated learning approaches, offering strong Byzantine resilience, privacy preservation, and efficiency improvements with theoretical guarantees and reproducible implementation.
Abstract: Recent attacks on critical infrastructure, including the 2021 Oldsmar water treatment breach and 2023 Danish energy sector compromises, highlight urgent security gaps in Industrial IoT (IIoT) deployments. While Federated Learning (FL) enables privacy-preserving collaborative intrusion detection, existing frameworks remain vulnerable to Byzantine poisoning attacks and lack robust agent authentication. We propose Zero-Trust Agentic Federated Learning (ZTA-FL), a defense-in-depth framework combining: (1) TPM-based cryptographic attestation achieving a false acceptance rate below 0.0000001, (2) a novel SHAP-weighted aggregation algorithm providing explainable Byzantine detection under non-IID conditions with theoretical guarantees, and (3) privacy-preserving on-device adversarial training. Comprehensive experiments across three IDS benchmarks (Edge-IIoTset, CIC-IDS2017, UNSW-NB15) demonstrate that ZTA-FL achieves 97.8 percent detection accuracy, 93.2 percent accuracy under 30 percent Byzantine attacks (outperforming FLAME by 3.1 percent, p less than 0.01), and 89.3 percent adversarial robustness while reducing communication overhead by 34 percent. We provide theoretical analysis, failure mode characterization, and release code for reproducibility.
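One way to picture SHAP-weighted aggregation: each client reports a feature-attribution vector alongside its model update, and clients whose attributions deviate from a robust (median) consensus are down-weighted. The cosine weighting rule below is an assumption standing in for the paper's exact algorithm.

```python
# Hypothetical SHAP-weighted FedAvg: poisoned clients get near-zero weight.
import numpy as np

def shap_weighted_aggregate(updates, shap_vectors):
    shap = np.asarray(shap_vectors)                 # (clients, features)
    consensus = np.median(shap, axis=0)
    cos = shap @ consensus / (
        np.linalg.norm(shap, axis=1) * np.linalg.norm(consensus) + 1e-9)
    w = np.clip(cos, 0, None); w = w / w.sum()      # suspicious clients -> ~0
    return sum(wi * u for wi, u in zip(w, updates))

updates = [np.ones(5), np.ones(5), -10 * np.ones(5)]        # third is poisoned
shaps = [np.array([1, 2, 3, 0, 0.])] * 2 + [np.array([-3, 0, 0, 5, 5.])]
print(shap_weighted_aggregate(updates, shaps))              # recovers ones(5)
```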
[356] Network Traffic Analysis with Process Mining: The UPSIDE Case Study
Francesco Vitale, Paolo Palmiero, Massimiliano Rak, Nicola Mazzocca
Main category: cs.LG
TL;DR: Process mining method analyzes gaming network traffic to model device behavior as interpretable Petri nets and classify different video games being played.
Details
Motivation: Online gaming generates large market revenue and complex network traffic, requiring methods to model device behavior, predict loads, and detect malicious activity. Process mining offers data-driven analysis with model-based insights for gaming network traffic.
Method: Proposed process mining-based method that: 1) performs unsupervised characterization of different states from gaming network data, 2) encodes these states into interpretable Petri nets through process mining, and 3) classifies gaming network traffic to identify different video games being played.
Result: Applied to UPSIDE case study with Clash Royale and Rocket League data. Results show effective modeling with 94.02% inter-device similarity (coherence) and 174.99% inter-state separation (specificity). Achieved 73.84% AUC for classifying the two different video games.
Conclusion: Gaming network behavior can be effectively and interpretably modeled through process mining, representing states as Petri nets with good coherence and specificity while maintaining reasonable classification accuracy for identifying different games.
Abstract: Online gaming is a popular activity involving the adoption of complex systems and network infrastructures. The relevance of gaming, which generates large amounts of market revenue, drove research in modeling network devices’ behavior to evaluate bandwidth consumption, predict and sustain high loads, and detect malicious activity. In this context, process mining appears promising due to its ability to combine data-driven analyses with model-based insights. In this paper, we propose a process mining-based method that analyzes gaming network traffic, allowing: unsupervised characterization of different states from gaming network data; encoding such states through process mining into interpretable Petri nets; and classification of gaming network traffic data to identify different video games being played. We apply the method to the UPSIDE case study, involving gaming network data of several devices interacting with two video games: Clash Royale and Rocket League. Results demonstrate that the gaming network behavior can be effectively and interpretably modeled through states represented as Petri nets with sufficient coherence (94.02% inter-device similarity) and specificity (174.99% inter-state separation) while maintaining a good classification accuracy of the two different video games (73.84% AUC).
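The "encode states as Petri nets" step can be illustrated with pm4py, a common open-source process-mining library (the paper does not say which tool it uses). Traffic events per device would first be mapped to an event log with case, activity, and timestamp columns; the device/activity names below are made up.

```python
# Sketch: discover an interpretable Petri net from a toy traffic event log.
import pandas as pd
import pm4py

log = pd.DataFrame({
    "case:concept:name": ["dev1", "dev1", "dev1", "dev2", "dev2"],
    "concept:name": ["lobby", "match", "sync", "lobby", "match"],
    "time:timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:01", "2024-01-01 10:02",
        "2024-01-01 10:00", "2024-01-01 10:03"]),
})
net, im, fm = pm4py.discover_petri_net_inductive(log)
print(len(net.places), len(net.transitions))
```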
[357] A Comprehensive Study of Deep Learning Model Fixing Approaches
Hanmo You, Zan Wang, Zishuo Dong, Luanqi Mo, Jianjun Zhao, Junjie Chen
Main category: cs.LG
TL;DR: This paper conducts a large-scale empirical study evaluating 16 state-of-the-art DL model fixing approaches across model-level, layer-level, and neuron-level categories, assessing their effectiveness and impact on robustness, fairness, and backward compatibility.
Details
Motivation: DL systems are widely adopted in critical domains but prone to faults that expose users to risks. While many fixing approaches exist, there's a need for comprehensive evaluation of their performance across multiple dimensions beyond just fixing effectiveness.
Method: Conducted empirical study on 16 DL model fixing approaches spanning three categories (model-level, layer-level, neuron-level). Used diverse datasets, model architectures, and application domains within uniform experimental setup. Evaluated fixing effectiveness plus impact on robustness, fairness, and backward compatibility.
Result: Model-level approaches show superior fixing effectiveness. No single approach achieves best fixing performance while improving accuracy and maintaining all other properties. Trade-offs exist between fixing effectiveness and maintaining other critical properties.
Conclusion: Academia should prioritize research on mitigating side effects of DL model fixing approaches. The findings highlight promising directions for future exploration, emphasizing the need for approaches that balance fixing effectiveness with maintaining robustness, fairness, and backward compatibility.
Abstract: Deep Learning (DL) has been widely adopted in diverse industrial domains, including autonomous driving, intelligent healthcare, and aided programming. Like traditional software, DL systems are also prone to faults, whose malfunctioning may expose users to significant risks. Consequently, numerous approaches have been proposed to address these issues. In this paper, we conduct a large-scale empirical study on 16 state-of-the-art DL model fixing approaches, spanning model-level, layer-level, and neuron-level categories, to comprehensively evaluate their performance. We assess not only their fixing effectiveness (their primary purpose) but also their impact on other critical properties, such as robustness, fairness, and backward compatibility. To ensure comprehensive and fair evaluation, we employ a diverse set of datasets, model architectures, and application domains within a uniform experimental setup for experimentation. We summarize several key findings with implications for both industry and academia. For example, model-level approaches demonstrate superior fixing effectiveness compared to others. No single approach can achieve the best fixing performance while improving accuracy and maintaining all other properties. Thus, academia should prioritize research on mitigating these side effects. These insights highlight promising directions for future exploration in this field.
[358] A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios
Haley Rosso, Talea Mayo
Main category: cs.LG
TL;DR: A comprehensive review of diffusion-based simulation-based inference (SBI) methods that use diffusion models for posterior estimation when likelihood functions are intractable, focusing on their robustness in scientific applications with non-ideal data conditions.
Details
Motivation: Scientific inference often involves complex simulators with intractable likelihood functions, making classical likelihood-based methods infeasible. Simulation-based inference (SBI) methods are needed that can work directly with simulator samples without requiring explicit likelihoods. Diffusion models offer a promising framework for SBI due to their flexibility and robustness in handling challenging scientific data conditions.
Method: The review covers diffusion modeling foundations including forward noising processes, reverse-time SDE/ODE formulations, probability flow, and denoising score matching. It explains how conditional scores enable likelihood-free posterior sampling. The paper examines diffusion-based SBI methods including Schrodinger-bridge formulations, conditional and sequential posterior samplers, amortized architectures for unstructured data, and inference-time prior adaptation.
Result: Diffusion models address pain points of normalizing flows in neural posterior/likelihood estimation while introducing new trade-offs like iterative sampling costs. They demonstrate robustness in non-ideal scientific data conditions including model misspecification, unstructured or infinite-dimensional observations, and missing data. The framework provides accurate posteriors under appropriate conditions and caveats.
Conclusion: Diffusion-based SBI offers a flexible and robust framework for scientific inference with intractable likelihoods, particularly valuable for uncertainty quantification in probabilistic geophysical models. The review identifies open problems and emphasizes the importance of understanding conditions required for accurate posterior estimation in scientific applications.
Abstract: For complex simulation problems, inferring parameters of scientific interest often precludes the use of classical likelihood-based techniques due to intractable likelihood functions. Simulation-based inference (SBI) methods forego the need for explicit likelihoods by directly utilizing samples from the simulator to learn posterior distributions over parameters $\boldsymbol{\theta}$ given observed data $\mathbf{x}_{\text{o}}$. Recent work has brought attention to diffusion models – a type of generative model rooted in score matching and reverse-time stochastic dynamics – as a flexible framework for SBI tasks. This article reviews diffusion-based SBI from first principles to applications in practice. We first recall the mathematical foundations of diffusion modeling (forward noising, reverse-time SDE/ODE, probability flow, and denoising score matching) and explain how conditional scores enable likelihood-free posterior sampling. We then examine where diffusion models address pain points of normalizing flows in neural posterior/likelihood estimation and where they introduce new trade-offs (e.g., iterative sampling costs). The key theme of this review is robustness of diffusion-based SBI in non-ideal conditions common to scientific data: misspecification (mismatch between simulated training data and reality), unstructured or infinite-dimensional observations, and missingness. We synthesize methods spanning foundations drawing from Schrodinger-bridge formulations, conditional and sequential posterior samplers, amortized architectures for unstructured data, and inference-time prior adaptation. Throughout, we adopt consistent notation and emphasize conditions and caveats required for accurate posteriors. The review closes with a discussion of open problems with an eye toward applications of uncertainty quantification for probabilistic geophysical models that may benefit from diffusion-based SBI.
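For readers who want the equations the review builds on, these are the standard forms of the forward noising SDE, its reverse-time counterpart with a conditional score, and the conditional denoising-score-matching objective, written in the abstract's notation.

```latex
% Standard forward SDE, reverse-time SDE for posterior sampling, and the
% conditional denoising-score-matching training objective.
\begin{align}
  \mathrm{d}\boldsymbol{\theta}_t &= f(\boldsymbol{\theta}_t, t)\,\mathrm{d}t
      + g(t)\,\mathrm{d}\mathbf{w}_t, \\
  \mathrm{d}\boldsymbol{\theta}_t &= \bigl[f(\boldsymbol{\theta}_t, t)
      - g(t)^2\,\nabla_{\boldsymbol{\theta}_t}\log
        p_t(\boldsymbol{\theta}_t \mid \mathbf{x}_{\text{o}})\bigr]\,\mathrm{d}t
      + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t, \\
  \mathcal{L}(\phi) &= \mathbb{E}_{t,\,(\boldsymbol{\theta}_0,\mathbf{x}),\,\boldsymbol{\theta}_t}
      \bigl\| s_\phi(\boldsymbol{\theta}_t, \mathbf{x}, t)
      - \nabla_{\boldsymbol{\theta}_t}\log
        p_{t\mid 0}(\boldsymbol{\theta}_t \mid \boldsymbol{\theta}_0)\bigr\|^2 .
\end{align}
```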
[359] Coordinate Matrix Machine: A Human-level Concept Learning to Classify Very Similar Documents
Amin Sadri, M Maruf Hossain
Main category: cs.LG
TL;DR: CM² is a Green AI model that achieves human-level one-shot document classification by focusing on structural coordinates rather than semantic vectors, outperforming traditional methods while being computationally efficient and explainable.
Details
Motivation: Humans learn concepts from single examples while machines need hundreds of samples. Current AI trends rely on massive pre-training and energy-intensive infrastructure, creating a need for more efficient, human-like learning approaches.
Method: Coordinate Matrix Machine (CM²) focuses on learning document structures by identifying structural “important features” similar to human cognition. It uses structural coordinates rather than exhaustive semantic vectors for classification.
Result: CM² outperforms traditional vectorizers and complex deep learning models, achieving high accuracy with minimal data (one-shot learning) while being computationally efficient and environmentally sustainable.
Conclusion: CM² demonstrates that human-level concept learning can be achieved through structural intelligence, offering a Green AI alternative to energy-intensive models with advantages in efficiency, explainability, and economic viability.
Abstract: Human-level concept learning argues that humans typically learn new concepts from a single example, whereas machine learning algorithms typically require hundreds of samples to learn a single concept. Our brain subconsciously identifies important features and learns more effectively. Contribution: In this paper, we present the Coordinate Matrix Machine (CM$^2$). This purpose-built small model augments human intelligence by learning document structures and using this information to classify documents. While modern “Red AI” trends rely on massive pre-training and energy-intensive GPU infrastructure, CM$^2$ is designed as a Green AI solution. It achieves human-level concept learning by identifying only the structural “important features” a human would consider, allowing it to classify very similar documents using only one sample per class. Advantage: Our algorithm outperforms traditional vectorizers and complex deep learning models that require larger datasets and significant compute. By focusing on structural coordinates rather than exhaustive semantic vectors, CM$^2$ offers: (1) high accuracy with minimal data (one-shot learning); (2) geometric and structural intelligence; (3) Green AI and environmental sustainability; (4) optimization for CPU-only environments; (5) inherent explainability (glass-box model); (6) faster computation and low latency; (7) robustness against unbalanced classes; (8) economic viability; and (9) a generic, expandable, and extendable design.
[360] Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?
Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Main category: cs.LG
TL;DR: Skip-BART is an end-to-end generative model that predicts human-like stage lighting from audio music, treating automatic stage lighting control as a generative task rather than classification.
Details
Motivation: Existing ASLC solutions are limited to classifying music into categories and mapping to predefined light patterns, resulting in formulaic, monotonous outcomes that lack the creativity and rationality of professional lighting engineers.
Method: Adapts BART model to take audio music as input and produce light hue/intensity output, incorporates novel skip connection mechanism to enhance music-light relationships, creates first stage lighting dataset, and uses pre-training/transfer learning techniques for limited data.
Result: Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only limited gap compared to real lighting engineers, validated through both quantitative analysis and human evaluation.
Conclusion: This work successfully conceptualizes ASLC as a generative task, presents the first stage lighting dataset, and demonstrates that end-to-end learning from experienced lighting engineers can produce vivid, human-like stage lighting.
Abstract: Stage lighting is a vital component in live music performances, shaping an engaging experience for both musicians and audiences. In recent years, Automatic Stage Lighting Control (ASLC) has attracted growing interest due to the high costs of hiring or training professional lighting engineers. However, most existing ASLC solutions only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this gap, this paper presents Skip-BART, an end-to-end model that directly learns from experienced lighting engineers and predicts vivid, human-like stage lighting. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method adapts the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame grid. To address the lack of available datasets, we create the first stage lighting dataset, along with several pre-training and transfer learning techniques to improve model training with limited data. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. To support further research, we have made our self-collected dataset, code, and trained model parameters available at https://github.com/RS2002/Skip-BART .
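A minimal sketch of the music-to-light setup: a BART backbone consumes audio frame features via inputs_embeds and regresses per-frame (hue, value), with a skip connection adding the projected audio features back at the output head. The dimensions and the placement of the skip are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical Skip-BART-style sketch: continuous features in, lights out.
import torch
import torch.nn as nn
from transformers import BartConfig, BartModel

d = 128
bart = BartModel(BartConfig(d_model=d, encoder_layers=2, decoder_layers=2,
                            encoder_attention_heads=4,
                            decoder_attention_heads=4, vocab_size=8))
audio_proj, light_head = nn.Linear(40, d), nn.Linear(d, 2)   # 2 = (hue, value)

audio = torch.randn(1, 50, 40)                   # 50 frames of audio features
src = audio_proj(audio)
out = bart(inputs_embeds=src, decoder_inputs_embeds=src).last_hidden_state
lights = light_head(out + src)                   # skip connection music -> light
print(lights.shape)                              # torch.Size([1, 50, 2])
```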
[361] Geometric Scaling of Bayesian Inference in LLMs
Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
Main category: cs.LG
TL;DR: Large language models preserve geometric structures that enable Bayesian inference, with value representations organizing along an entropy-aligned axis that correlates with predictive uncertainty.
Details
Motivation: To investigate whether geometric signatures observed in small transformers trained in controlled settings (showing exact Bayesian inference capabilities) persist in production-grade language models, and to understand the role of this geometry in uncertainty representation.
Method: Analyzed multiple language model families (Pythia, Phi-2, Llama-3, Mistral) to examine value representations. Performed targeted interventions on the entropy-aligned axis in Pythia-410M during in-context learning, comparing effects of removing/perturbing this axis versus random-axis interventions.
Result: Found that last-layer value representations organize along a single dominant axis strongly correlated with predictive entropy. Domain-restricted prompts collapse this structure into low-dimensional manifolds similar to synthetic settings. Interventions on the entropy-aligned axis selectively disrupt local uncertainty geometry, but don’t proportionally degrade Bayesian-like behavior.
Conclusion: Modern language models preserve the geometric substrate enabling Bayesian inference observed in synthetic settings, organizing approximate Bayesian updates along this substrate. The geometry serves as a privileged readout of uncertainty rather than a singular computational bottleneck.
Abstract: Recent work has shown that small transformers trained in controlled “wind-tunnel” settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate – low-dimensional value manifolds and progressively orthogonal keys – that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.
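The basic probe is simple to reproduce in outline: take last-layer hidden states, find the dominant principal axis, and correlate each token's projection onto it with the model's predictive entropy. Final-layer hidden states below stand in for the value representations the paper analyzes, which is an assumption.

```python
# Sketch of an entropy-axis probe on Pythia-410M.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")

ids = tok("The capital of France is", return_tensors="pt").input_ids
out = model(ids, output_hidden_states=True)
h = out.hidden_states[-1][0]                      # (seq, dim)
probs = out.logits[0].softmax(-1)
entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)

_, _, v = torch.pca_lowrank(h - h.mean(0))        # dominant principal axes
proj = (h - h.mean(0)) @ v[:, 0]
r = torch.corrcoef(torch.stack([proj, entropy]))[0, 1]
print(f"axis-entropy correlation: {r:.3f}")
```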
[362] Generalized Regularized Evidential Deep Learning Models: Theory and Comprehensive Evaluation
Deep Shankar Pandey, Hyomin Choi, Qi Yu
Main category: cs.LG
TL;DR: EDL models using Subjective Logic face activation-dependent learning-freeze issues due to non-negative evidence constraints; authors propose new activation functions and regularizers to solve this.
Details
Motivation: Evidential deep learning models based on Subjective Logic provide efficient uncertainty quantification but suffer from activation-dependent learning-freeze behavior where gradients become extremely small in low-evidence regions, hindering training.
Method: Theoretical analysis of learning dynamics with different evidential activations, followed by design of a general family of activation functions and corresponding evidential regularizers that enable consistent evidence updates across activation regimes.
Result: Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet), two few-shot classification problems, and blind face restoration validate the theory and demonstrate effectiveness of proposed generalized regularized evidential models.
Conclusion: The proposed generalized activation functions and regularizers successfully address the learning-freeze problem in evidential deep learning, enabling more stable training and improved uncertainty quantification across diverse applications.
Abstract: Evidential deep learning (EDL) models, based on Subjective Logic, introduce a principled and computationally efficient way to make deterministic neural networks uncertainty-aware. The resulting evidential models can quantify fine-grained uncertainty using learned evidence. However, the Subjective-Logic framework constrains evidence to be non-negative, requiring specific activation functions whose geometric properties can induce activation-dependent learning-freeze behavior: a regime where gradients become extremely small for samples mapped into low-evidence regions. We theoretically characterize this behavior and analyze how different evidential activations influence learning dynamics. Building on this analysis, we design a general family of activation functions and corresponding evidential regularizers that provide an alternative pathway for consistent evidence updates across activation regimes. Extensive experiments on four benchmark classification problems (MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet), two few-shot classification problems, and blind face restoration problem empirically validate the developed theory and demonstrate the effectiveness of the proposed generalized regularized evidential models.
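For context, here is the standard evidential-classification setup the paper builds on: non-negative evidence via an activation, Dirichlet parameters alpha = evidence + 1, and the expected-MSE evidential loss. With ReLU evidence, negative logits get zero gradient, which is the learning-freeze regime analyzed above; this sketch shows the standard formulation, not the paper's proposed activations.

```python
# Standard evidential (Dirichlet) classification loss; activation is swappable.
import torch
import torch.nn.functional as F

def edl_mse_loss(logits, onehot, activation=F.softplus):
    evidence = activation(logits)           # non-negative evidence
    alpha = evidence + 1.0
    s = alpha.sum(-1, keepdim=True)
    p = alpha / s                           # expected class probabilities
    err = ((onehot - p) ** 2).sum(-1)
    var = (p * (1 - p) / (s + 1)).sum(-1)   # Dirichlet variance term
    return (err + var).mean()

logits = torch.randn(4, 3, requires_grad=True)
onehot = F.one_hot(torch.tensor([0, 1, 2, 0]), 3).float()
print(edl_mse_loss(logits, onehot), edl_mse_loss(logits, onehot, F.relu))
```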
[363] HINTS: Extraction of Human Insights from Time-Series Without External Sources
Sheo Yon Jhin, Noseong Park
Main category: cs.LG
TL;DR: HINTS is a self-supervised framework that extracts latent human factors from time series residuals without external data, using opinion dynamics as inductive bias to improve forecasting accuracy.
Details
Motivation: Human decision-making, emotions, and collective psychology shape financial/economic systems, but current models rely on expensive external data (news, social media) with high financial, computational, and practical costs.
Method: HINTS uses Friedkin-Johnsen opinion dynamics model as structural inductive bias to extract latent human factors (social influence, memory, bias) from time series residuals. These factors are integrated into a state-of-the-art backbone model as an attention map.
Result: Experiments on nine real-world and benchmark datasets show HINTS consistently improves forecasting accuracy. Case studies validate interpretability with strong semantic alignment between extracted factors and real-world events.
Conclusion: HINTS provides a practical, cost-effective alternative to external data approaches by endogenously extracting human factors from time series residuals, improving both accuracy and interpretability.
Abstract: Human decision-making, emotions, and collective psychology are complex factors that shape the temporal dynamics observed in financial and economic systems. Many recent time series forecasting models leverage external sources (e.g., news and social media) to capture human factors, but these approaches incur high data dependency costs in terms of financial, computational, and practical implications. In this study, we propose HINTS, a self-supervised learning framework that extracts these latent factors endogenously from time series residuals without external data. HINTS leverages the Friedkin-Johnsen (FJ) opinion dynamics model as a structural inductive bias to model evolving social influence, memory, and bias patterns. The extracted human factors are integrated into a state-of-the-art backbone model as an attention map. Experimental results using nine real-world and benchmark datasets demonstrate that HINTS consistently improves forecasting accuracy. Furthermore, multiple case studies and ablation studies validate the interpretability of HINTS, showing strong semantic alignment between the extracted factors and real-world events and underscoring the framework's practical utility.
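The Friedkin-Johnsen recursion used as the inductive bias is compact: each node mixes social influence W with a stubborn attachment to its initial opinion x0. In HINTS the opinions would be fit to time-series residuals; the matrices below are random toy values.

```python
# Friedkin-Johnsen update: x(t+1) = Lambda W x(t) + (I - Lambda) x(0).
import numpy as np

def fj_step(x, x0, W, lam):
    # lam[i] = susceptibility of node i; (1 - lam[i]) anchors it to x0[i]
    return lam * (W @ x) + (1 - lam) * x0

rng = np.random.default_rng(0)
W = rng.random((5, 5)); W /= W.sum(1, keepdims=True)   # row-stochastic influence
x0 = rng.normal(size=5); lam = rng.random(5)

x = x0.copy()
for _ in range(100):
    x = fj_step(x, x0, W, lam)
print(x)                                               # converged opinions
```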
[364] Learning Coupled System Dynamics under Incomplete Physical Constraints and Missing Data
Esha Saha, Hao Wang
Main category: cs.LG
TL;DR: MUSIC is a sparsity-induced multitask neural network framework that integrates partial physical constraints with data-driven learning to recover full-dimensional solutions of coupled systems when physics and data are available for different variables.
Details
Motivation: Many complex systems have coupled variables, but governing equations are typically known for only one variable while others are only accessible through data. This mismatch between known physics and observed data poses a fundamental challenge for existing physics-informed machine learning approaches that assume either complete equation knowledge or full data availability across all variables.
Method: MUSIC uses a sparsity-induced multitask neural network framework with mesh-free random sampling of training data and sparsity regularization. It integrates partial physical constraints with data-driven learning to handle mutually exclusive physics-constrained and data-informed variables in coupled systems.
Result: MUSIC accurately learns solutions to complex coupled systems (shock wave solutions, discontinuous solutions, pattern formation solutions) under data-scarce and noisy conditions, consistently outperforming non-sparse formulations. It yields highly compressed models with improved training and evaluation efficiency.
Conclusion: MUSIC provides a flexible and effective approach for modeling partially observed systems with incomplete physical knowledge, addressing the fundamental challenge of mismatched physics and data availability in coupled systems.
Abstract: Advances in data acquisition and computational methods have accelerated the use of differential equation based modelling for complex systems. Such systems are often described by two (or more) coupled variables, yet the governing equation is typically available for one variable, while the remaining variable can be accessed only through data. This mismatch between known physics and observed data poses a fundamental challenge for existing physics-informed machine learning approaches, which generally assume either complete knowledge of the governing equations or full data availability across all variables. In this paper, we introduce MUSIC (Multitask Learning Under Sparse and Incomplete Constraints), a sparsity induced multitask neural network framework that integrates partial physical constraints with data-driven learning to recover full-dimensional solutions of coupled systems when physics-constrained and data-informed variables are mutually exclusive. MUSIC employs mesh-free (random) sampling of training data and sparsity regularization, yielding highly compressed models with improved training and evaluation efficiency. We demonstrate that MUSIC accurately learns solutions (shock wave solutions, discontinuous solutions, pattern formation solutions) to complex coupled systems under data-scarce and noisy conditions, consistently outperforming non-sparse formulations. These results highlight MUSIC as a flexible and effective approach for modeling partially observed systems with incomplete physical knowledge.
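A sketch of the MUSIC idea for a coupled pair (u, v): one network head is trained on data for v only, the other is constrained by the known equation residual for u only, with an L1 penalty inducing sparsity. The toy system du/dt = -v, with v observed as cos(t), is an illustrative stand-in for the paper's equations.

```python
# Multitask physics + data + sparsity loss on a toy coupled system.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))  # -> (u, v)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

t_data = torch.rand(64, 1)                      # mesh-free random sampling
v_obs = torch.cos(t_data)                       # v known only through data

for _ in range(200):
    t = torch.rand(64, 1, requires_grad=True)
    u, v = net(t).split(1, dim=1)
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    loss_phys = ((du_dt + v) ** 2).mean()       # physics for u: du/dt = -v
    loss_data = ((net(t_data)[:, 1:] - v_obs) ** 2).mean()
    loss_l1 = sum(p.abs().sum() for p in net.parameters())
    loss = loss_phys + loss_data + 1e-5 * loss_l1
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```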
[365] Triple-BERT: Do We Really Need MARL for Order Dispatch on Ride-Sharing Platforms?
Zijian Zhao, Sen Li
Main category: cs.LG
TL;DR: Triple-BERT is a centralized single-agent reinforcement learning method for large-scale ride-sharing order dispatching that uses action decomposition and BERT-based networks to handle vast action/observation spaces, achieving significant improvements over state-of-the-art methods.
Details
Motivation: Current MARL approaches for ride-sharing order dispatching have limitations: independent MARL lacks global information and cooperation, while CTDE MARL suffers from dimensionality issues due to large numbers of drivers and orders creating extensive observation spaces.
Method: Proposes Triple-BERT, a centralized single-agent RL method based on a TD3 variant. Uses action decomposition to break the joint action probability into individual driver actions. Employs a BERT-based network with parameter reuse to handle the large observation space, using attention mechanisms to capture complex driver-order relationships.
Result: Validated on real-world Manhattan ride-hailing dataset. Achieves 11.95% improvement over state-of-the-art methods, with 4.26% increase in served orders and 22.25% reduction in pickup times.
Conclusion: Triple-BERT effectively addresses large-scale order dispatching challenges by combining centralized single-agent RL with action decomposition and BERT-based architecture, outperforming existing MARL approaches while being computationally efficient.
Abstract: On-demand ride-sharing platforms, such as Uber and Lyft, face the intricate real-time challenge of bundling and matching passengers, each with distinct origins and destinations, to available vehicles, all while navigating significant system uncertainties. Due to the extensive observation space arising from the large number of drivers and orders, order dispatching, though fundamentally a centralized task, is often addressed using Multi-Agent Reinforcement Learning (MARL). However, independent MARL methods fail to capture global information and exhibit poor cooperation among workers, while Centralized Training Decentralized Execution (CTDE) MARL methods suffer from the curse of dimensionality. To overcome these challenges, we propose Triple-BERT, a centralized Single-Agent Reinforcement Learning (SARL) method designed specifically for large-scale order dispatching on ride-sharing platforms. Built on a TD3 variant, our approach addresses the vast action space through an action decomposition strategy that breaks down the joint action probability into individual driver action probabilities. To handle the extensive observation space, we introduce a novel BERT-based network, where parameter reuse mitigates parameter growth as the number of drivers and orders increases, and the attention mechanism effectively captures the complex relationships among the large pool of drivers and orders. We validate our method using a real-world ride-hailing dataset from Manhattan. Triple-BERT achieves approximately an 11.95% improvement over current state-of-the-art methods, with a 4.26% increase in served orders and a 22.25% reduction in pickup times. Our code, trained model parameters, and processed data are publicly available at https://github.com/RS2002/Triple-BERT.
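The action decomposition above is the key trick: instead of a joint softmax over all driver-order assignments, each driver gets an independent categorical distribution, so the joint log-probability is a sum of per-driver terms. A minimal sketch, with toy embeddings standing in for the BERT-based encoder's output:

```python
import torch

def per_driver_action_probs(driver_emb, order_emb):
    """Factorize the joint dispatch action into independent per-driver
    categorical distributions over orders (an extra order row could model
    'stay idle'); returns a (drivers, orders) probability matrix."""
    scale = driver_emb.shape[-1] ** 0.5
    logits = driver_emb @ order_emb.T / scale      # attention-style scores
    return torch.softmax(logits, dim=-1)

driver_emb = torch.randn(8, 32)                    # toy encoder output: 8 drivers
order_emb = torch.randn(5, 32)                     # 5 open orders
probs = per_driver_action_probs(driver_emb, order_emb)
choice = probs.argmax(dim=-1)                      # greedy per-driver assignment
joint_logp = probs[torch.arange(8), choice].log().sum()
# The joint log-probability is a sum of per-driver terms, so the action space
# grows linearly rather than combinatorially with the number of drivers.
```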
[366] Drift-Based Dataset Stability Benchmark
Dominik Soukup, Richard Plný, Daniel Vašata, Tomáš Čejka
Main category: cs.LG
TL;DR: The paper proposes a methodology to evaluate dataset stability and a benchmarking workflow for network traffic classification datasets to address concept drift issues.
Details
Motivation: ML models for network traffic classification degrade after deployment due to concept drift from obsolete datasets and evolving network protocols. Current practice involves complete retraining without investigating root causes, assuming good dataset quality, which isn't always true.
Method: A novel methodology based on concept drift detection that uses ML feature weights to boost detection performance, plus a benchmark workflow to compare datasets and evaluate stability.
Result: Demonstrated benefits on CESNET-TLS-Year22 dataset, providing initial dataset stability benchmark to describe stability and identify weak points for optimization. Showed optimization impact on created dataset variants using the proposed methodology.
Conclusion: The proposed framework enables systematic evaluation of dataset stability and comparison of datasets, helping identify optimization needs and measure improvement impact in network traffic classification.
Abstract: Machine learning (ML) represents an efficient and popular approach for network traffic classification. However, network traffic classification is a challenging domain, and trained models may degrade soon after deployment due to obsolete datasets and the quick evolution of computer networks as new or updated protocols appear. Moreover, a significant change in the behavior of a traffic type (and, therefore, the underlying features representing the traffic) can produce a large and sudden performance drop of the deployed model, known as data or concept drift. In most cases, complete retraining is performed, often without further investigation of root causes, as good dataset quality is assumed. However, this is not always the case, and further investigation must be performed. This paper proposes a novel methodology to evaluate the stability of datasets and a benchmark workflow that can be used to compare datasets. The proposed framework is based on a concept drift detection method that also uses ML feature weights to boost the detection performance. The benefits of this work are demonstrated on the CESNET-TLS-Year22 dataset. We provide an initial dataset stability benchmark that describes dataset stability and weak points and identifies the next steps for optimization. Lastly, using the proposed benchmarking methodology, we show the optimization impact on the created dataset variants.
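The paper does not spell out its detector here, but the idea of boosting drift detection with ML feature weights can be sketched as a feature-importance-weighted two-sample test over a reference window and a current window. The KS statistic and the weighting scheme below are assumptions for illustration, not the paper's method:

```python
import numpy as np
from scipy.stats import ks_2samp

def weighted_drift_score(ref, cur, feature_weights):
    """Per-feature two-sample KS statistic between a reference and a current
    window, aggregated with (normalized) model feature weights so that drift
    in important features counts more."""
    stats = np.array([ks_2samp(ref[:, j], cur[:, j]).statistic
                      for j in range(ref.shape[1])])
    w = feature_weights / feature_weights.sum()
    return float((w * stats).sum())

rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 4))
cur = rng.normal(size=(1000, 4))
cur[:, 0] += 0.5                     # inject drift into the most important feature
score = weighted_drift_score(ref, cur, np.array([0.6, 0.2, 0.1, 0.1]))
```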
[367] Federated Multi-Task Clustering
Suyan Dai, Gan Sun, Fazeng Li, Xu Tang, Qianqian Wang, Yang Cong
Main category: cs.LG
TL;DR: FMTC is a federated multi-task clustering framework that learns personalized clustering models for heterogeneous clients while capturing shared knowledge via tensor low-rank regularization, outperforming existing federated clustering methods.
Details
Motivation: Existing spectral clustering models are centralized and inapplicable to decentralized environments. Current federated learning approaches suffer from poor generalization due to unreliable pseudo-labels and fail to capture correlations among heterogeneous clients.
Method: FMTC has two components: 1) a client-side personalized clustering module that learns parameterized mapping models for robust out-of-sample inference without pseudo-labels, and 2) a server-side tensorial correlation module that organizes client models into a tensor with low-rank regularization to discover a common subspace. Uses an ADMM-based distributed algorithm for privacy-preserving optimization.
Result: Extensive experiments on multiple real-world datasets show FMTC significantly outperforms various baseline and state-of-the-art federated clustering algorithms.
Conclusion: FMTC effectively addresses limitations of existing federated clustering by learning personalized models while capturing shared structure, demonstrating superior performance through privacy-preserving distributed optimization.
Abstract: Spectral clustering has emerged as one of the most effective clustering algorithms due to its superior performance. However, most existing models are designed for centralized settings, rendering them inapplicable in modern decentralized environments. Moreover, current federated learning approaches often suffer from poor generalization performance due to reliance on unreliable pseudo-labels, and fail to capture the latent correlations amongst heterogeneous clients. To tackle these limitations, this paper proposes a novel framework named Federated Multi-Task Clustering (i.e., FMTC), which intends to learn personalized clustering models for heterogeneous clients while collaboratively leveraging their shared underlying structure in a privacy-preserving manner. More specifically, the FMTC framework is composed of two main components: a client-side personalized clustering module, which learns a parameterized mapping model to support robust out-of-sample inference, bypassing the need for unreliable pseudo-labels; and a server-side tensorial correlation module, which explicitly captures the shared knowledge across all clients. This is achieved by organizing all client models into a unified tensor and applying a low-rank regularization to discover their common subspace. To solve this joint optimization problem, we derive an efficient, privacy-preserving distributed algorithm based on the Alternating Direction Method of Multipliers, which decomposes the global problem into parallel local updates on clients and an aggregation step on the server. Finally, extensive experiments on multiple real-world datasets demonstrate that our proposed FMTC framework significantly outperforms various baseline and state-of-the-art federated clustering algorithms.
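The server-side step, organizing client models into a tensor and applying low-rank regularization, typically reduces (inside an ADMM loop) to singular value thresholding on an unfolding of that tensor. A minimal sketch of that standard subproblem, with flattened client parameters standing in for real models; the threshold and shapes are illustrative:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, i.e. the usual ADMM subproblem for a low-rank regularizer."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Stack per-client model parameters (flattened) into a clients x params matrix;
# a tensor unfolding reduces to the same matrix operation.
rng = np.random.default_rng(0)
clients = [rng.normal(size=256) for _ in range(10)]
W = np.stack(clients)           # (10, 256) "unfolded" client tensor
W_lowrank = svt(W, tau=0.5)     # server-side shared-subspace step
```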
[368] Neural Optimal Design of Experiment for Inverse Problems
John E. Darges, Babak Maboudi Afkham, Matthias Chung
Main category: cs.LG
TL;DR: NODE is a learning-based framework for optimal experimental design that jointly trains neural reconstruction models with continuous design variables, avoiding traditional bilevel optimization and sparsity regularization.
Details
Motivation: Traditional optimal experimental design methods rely on bilevel optimization and indirect sparsity regularization, which can be computationally expensive and require careful tuning of regularization parameters.
Method: NODE jointly trains a neural reconstruction model and continuous design variables (sensor locations, sampling times, measurement angles) in a single optimization loop, enforcing sparsity by design rather than through regularization.
Result: NODE outperforms baseline approaches on exponential growth benchmark, MNIST image sampling, and sparse view X-ray CT, showing improved reconstruction accuracy and task-specific performance.
Conclusion: NODE provides an effective, computationally efficient alternative to classical optimal experimental design methods by directly optimizing measurement locations and eliminating the need for sparsity regularization tuning.
Abstract: We introduce Neural Optimal Design of Experiments (NODE), a learning-based framework for optimal experimental design in inverse problems that avoids classical bilevel optimization and indirect sparsity regularization. NODE jointly trains a neural reconstruction model and a fixed-budget set of continuous design variables representing sensor locations, sampling times, or measurement angles, within a single optimization loop. By optimizing measurement locations directly rather than weighting a dense grid of candidates, the proposed approach enforces sparsity by design, eliminates the need for $\ell_1$ tuning, and substantially reduces computational complexity. We validate NODE on an analytically tractable exponential growth benchmark and on MNIST image sampling, and illustrate its effectiveness on a real-world sparse-view X-ray CT example. In all cases, NODE outperforms baseline approaches, demonstrating improved reconstruction accuracy and task-specific performance.
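The single-loop idea can be sketched on the exponential-growth benchmark: measurement times are continuous nn.Parameters optimized jointly with the reconstruction network, so sparsity comes from the fixed budget of sensors rather than an $\ell_1$ term. Sizes, learning rate, and parameter ranges below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy single-loop design: 4 measurement times (the design variables) are
# trained together with a reconstruction MLP that recovers (x0, r) from the
# measurements y(t_i) = x0 * exp(r * t_i).
n_sensors = 4
times = nn.Parameter(torch.rand(n_sensors))          # design variables in [0, 1]
net = nn.Sequential(nn.Linear(n_sensors, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(list(net.parameters()) + [times], lr=1e-2)

for step in range(200):
    x0 = torch.rand(128, 1) + 0.5                    # random problem instances
    r = torch.rand(128, 1) * 2.0
    y = x0 * torch.exp(r * times.clamp(0, 1))        # measurements at learned times
    pred = net(y)                                    # reconstruct (x0, r)
    loss = ((pred - torch.cat([x0, r], dim=1)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```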
[369] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu
Main category: cs.LG
TL;DR: KernelEvolve is an agentic kernel coding framework that automates kernel generation and optimization for deep learning recommendation models across heterogeneous hardware architectures, reducing development time from weeks to hours while achieving substantial performance improvements.
Details
Motivation: Deep learning recommendation model (DLRM) training and inference face three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation/architecture heterogeneity. These challenges make manual kernel optimization difficult and time-consuming across diverse hardware platforms.
Method: KernelEvolve takes kernel specifications as input and automates kernel generation/optimization through graph-based search with a selection policy, universal operator, fitness function, and termination rule. It operates at multiple programming abstractions (from Triton and CuTe DSL to low-level hardware-agnostic languages) and uses retrieval-augmented prompt synthesis to dynamically adapt to runtime execution context.
Result: Achieved 100% pass rate on all 250 problems in KernelBench suite across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms with 100% correctness. Reduced development time from weeks to hours and achieved substantial performance improvements over PyTorch baselines across diverse production use cases.
Conclusion: KernelEvolve successfully addresses hardware heterogeneity challenges for DLRM systems, significantly mitigates programmability barriers for new AI hardware, and enables automated kernel generation for in-house developed AI accelerators while delivering substantial efficiency improvements.
Abstract: Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as a graph-based search with a selection policy, universal operator, fitness function, and termination rule, and dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
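At a high level, the described graph-based search reduces to an evolutionary loop over the four listed ingredients. A hypothetical skeleton; in the real system, `mutate` and `fitness` would wrap LLM-driven code rewriting, compilation, correctness checks, and on-device benchmarking:

```python
import random

def evolve_kernel(seed, mutate, fitness, budget=50, pop_size=8):
    """Skeleton of the search described above: tournament selection over a
    candidate population, a mutation operator, a fitness function, and a
    termination rule (names and policies are illustrative)."""
    population = [seed]
    best_f, best = fitness(seed), seed
    for _ in range(budget):
        parent = max(random.sample(population, min(3, len(population))), key=fitness)
        child = mutate(parent)                     # propose a new kernel candidate
        population = sorted(population + [child], key=fitness)[-pop_size:]
        f = fitness(child)
        if f > best_f:
            best_f, best = f, child
        if best_f >= 1.0:                          # termination rule: good enough
            break
    return best

# Toy usage: "kernels" are integers, fitness rewards closeness to a target.
random.seed(0)
best = evolve_kernel(0,
                     mutate=lambda k: k + random.choice([-3, -1, 1, 3]),
                     fitness=lambda k: 1.0 - abs(k - 42) / 100.0)
```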
[370] Exploring Cumulative Effects in Survival Data Using Deep Learning Networks
Kang-Chung Yang, Shinsheng Yuan
Main category: cs.LG
TL;DR: CENNSurv is a deep learning approach for survival analysis that models cumulative effects of time-dependent exposures with improved scalability and interpretability compared to traditional methods.
Details
Motivation: Existing methods have limitations: conventional spline-based approaches require repeated data transformation and struggle with large datasets, while neural network methods focus on accuracy but lack interpretability of cumulative exposure patterns.
Method: CENNSurv (Cumulative Exposure Neural Network for Survival analysis) uses deep learning to capture dynamic risk relationships from time-dependent data, providing interpretable insights into cumulative exposure effects.
Result: On two real-world datasets, CENNSurv revealed a multi-year lagged association between chronic environmental exposure and survival outcomes, and identified critical short-term behavioral shifts prior to subscription lapse, demonstrating ability to model complex temporal patterns with improved scalability.
Conclusion: CENNSurv provides researchers with a practical tool for studying cumulative effects of time-dependent exposures, offering both scalability for large datasets and interpretable insights into temporal patterns.
Abstract: In epidemiological research, modeling the cumulative effects of time-dependent exposures on survival outcomes presents a challenge due to their intricate temporal dynamics. Conventional spline-based statistical methods, though effective, require repeated data transformation for each spline parameter tuning, with survival analysis computations relying on the entire dataset, posing difficulties for large datasets. Meanwhile, existing neural network-based survival analysis methods focus on accuracy but often overlook the interpretability of cumulative exposure patterns. To bridge this gap, we introduce CENNSurv, a novel deep learning approach that captures dynamic risk relationships from time-dependent data. Evaluated on two diverse real-world datasets, CENNSurv revealed a multi-year lagged association between chronic environmental exposure and a critical survival outcome, as well as a critical short-term behavioral shift prior to subscription lapse. This demonstrates CENNSurv’s ability to model complex temporal patterns with improved scalability. CENNSurv provides researchers studying cumulative effects a practical tool with interpretable insights.
[371] A Granular Grassmannian Clustering Framework via the Schubert Variety of Best Fit
Karim Salta, Michael Kirby, Chris Peterson
Main category: cs.LG
TL;DR: SVBF-LBG: Subspace clustering using Schubert Variety prototypes instead of means, improving cluster purity while maintaining geometric structure.
Details
Motivation: Need better geometric representatives for subspace datasets that preserve mathematical structure while improving clustering performance. Traditional subspace means on Grassmann/flag manifolds may not capture optimal cluster characteristics.
Method: Introduces the Schubert Variety of Best Fit (SVBF) as a trainable prototype that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrates SVBF into the Linde-Buzo-Gray (LBG) clustering pipeline for subspace data.
Result: SVBF-LBG achieves improved cluster purity on synthetic, image, spectral, and video action datasets compared to traditional methods, while retaining mathematical structure needed for downstream analysis.
Conclusion: SVBF prototypes provide superior geometric representatives for subspace clustering, offering both performance improvements and mathematical structure preservation for subsequent analysis tasks.
Abstract: In many classification and clustering tasks, it is useful to compute a geometric representative for a dataset or a cluster, such as a mean or median. When datasets are represented by subspaces, these representatives become points on the Grassmann or flag manifold, with distances induced by their geometry, often via principal angles. We introduce a subspace clustering algorithm that replaces subspace means with a trainable prototype defined as a Schubert Variety of Best Fit (SVBF): a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated into the Linde-Buzo-Gray (LBG) pipeline, this SVBF-LBG scheme yields improved cluster purity on synthetic, image, spectral, and video action data, while retaining the mathematical structure required for downstream analysis.
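The SVBF objective can be made concrete with principal angles: for orthonormal bases X and Y, the cosines of the principal angles between their spans are the singular values of X^T Y, so "nearly intersecting each member in at least one direction" means the smallest principal angle to each member is small. A sketch of that cost (the optimization over prototypes, e.g. on a Stiefel manifold, is omitted):

```python
import numpy as np

def smallest_principal_angle(X, Y):
    """Smallest principal angle between span(X) and span(Y) for orthonormal
    bases X (n x p) and Y (n x q): arccos of the largest singular value of X^T Y."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(s.max(), -1.0, 1.0))

def svbf_cost(prototype, members):
    """SVBF-style objective: small when the prototype nearly intersects every
    member subspace in at least one direction."""
    return sum(smallest_principal_angle(prototype, M) ** 2 for M in members)

def orthonormal(A):
    Q, _ = np.linalg.qr(A)
    return Q

rng = np.random.default_rng(0)
members = [orthonormal(rng.normal(size=(20, 3))) for _ in range(5)]
proto = orthonormal(rng.normal(size=(20, 4)))
print(svbf_cost(proto, members))
```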
[372] Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics
Bin Xu, Ayan Banerjee, Sandeep Gupta
Main category: cs.LG
TL;DR: MERINDA is an FPGA-accelerated framework for model recovery that achieves 114× lower energy, 28× smaller memory, and 1.68× faster training than GPU implementations while matching accuracy.
Details
Motivation: Physical AI at the edge requires hardware-efficient learning for real-time understanding of real-world dynamics. Current model recovery methods rely on Neural ODEs that are computationally expensive and difficult to accelerate on edge hardware with strict latency, compute, and power constraints.
Method: MERINDA replaces Neural ODE components with a hardware-friendly formulation combining: (1) GRU-based discretized dynamics, (2) dense inverse-ODE layers, (3) sparsity-driven dropout, and (4) lightweight ODE solvers. The computation is structured for streaming parallelism to enable full parallelization on FPGAs.
Result: Across four benchmark nonlinear dynamical systems, MERINDA delivers 114× lower energy (434J vs. 49,375J), 28× smaller memory footprint (214MB vs. 6,118MB), and 1.68× faster training while matching state-of-the-art model-recovery accuracy.
Conclusion: MERINDA enables accurate, explainable model recovery at the edge for real-time monitoring of autonomous systems, making physical AI practical on resource-constrained devices.
Abstract: Physical AI at the edge – enabling autonomous systems to understand and predict real-world dynamics in real time – requires hardware-efficient learning and inference. Model recovery (MR), which identifies governing equations from sensor data, is a key primitive for safe and explainable monitoring in mission-critical autonomous systems operating under strict latency, compute, and power constraints. However, state-of-the-art MR methods (e.g., EMILY and PINN+SR) rely on Neural ODE formulations that require iterative solvers and are difficult to accelerate efficiently on edge hardware. We present MERINDA (Model Recovery in Reconfigurable Dynamic Architecture), an FPGA-accelerated MR framework designed to make physical AI practical on resource-constrained devices. MERINDA replaces expensive Neural ODE components with a hardware-friendly formulation that combines (i) GRU-based discretized dynamics, (ii) dense inverse-ODE layers, (iii) sparsity-driven dropout, and (iv) lightweight ODE solvers. The resulting computation is structured for streaming parallelism, enabling critical kernels to be fully parallelized on the FPGA. Across four benchmark nonlinear dynamical systems, MERINDA delivers substantial gains over GPU implementations: 114× lower energy (434 J vs. 49,375 J), 28× smaller memory footprint (214 MB vs. 6,118 MB), and 1.68× faster training, while matching state-of-the-art model-recovery accuracy. These results demonstrate that MERINDA can bring accurate, explainable MR to the edge for real-time monitoring of autonomous systems.
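Component (i), GRU-based discretized dynamics, can be sketched as a residual GRU rollout that replaces an ODE solver's integration loop. The sizes and the residual readout below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GRUDynamics(nn.Module):
    """Hardware-friendly stand-in for a Neural ODE: the continuous dynamics
    are discretized and rolled out by a GRU cell with a residual readout."""
    def __init__(self, state_dim, hidden=32):
        super().__init__()
        self.cell = nn.GRUCell(state_dim, hidden)
        self.readout = nn.Linear(hidden, state_dim)

    def forward(self, x0, steps):
        h = torch.zeros(x0.shape[0], self.cell.hidden_size)
        x, traj = x0, []
        for _ in range(steps):
            h = self.cell(x, h)
            x = x + self.readout(h)      # Euler-like residual state update
            traj.append(x)
        return torch.stack(traj, dim=1)  # (batch, steps, state_dim)

model = GRUDynamics(state_dim=2)
trajectory = model(torch.randn(4, 2), steps=50)
```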
[373] Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions
Ankit Kanwar, Dominik Wagner, Luke Ong
Main category: cs.LG
TL;DR: SB-TRPO is a trust-region RL algorithm that balances reward maximization with strict safety constraints by adaptively biasing policy updates toward constraint satisfaction while maintaining reward improvement.
Details
Motivation: Existing RL methods for safety-critical domains either fail to ensure near-zero safety violations or sacrifice reward performance when dealing with hard constraints. There's a need for algorithms that can maximize rewards while strictly adhering to safety constraints.
Method: Safety-Biased Trust Region Policy Optimisation (SB-TRPO) adaptively biases policy updates toward constraint satisfaction while seeking reward improvement. It performs trust-region updates using a convex combination of natural policy gradients for cost and reward, ensuring a fixed fraction of optimal cost reduction at each step.
Result: Experiments on standard and challenging Safety Gymnasium tasks show SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
Conclusion: SB-TRPO provides a theoretically-grounded approach for hard-constrained RL that effectively balances safety and performance, with theoretical guarantees of local progress toward safety and reward improvement when gradients are aligned.
Abstract: Reinforcement learning (RL) in safety-critical domains requires agents to maximise rewards while strictly adhering to safety constraints. Existing approaches, such as Lagrangian and projection-based methods, often either fail to ensure near-zero safety violations or sacrifice reward performance in the face of hard constraints. We propose Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a new trust-region algorithm for hard-constrained RL. SB-TRPO adaptively biases policy updates towards constraint satisfaction while still seeking reward improvement. Concretely, it performs trust-region updates using a convex combination of the natural policy gradients of cost and reward, ensuring a fixed fraction of optimal cost reduction at each step. We provide a theoretical guarantee of local progress towards safety, with reward improvement when gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and meaningful task completion compared to state-of-the-art methods.
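One way to realize "a convex combination ensuring a fixed fraction of optimal cost reduction" is to pick the smallest mixing weight λ such that the first-order cost decrease along the combined direction is at least a fraction κ of what pure cost descent would achieve. The closed form below is a simplified sketch; plain gradients stand in for natural gradients, and the paper's exact rule may differ:

```python
import numpy as np

def sb_trpo_direction(g_reward, g_cost, kappa=0.5, eps=1e-8):
    """Mix reward ascent with cost descent so the first-order cost reduction
    is at least `kappa` times that of pure cost descent (simplified sketch)."""
    b = float(g_cost @ g_cost)            # cost reduction rate of pure cost descent
    a = float(-(g_cost @ g_reward))       # cost reduction rate of pure reward ascent
    if a >= kappa * b:
        lam = 0.0                         # the reward step already reduces cost enough
    else:
        lam = float(np.clip((kappa * b - a) / (b - a + eps), 0.0, 1.0))
    return (1.0 - lam) * g_reward - lam * g_cost, lam

g_r = np.array([1.0, 0.0])                # toy reward gradient
g_c = np.array([0.6, 0.8])                # toy cost gradient, partly conflicting
d, lam = sb_trpo_direction(g_r, g_c)      # -g_c @ d equals kappa * ||g_c||^2 here
```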
[374] FineFT: Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading
Molei Qin, Xinyu Cai, Yewen Li, Haochong Xia, Chuqiao Zong, Shuo Sun, Xinrun Wang, Bo An
Main category: cs.LG
TL;DR: FineFT is a three-stage ensemble RL framework for crypto futures trading that addresses high leverage challenges through selective ensemble updates, profitability filtering, and VAE-guided risk management to handle new market states.
Details
Motivation: Existing RL methods for quantitative trading focus on spot markets and fail in futures markets due to two key challenges: 1) high leverage amplifies reward fluctuations making training unstable, and 2) lack of self-awareness about capability boundaries exposes systems to significant losses during black swan events.
Method: Three-stage ensemble RL framework: Stage I - ensemble Q learners selectively updated by ensemble TD errors for stable convergence; Stage II - filter Q-learners based on profitability and train VAEs on market states to identify capability boundaries; Stage III - choose between filtered ensemble and conservative policy guided by VAEs to balance profitability and risk mitigation.
Result: FineFT outperforms 12 SOTA baselines in 6 financial metrics on crypto futures with 5x leverage, reducing risk by more than 40% while achieving superior profitability compared to runner-up. Visualization shows agents specialize in distinct market dynamics, and ablation studies confirm VAE routing reduces maximum drawdown while selective update improves convergence.
Conclusion: The proposed FineFT framework successfully addresses the challenges of high-leverage futures trading through ensemble RL with stable training mechanisms and VAE-guided risk management, demonstrating both improved profitability and significant risk reduction in high-frequency crypto futures trading.
Abstract: Futures are contracts obligating the exchange of an asset at a predetermined date and price; they are notable for their high leverage and liquidity and therefore thrive in the crypto market. RL has been widely applied to various quantitative tasks. However, most methods focus on the spot market and cannot be directly applied to the futures market with high leverage because of two challenges. First, high leverage amplifies reward fluctuations, making training stochastic and difficult to converge. Second, prior works lacked self-awareness of capability boundaries, exposing them to the risk of significant loss when encountering new market states (e.g., a black swan event like COVID-19). To tackle these challenges, we propose the Efficient and Risk-Aware Ensemble Reinforcement Learning for Futures Trading (FineFT), a novel three-stage ensemble RL framework with stable training and proper risk management. In stage I, ensemble Q-learners are selectively updated by ensemble TD errors to improve convergence. In stage II, we filter the Q-learners based on their profitability and train VAEs on market states to identify the capability boundaries of the learners. In stage III, we choose between the filtered ensemble and a conservative policy, guided by the trained VAEs, to maintain profitability and mitigate risk in new market states. Through extensive experiments on crypto futures in a high-fidelity, high-frequency trading environment with 5x leverage, we demonstrate that FineFT outperforms 12 SOTA baselines on 6 financial metrics, reducing risk by more than 40% while achieving superior profitability compared to the runner-up. Visualization of the selective update mechanism shows that different agents specialize in distinct market dynamics, and ablation studies confirm that routing with VAEs effectively reduces maximum drawdown and that the selective update improves convergence and performance.
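Stage III's routing rule can be sketched with a tiny VAE over market states: when a state reconstructs poorly, it likely lies outside the learners' capability boundary and the conservative policy takes over. The architectures, the reconstruction-error criterion, and the threshold below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StateVAE(nn.Module):
    """Tiny VAE over market states; poor reconstruction flags a state outside
    the ensemble's capability boundary."""
    def __init__(self, dim, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.dec(z), mu, logvar

def route_action(state, vae, ensemble_policy, conservative_policy, threshold=1.0):
    """Stage-III style routing: trust the filtered ensemble on familiar states,
    fall back to the conservative policy on unfamiliar ones."""
    with torch.no_grad():
        recon, _, _ = vae(state)
        err = ((recon - state) ** 2).mean().item()
    policy = ensemble_policy if err < threshold else conservative_policy
    return policy(state)

vae = StateVAE(dim=12)
state = torch.randn(12)
action = route_action(state, vae,
                      ensemble_policy=lambda s: torch.tensor(1.0),    # toy policies
                      conservative_policy=lambda s: torch.tensor(0.0))
```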
[375] A Survey on Graph Neural Networks for Fraud Detection in Ride Hailing Platforms
Kanishka Hewageegana, Janani Harischandra, Nipuna Senanayake, Gihan Danansuriya, Kavindu Hapuarachchi, Pooja Illangarathne
Main category: cs.LG
TL;DR: The paper investigates fraud detection in ride-hailing platforms using Graph Neural Networks (GNNs), comparing various models and addressing challenges like class imbalance and fraudulent camouflage.
Details
Motivation: Address fraudulent activities in online ride-hailing platforms by leveraging GNNs, as fraud detection is crucial for platform security and user trust in the rapidly evolving ride-hailing industry.
Method: Analyzes and compares various Graph Neural Network models for fraud detection, with a focus on addressing class imbalance and fraudulent camouflage. The paper provides a structured overview of GNN architectures and methodologies applied to anomaly detection.
Result: The research identifies significant methodological progress and gaps in GNN-based fraud detection, highlighting the effectiveness of various models while noting the need for further exploration into real-world applicability.
Conclusion: The paper concludes that while GNNs show promise for fraud detection in ride-hailing platforms, further research is needed to enhance real-world applicability and make technical improvements to fraud detection strategies in this evolving industry.
Abstract: This study investigates fraud detection in ride-hailing platforms through Graph Neural Networks (GNNs), focusing on the effectiveness of various models. By analyzing prevalent fraudulent activities, the research highlights and compares existing work related to fraud detection, which can be useful when addressing fraudulent incidents within online ride-hailing platforms. The paper also addresses class imbalance and fraudulent camouflage. It outlines a structured overview of GNN architectures and methodologies applied to anomaly detection, identifying significant methodological progress and gaps. The paper calls for further exploration into real-world applicability and technical improvements to enhance fraud detection strategies in the rapidly evolving ride-hailing industry.
[376] TabMixNN: A Unified Deep Learning Framework for Structural Mixed Effects Modeling on Tabular Data
Deniz Akdemir
Main category: cs.LG
TL;DR: TabMixNN is a PyTorch framework that combines mixed-effects models with neural networks for tabular data, supporting hierarchical structures, multiple outcome types, and interpretability tools.
Details
Motivation: Address the need for methods that can handle hierarchical data structures while supporting diverse outcome types (regression, classification, multitask learning) and bridge the gap between classical mixed-effects modeling and modern deep learning approaches.
Method: Modular three-stage architecture: (1) mixed-effects encoder with variational random effects and flexible covariance structures, (2) backbone architectures including GSEM and spatial-temporal manifold networks, (3) outcome-specific prediction heads for multiple outcome families. Includes R-style formula interface, DAG constraints for causal learning, SPDE kernels for spatial modeling, and interpretability tools.
Result: Demonstrated framework flexibility through applications to longitudinal data analysis, genomic prediction, and spatial-temporal modeling. Provides unified interface for leveraging deep learning while maintaining interpretability and theoretical grounding of classical mixed-effects models.
Conclusion: TabMixNN successfully synthesizes classical mixed-effects modeling with modern neural networks, offering researchers a flexible, interpretable framework for tabular data analysis that handles hierarchical structures and diverse outcome types while maintaining accessibility through familiar interfaces.
Abstract: We present TabMixNN, a flexible PyTorch-based deep learning framework that synthesizes classical mixed-effects modeling with modern neural network architectures for tabular data analysis. TabMixNN addresses the growing need for methods that can handle hierarchical data structures while supporting diverse outcome types including regression, classification, and multitask learning. The framework implements a modular three-stage architecture: (1) a mixed-effects encoder with variational random effects and flexible covariance structures, (2) backbone architectures including Generalized Structural Equation Models (GSEM) and spatial-temporal manifold networks, and (3) outcome-specific prediction heads supporting multiple outcome families. Key innovations include an R-style formula interface for accessibility, support for directed acyclic graph (DAG) constraints for causal structure learning, Stochastic Partial Differential Equation (SPDE) kernels for spatial modeling, and comprehensive interpretability tools including SHAP values and variance decomposition. We demonstrate the framework’s flexibility through applications to longitudinal data analysis, genomic prediction, and spatial-temporal modeling. TabMixNN provides a unified interface for researchers to leverage deep learning while maintaining the interpretability and theoretical grounding of classical mixed-effects models.
[377] Improved Bounds for Private and Robust Alignment
Wenqian Weng, Yi He, Xingyu Zhou
Main category: cs.LG
TL;DR: Theoretical analysis of private and robust alignment of language models, establishing upper bounds on suboptimality gaps in offline/online settings under privacy constraints and adversarial corruption.
Details
Motivation: To provide theoretical understanding of language model alignment under both privacy constraints and adversarial corruption, analyzing the interplay between these two challenges in preference learning settings.
Method: Established theoretical upper bounds using uniform convergence guarantees for log loss and square loss under privacy and corruption. Analyzed two scenarios: privacy-first and corruption-first interplays. Used MLE-style algorithms with log loss for privacy-only setting, and extended analysis to existing offline algorithms for joint privacy-corruption setting.
Result: For privacy-only setting: log loss with MLE-style algorithm achieves near-optimal rates. For joint setting: existing offline algorithms provide stronger guarantees than previously known, with improved bounds in corruption-only regime. First results presented for private and robust online alignment.
Conclusion: The paper provides theoretical foundations for private and robust language model alignment, demonstrating that log loss with MLE-style algorithms can achieve near-optimal rates under privacy constraints, and that existing algorithms offer stronger guarantees than previously recognized in joint privacy-corruption settings.
Abstract: In this paper, we study the private and robust alignment of language models from a theoretical perspective by establishing upper bounds on the suboptimality gap in both offline and online settings. We consider preference labels subject to privacy constraints and/or adversarial corruption, and analyze two distinct interplays between them: privacy-first and corruption-first. For the privacy-only setting, we show that log loss with an MLE-style algorithm achieves near-optimal rates, in contrast to conventional wisdom. For the joint privacy-and-corruption setting, we first demonstrate that existing offline algorithms in fact provide stronger guarantees – simultaneously in terms of corruption level and privacy parameters – than previously known, which further yields improved bounds in the corruption-only regime. In addition, we also present the first set of results for private and robust online alignment. Our results are enabled by new uniform convergence guarantees for log loss and square loss under privacy and corruption, which we believe have broad applicability across learning theory and statistics.
[378] MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling
Mahdi Karami, Ali Behrouz, Peilin Zhong, Razvan Pascanu, Vahab Mirrokni
Main category: cs.LG
TL;DR: MS-SSM introduces a multi-scale state-space model framework that captures both fine-grained and coarse patterns across multiple resolutions, improving memory efficiency and long-range modeling while maintaining computational efficiency.
Details
Motivation: Traditional SSMs have limited effective memory and struggle to capture multi-scale dependencies, which are essential for modeling complex structures in time series, images, and natural language. They require larger state sizes for improved recall and fail to handle hierarchical patterns effectively.
Method: Proposes a multi-scale SSM framework that represents sequence dynamics across multiple resolutions, with each resolution processed by specialized state-space dynamics. Introduces an input-dependent scale-mixer for dynamic information fusion across resolutions.
Result: MS-SSM consistently outperforms prior SSM-based models on benchmarks including Long Range Arena, hierarchical reasoning, time series classification, and image recognition. It demonstrates improved sequence modeling, particularly in long-range and hierarchical tasks.
Conclusion: Multi-resolution processing in state-space architectures significantly enhances memory efficiency and long-range modeling capabilities, making MS-SSM a superior alternative to traditional SSMs for complex sequence modeling tasks across various domains.
Abstract: State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. They rely on linear recurrences to integrate information over time, enabling fast inference, parallelizable training, and control over recurrence stability. However, traditional SSMs often suffer from limited effective memory, requiring larger state sizes for improved recall. Moreover, existing SSMs struggle to capture multi-scale dependencies, which are essential for modeling complex structures in time series, images, and natural language. This paper introduces a multi-scale SSM framework that addresses these limitations by representing sequence dynamics across multiple resolutions and processing each resolution with specialized state-space dynamics. By capturing both fine-grained, high-frequency patterns and coarse, global trends, MS-SSM enhances memory efficiency and long-range modeling. We further introduce an input-dependent scale-mixer, enabling dynamic information fusion across resolutions. The proposed approach significantly improves sequence modeling, particularly in long-range and hierarchical tasks, while maintaining computational efficiency. Extensive experiments on benchmarks, including Long Range Arena, hierarchical reasoning, time series classification, and image recognition, demonstrate that MS-SSM consistently outperforms prior SSM-based models, highlighting the benefits of multi-resolution processing in state-space architectures.
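A toy rendering of the multi-scale idea: the sequence is pooled to several resolutions, each resolution runs its own diagonal linear recurrence, the outputs are upsampled back to full length, and an input-dependent softmax mixes the scales. Everything here (pooling choice, sigmoid-parameterized decay, nearest-neighbor upsampling, the sequential scan in place of a parallel scan) is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMSSSM(nn.Module):
    """Toy multi-scale SSM: one diagonal linear recurrence per resolution plus
    an input-dependent mixer over scales."""
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.decay = nn.Parameter(torch.rand(len(scales), dim))  # per-scale state decay
        self.mixer = nn.Linear(dim, len(scales))                 # input-dependent weights

    def scan(self, x, a):                        # h_t = sigmoid(a) * h_{t-1} + x_t
        h, out = torch.zeros_like(x[:, 0]), []
        for t in range(x.shape[1]):
            h = torch.sigmoid(a) * h + x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x):                        # x: (batch, length, dim)
        B, L, D = x.shape
        ys = []
        for s, a in zip(self.scales, self.decay):
            xs = F.avg_pool1d(x.transpose(1, 2), s, stride=s)    # coarsen by factor s
            y = self.scan(xs.transpose(1, 2), a)                 # per-scale recurrence
            ys.append(F.interpolate(y.transpose(1, 2), size=L).transpose(1, 2))
        w = torch.softmax(self.mixer(x), dim=-1)                 # (B, L, n_scales)
        return sum(w[..., i:i + 1] * ys[i] for i in range(len(ys)))

model = ToyMSSSM(dim=16)
out = model(torch.randn(2, 32, 16))              # (2, 32, 16)
```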
[379] Exploiting the Prior of Generative Time Series Imputation
YuYang Miao, Chang Li, Zehua Chen
Main category: cs.LG
TL;DR: Bridge-TS improves time series imputation by using data-to-data generation with expert and compositional priors from pretrained models, achieving state-of-the-art accuracy on benchmark datasets.
Details
Motivation: Existing generative methods for time series imputation (like diffusion models and Schrodinger bridges) use uninformative priors (Gaussian noise or linear interpolation), which increases generation burden and limits imputation accuracy. The authors aim to improve prior design to enhance generative imputation performance.
Method: Bridge-TS introduces a data-to-data generation process with two novel prior designs: 1) Expert prior - uses a pretrained transformer-based module to provide deterministic estimations as informative priors; 2) Compositional priors - combines multiple pretrained models’ estimations in the generation process for compositional priors-to-target imputation.
Result: Experiments on benchmark datasets (ETT, Exchange, Weather) show Bridge-TS achieves new state-of-the-art imputation accuracy in terms of mean square error (MSE) and mean absolute error (MAE).
Conclusion: Improving prior design in generative time series imputation through expert and compositional priors significantly enhances imputation accuracy, demonstrating the superiority of informative priors over traditional uninformative ones.
Abstract: Time series imputation, i.e., filling the missing values of a time recording, finds various applications in electricity, finance, and weather modelling. Previous methods have introduced generative models such as diffusion probabilistic models and Schrodinger bridge models to conditionally generate the missing values from Gaussian noise or directly from linear interpolation results. However, as their prior is not informative of the ground-truth target, their generation process inevitably suffers an increased burden and limited imputation accuracy. In this work, we present Bridge-TS, building a data-to-data generation process for generative time series imputation and improving the design of the prior with two novel components. Firstly, we propose an expert prior, leveraging a pretrained transformer-based module as an expert to fill the missing values with a deterministic estimation, and then taking the results as the prior of the ground-truth target. Secondly, we explore compositional priors, utilizing several pretrained models to provide different estimation results and then combining them in the data-to-data generation process to achieve a compositional priors-to-target imputation process. Experiments conducted on several benchmark datasets such as ETT, Exchange, and Weather show that Bridge-TS sets a new record of imputation accuracy in terms of mean square error and mean absolute error, demonstrating the benefit of improving the prior for generative time series imputation.
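The data-to-data idea replaces a Gaussian prior with the expert's imputation as the generation start point. A minimal sketch of a training-time interpolant, assuming a Brownian-bridge noise schedule between prior and target; the actual bridge parameterization in the paper may differ:

```python
import torch

def bridge_sample(prior, target, t, sigma=0.1):
    """Interpolant for a prior-to-target bridge: at time t the state is a mix
    of the expert-prior imputation and the ground truth, with Brownian-bridge
    noise that vanishes at both endpoints (t=0 and t=1)."""
    mean = (1.0 - t) * prior + t * target
    std = sigma * torch.sqrt(t * (1.0 - t))
    return mean + std * torch.randn_like(prior)

prior = torch.randn(8, 96)     # pretrained imputer's estimate (the "expert prior")
target = torch.randn(8, 96)    # ground-truth series values
t = torch.rand(8, 1)
x_t = bridge_sample(prior, target, t)   # regression input for the bridge model
```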
[380] Trellis: Learning to Compress Key-Value Memory in Attention Models
Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni
Main category: cs.LG
TL;DR: Trellis is a novel Transformer architecture with bounded memory that dynamically compresses KV cache at test time using a two-pass recurrent compression mechanism.
Details
Motivation: Transformers suffer from quadratic computational complexity and ever-growing KV cache memory requirements, which limits their efficiency and scalability for long sequences.
Method: Replaces standard KV cache with fixed-size memory and trains a two-pass recurrent compression mechanism with online gradient descent and forget gate to dynamically compress and update memory at test time.
Result: Outperforms strong baselines on language modeling, common-sense reasoning, recall-intensive tasks, and time series. Performance gains increase with sequence length.
Conclusion: Trellis offers an efficient Transformer architecture with bounded memory that shows strong potential for long-context applications by dynamically compressing KV cache.
Abstract: Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
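The core memory update can be sketched as one step of online gradient descent with a forget gate on an associative matrix memory; the two-pass mechanism and learned gating are omitted, and the loss and rates below are illustrative assumptions:

```python
import torch

def update_memory(M, k, v, lr=0.1, forget=0.99):
    """One online-gradient step of a compressive KV memory: decay old content
    with a forget gate, then descend 0.5 * ||M k - v||^2 so that M stores the
    new key-value association."""
    pred = M @ k                          # what the memory currently recalls for k
    grad = torch.outer(pred - v, k)       # gradient of 0.5 * ||M k - v||^2 w.r.t. M
    return forget * M - lr * grad

d_k, d_v = 16, 16
M = torch.zeros(d_v, d_k)                 # fixed-size memory replacing the KV cache
for _ in range(50):                       # stream of incoming tokens
    k, v = torch.randn(d_k), torch.randn(d_v)
    M = update_memory(M, k, v)
recall = M @ k                            # approximate value for the last key
```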
[381] Flow Matching Neural Processes
Hussen Abu Hamad, Dan Rosenbaum
Main category: cs.LG
TL;DR: A new neural process model using flow matching for conditional distribution prediction with ODE-based sampling and controllable accuracy-speed tradeoff.
Details
Motivation: To create a simpler, more effective neural process model that can sample from conditional distributions without auxiliary conditioning methods, offering better performance and a controllable tradeoff between accuracy and computation time.
Method: Introduces a neural process model based on flow matching, a generative modeling paradigm. The model provides amortized predictions of conditional distributions over arbitrary points and uses an ODE solver for sampling without requiring auxiliary conditioning methods.
Result: Outperforms previous state-of-the-art neural process methods on various benchmarks including synthetic 1D Gaussian processes data, 2D images, and real-world weather data.
Conclusion: The flow matching-based neural process model offers a simpler implementation, effective conditional sampling via ODE solver, and superior performance across multiple domains while providing a controllable accuracy-speed tradeoff.
Abstract: Neural processes (NPs) are a class of models that learn stochastic processes directly from data and can be used for inference, sampling and conditional sampling. We introduce a new NP model based on flow matching, a generative modeling paradigm that has demonstrated strong performance on various data modalities. Following the NP training framework, the model provides amortized predictions of conditional distributions over any arbitrary points in the data. Compared to previous NP models, our model is simple to implement and can be used to sample from conditional distributions using an ODE solver, without requiring auxiliary conditioning methods. In addition, the model provides a controllable tradeoff between accuracy and running time via the number of steps in the ODE solver. We show that our model outperforms previous state-of-the-art neural process methods on various benchmarks including synthetic 1D Gaussian processes data, 2D images, and real-world weather data.
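Sampling from the conditional distribution amounts to integrating the learned velocity field with an ODE solver; the number of integration steps is the accuracy-versus-speed knob mentioned above. A sketch with simple Euler steps and a stand-in network (the real model's conditioning interface is an assumption here):

```python
import torch
import torch.nn as nn

class DummyVelocity(nn.Module):
    """Stand-in for the trained conditional velocity field v(x, t, context)."""
    def __init__(self, dim, ctx_dim):
        super().__init__()
        self.f = nn.Linear(dim + 1 + ctx_dim, dim)

    def forward(self, x, t, context):
        return self.f(torch.cat([x, t, context], dim=-1))

def sample_conditional(velocity_net, context, x0, n_steps=20):
    """Euler integration from noise at t=0 to a conditional sample at t=1;
    more steps trade running time for accuracy."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_net(x, t, context)
    return x

net = DummyVelocity(dim=2, ctx_dim=4)
samples = sample_conditional(net, context=torch.randn(8, 4), x0=torch.randn(8, 2))
```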
[382] Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
Yue Guan, Changming Yu, Shihan Fang, Weiming Hu, Zaifeng Pan, Zheng Wang, Zihan Liu, Yangjie Zhou, Yufei Ding, Minyi Guo, Jingwen Leng
Main category: cs.LG
TL;DR: Yggdrasil is a system that optimizes speculative decoding for LLM inference through context-aware tree drafting and compiler-friendly execution, achieving up to 3.98× speedup over state-of-the-art baselines.
Details
Motivation: Existing speculative decoding systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions, limiting the potential speedup of LLM inference.
Method: Yggdrasil introduces: 1) an equal-growth tree structure for static graph compatibility, 2) a latency-aware optimization objective for draft selection, and 3) stage-based scheduling to reduce overhead. It supports unmodified LLMs and uses context-aware tree drafting with compiler-friendly execution.
Result: Yggdrasil achieves up to 3.98× speedup over state-of-the-art baselines across multiple hardware setups while supporting unmodified LLMs.
Conclusion: Yggdrasil demonstrates that co-designing speculative decoding systems with context-aware tree drafting and compiler-friendly execution enables latency-optimal LLM inference with significant speedup over existing approaches.
Abstract: Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
[383] Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining
Ruizhe Huang, Kexuan Zhang, Yihao Fang, Baifeng Yu
Main category: cs.LG
TL;DR: Infini-attention enhances long-context capabilities in small language models (300M params) through compressed memory, achieving up to 31% higher accuracy than baseline despite some performance degradation at extreme lengths.
Details
Motivation: Enable efficient small language models (SLMs) for low-resource settings by improving long-context extrapolation capabilities while maintaining computational efficiency and reducing costs.
Method: Empirical study using 300M-parameter LLaMA models pretrained with Infini-attention, which builds compressed memory from past segments while preserving local attention mechanisms.
Result: Infini-attention demonstrates training stability and outperforms baseline in long-context retrieval, achieving up to 31% higher accuracy despite performance degradation at 16,384-token contexts. Balance factor identified as key performance component.
Conclusion: Architectural memory mechanisms like Infini-attention effectively compensate for SLMs’ limited parameters and are beneficial for achieving robust long-context capabilities in small language models.
Abstract: This study investigates small-scale pretraining for Small Language Models (SLMs) to enable efficient use of limited data and compute, improve accessibility in low-resource settings, and reduce costs. To enhance long-context extrapolation in compact models, we focus on Infini-attention, which builds a compressed memory from past segments while preserving local attention. In our work, we conduct an empirical study using 300M-parameter LLaMA models pretrained with Infini-attention. The model demonstrates training stability and outperforms the baseline in long-context retrieval. We identify the balance factor as a key component of model performance, and we find that retrieval accuracy drops with repeated memory compressions over long sequences. Even so, Infini-attention still effectively compensates for the SLM’s limited parameters. In particular, despite performance degradation at a 16,384-token context, the Infini-attention model achieves up to 31% higher accuracy than the baseline. Our findings suggest that achieving robust long-context capability in SLMs benefits from architectural memory like Infini-attention.
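For reference, the compressed memory at the heart of Infini-attention is a linear-attention associative matrix: each segment's keys and values are folded into a running matrix M and normalizer z, and queries read the memory back out. A minimal sketch of the update and retrieval (without the delta-rule variant); the learned gate that mixes this readout with local attention is the "balance factor" discussed above:

```python
import torch
import torch.nn.functional as F

def sigma(x):                      # ELU + 1 keeps the feature map positive
    return F.elu(x) + 1.0

def update_memory(M, z, K, V):
    """Compress one segment's KV pairs into the running associative memory."""
    return M + sigma(K).T @ V, z + sigma(K).sum(dim=0)

def retrieve(M, z, Q):
    """Linear-attention readout of the compressed memory for queries Q."""
    q = sigma(Q)
    return (q @ M) / (q @ z).unsqueeze(-1).clamp(min=1e-6)

d = 16
M, z = torch.zeros(d, d), torch.zeros(d)
for _ in range(4):                                 # stream of segments
    K, V = torch.randn(32, d), torch.randn(32, d)
    M, z = update_memory(M, z, K, V)
A_mem = retrieve(M, z, torch.randn(32, d))         # (32, d) memory readout
# A learned gate then combines A_mem with the local-attention output.
```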
[384] Max-Entropy Reinforcement Learning with Flow Matching and A Case Study on LQR
Yuyang Zhang, Yang Hu, Bo Dai, Na Li
Main category: cs.LG
TL;DR: SAC-Flow: A soft actor-critic variant using flow-based models for more expressive policies, with online flow matching for efficient updates.
Details
Motivation: Standard SAC uses simple policy approximations for efficiency, sacrificing expressiveness and robustness. Flow-based models offer rich expressiveness but need efficient online training methods.
Method: Parameterizes SAC policy with flow-based models, uses instantaneous change-of-variable for policy evaluation, and develops importance sampling flow matching (ISFM) for online policy updates with samples from user-specified distributions.
Result: Theoretical analysis of ISFM shows how sampling distribution choices affect learning efficiency. Case study on max-entropy LQR demonstrates learning optimal action distributions.
Conclusion: SAC-Flow combines flow-based policies with efficient online training, achieving expressive action distributions while maintaining practical efficiency.
Abstract: Soft actor-critic (SAC) is a popular algorithm for max-entropy reinforcement learning. In practice, the energy-based policies in SAC are often approximated using simple policy classes for efficiency, sacrificing the expressiveness and robustness. In this paper, we propose a variant of the SAC algorithm that parameterizes the policy with flow-based models, leveraging their rich expressiveness. In the algorithm, we evaluate the flow-based policy utilizing the instantaneous change-of-variable technique and update the policy with an online variant of flow matching developed in this paper. This online variant, termed importance sampling flow matching (ISFM), enables policy update with only samples from a user-specified sampling distribution rather than the unknown target distribution. We develop a theoretical analysis of ISFM, characterizing how different choices of sampling distributions affect the learning efficiency. Finally, we conduct a case study of our algorithm on the max-entropy linear quadratic regulator problems, demonstrating that the proposed algorithm learns the optimal action distribution.
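The instantaneous change-of-variable technique evaluates the policy's log-density by integrating d(log p)/dt = -tr(df/dx) alongside the state. Below is a sketch with exact autograd Jacobian traces, adequate for small action dimensions (a Hutchinson estimator would scale better); the Euler discretization and the toy field are assumptions:

```python
import torch

def flow_log_prob(f, x0, base_log_prob, n_steps=20):
    """Euler integration of the instantaneous change of variables,
    d(log p)/dt = -tr(df/dx), with exact autograd Jacobian traces."""
    x, logp, dt = x0, base_log_prob(x0), 1.0 / n_steps
    for i in range(n_steps):
        x = x.detach().requires_grad_(True)
        dx = f(x, i * dt)
        trace = sum(torch.autograd.grad(dx[:, j].sum(), x, retain_graph=True)[0][:, j]
                    for j in range(x.shape[1]))
        x = x + dt * dx
        logp = logp - dt * trace
    return x.detach(), logp

f = lambda x, t: -x                       # toy velocity field: contraction toward 0
gauss = lambda x: -0.5 * ((x ** 2).sum(-1)
                          + x.shape[-1] * torch.log(torch.tensor(2 * torch.pi)))
xT, logp = flow_log_prob(f, torch.randn(4, 2), gauss)
```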
[385] Efficient Deep Learning for Short-Term Solar Irradiance Time Series Forecasting: A Benchmark Study in Ho Chi Minh City
Tin Hoang
Main category: cs.LG
TL;DR: Benchmark of 10 deep learning models for 1-hour ahead solar irradiance forecasting shows Transformer achieves best accuracy (R²=0.9696), with Knowledge Distillation enabling 23.5% model compression while improving performance.
Details
Motivation: Reliable GHI forecasting is crucial for managing solar energy variability in power grids, requiring accurate short-term predictions to support grid stability and integration of renewable energy sources.
Method: Comprehensive benchmark of 10 deep learning architectures (LSTM, TCN, Transformer, Informer, iTransformer, TSMixer, Mamba, etc.) using 10 years of high-resolution NSRDB satellite data from Ho Chi Minh City for 1-hour ahead forecasting, with SHAP analysis for interpretability and Knowledge Distillation for model compression.
Result: Transformer achieved highest predictive accuracy (R²=0.9696), SHAP analysis revealed Transformer’s recency bias vs Mamba’s 24-hour periodic dependency utilization, and Knowledge Distillation compressed Transformer by 23.5% while reducing MAE to 23.78 W/m².
Conclusion: Transformer is optimal for GHI forecasting, Knowledge Distillation enables efficient deployment on edge devices, and different architectures exhibit distinct temporal reasoning patterns that can inform model selection for specific forecasting needs.
Abstract: Reliable forecasting of Global Horizontal Irradiance (GHI) is essential for mitigating the variability of solar energy in power grids. This study presents a comprehensive benchmark of ten deep learning architectures for short-term (1-hour ahead) GHI time series forecasting in Ho Chi Minh City, leveraging high-resolution NSRDB satellite data (2011-2020) to compare established baselines (e.g. LSTM, TCN) against emerging state-of-the-art architectures, including Transformer, Informer, iTransformer, TSMixer, and Mamba. Experimental results identify the Transformer as the superior architecture, achieving the highest predictive accuracy with an R^2 of 0.9696. The study further utilizes SHAP analysis to contrast the temporal reasoning of these architectures, revealing that Transformers exhibit a strong “recency bias” focused on immediate atmospheric conditions, whereas Mamba explicitly leverages 24-hour periodic dependencies to inform predictions. Furthermore, we demonstrate that Knowledge Distillation can compress the high-performance Transformer by 23.5% while surprisingly reducing error (MAE: 23.78 W/m^2), offering a proven pathway for deploying sophisticated, low-latency forecasting on resource-constrained edge devices.
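The distillation step that compresses the Transformer can be sketched as a weighted regression loss in which the student fits both the ground truth and the teacher's predictions; the weighting alpha is an assumed hyperparameter, not the paper's reported setting:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Regression-style knowledge distillation: the student matches the
    ground-truth GHI and, softly, the teacher's forecast."""
    hard = F.mse_loss(student_pred, target)          # fit the measurements
    soft = F.mse_loss(student_pred, teacher_pred)    # imitate the teacher
    return alpha * hard + (1.0 - alpha) * soft

student_pred = torch.randn(32, 1)
teacher_pred = torch.randn(32, 1)
target = torch.randn(32, 1)
loss = distillation_loss(student_pred, teacher_pred, target)
```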
[386] Rethinking Dense Linear Transformations: Stagewise Pairwise Mixing (SPM) for Near-Linear Training in Neural Networks
Peter Farag
Main category: cs.LG
TL;DR: SPM replaces dense linear layers with sparse pairwise-mixing stages, achieving O(nL) complexity instead of O(n²), with L typically constant or log₂n, while maintaining exact computations and improving generalization on structured tasks.
Details
Motivation: Dense linear layers dominate computational and parametric costs in ML models due to quadratic complexity, and they often don't align well with the compositional structure of learned representations.
Method: Stagewise Pairwise Mixers (SPM) replace dense matrices with compositions of sparse pairwise-mixing stages. Two parameterizations are derived: orthogonal norm-preserving rotation-based variant and fully general 2×2 mixing variant, both with exact closed-form forward/backward computations.
Result: Proof-of-concept experiments show substantial reductions in wall-clock cost and improved accuracy on structured learning problems, while maintaining competitive performance on real-world benchmarks.
Conclusion: SPM provides an efficient drop-in replacement for dense linear layers that not only reduces computational costs but also introduces beneficial compositional inductive bias for better generalization on structured tasks.
Abstract: Dense linear layers are a dominant source of computational and parametric cost in modern machine learning models, despite their quadratic complexity and often being misaligned with the compositional structure of learned representations. We introduce Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces dense matrices with a composition of sparse pairwise-mixing stages. An SPM layer implements a global linear transformation in $O(nL)$ time with $O(nL)$ parameters, where $L$ is typically constant or $\log_2 n$, and admits exact closed-form forward and backward computations. SPM is designed as a drop-in replacement for dense linear layers in feedforward networks, recurrent architectures, attention mechanisms, etc. We derive complete forward and backward expressions for two parameterizations: an orthogonal norm-preserving rotation-based variant and a fully general $2 \times 2$ mixing variant. Beyond computational savings, the stagewise structure of SPM induces an explicit compositional inductive bias that constrains model capacity and improves generalization when aligned with task structure. We present proof-of-concept experiments demonstrating substantial reductions in wall-clock cost and improved accuracy on structured learning problems, while retaining competitive performance on real-world benchmarks.
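To make "sparse pairwise-mixing stages" concrete, here is a butterfly-style sketch: each stage applies an independent learned 2×2 matrix to disjoint coordinate pairs, and stacking $L = \log_2 n$ stages with doubling strides yields a global linear map in $O(nL)$ work. The pairing schedule and initialization are assumptions; the paper's parameterizations may differ.

```python
import torch
import torch.nn as nn

class PairwiseMixStage(nn.Module):
    """One SPM-style stage: mix disjoint coordinate pairs (i, i + stride)
    with independent learned 2x2 blocks. Sketch of the general variant;
    the paper's pairing schedule and parameterization may differ."""
    def __init__(self, n, stride):
        super().__init__()
        assert n % (2 * stride) == 0
        self.stride = stride
        # One 2x2 matrix per pair, initialized near the identity.
        self.mix = nn.Parameter(0.1 * torch.randn(n // 2, 2, 2) + torch.eye(2))

    def forward(self, x):                      # x: (batch, n)
        b, n = x.shape
        s = self.stride
        x = x.view(b, n // (2 * s), 2, s)      # blocks of 2*stride coordinates
        pairs = x.permute(0, 1, 3, 2).reshape(b, n // 2, 2)
        mixed = torch.einsum('bpi,pij->bpj', pairs, self.mix)
        out = mixed.view(b, n // (2 * s), s, 2).permute(0, 1, 3, 2)
        return out.reshape(b, n)

# L = log2(n) stages with doubling strides give a global map in O(nL) work.
n = 8
layer = nn.Sequential(*[PairwiseMixStage(n, 2 ** k) for k in range(3)])
print(layer(torch.randn(4, n)).shape)          # torch.Size([4, 8])
```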
[387] Constraint Breeds Generalization: Temporal Dynamics as an Inductive Bias
Xia Chen
Main category: cs.LG
TL;DR: The paper argues that physical constraints in biological systems act as temporal inductive biases that promote generalization, not limitations. It shows dissipative dynamics compress phase space to extract invariant features, and demonstrates a critical transition regime maximizes generalization across tasks.
Details
Motivation: Conventional deep learning focuses on unconstrained optimization, but biological systems operate under strict metabolic constraints. The authors propose these constraints actually serve as temporal inductive biases that breed generalization rather than being limitations.Method: Phase-space analysis of signal propagation reveals expansive dynamics amplify noise while dissipative dynamics compress phase space. This can be imposed externally via input encoding or intrinsically through network temporal dynamics. The approach requires architectures capable of temporal integration with proper constraints to decode induced invariants.
Result: Comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning demonstrate that a critical “transition” regime maximizes generalization capability. Static architectures fail to capitalize on temporal structure.
Conclusion: Dynamical constraints represent a distinct class of inductive bias. Robust AI development requires not just scaling and removing limitations, but computationally mastering temporal characteristics that naturally promote generalization.
Abstract: Conventional deep learning prioritizes unconstrained optimization, yet biological systems operate under strict metabolic constraints. We propose that these physical constraints shape dynamics to function not as limitations, but as a temporal inductive bias that breeds generalization. Through a phase-space analysis of signal propagation, we reveal a fundamental asymmetry: expansive dynamics amplify noise, whereas proper dissipative dynamics compress phase space that aligns with the network’s spectral bias, compelling the abstraction of invariant features. This condition can be imposed externally via input encoding, or intrinsically through the network’s own temporal dynamics. Both pathways require architectures capable of temporal integration and proper constraints to decode induced invariants, whereas static architectures fail to capitalize on temporal structure. Through comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning, we demonstrate that a critical “transition” regime maximizes generalization capability. These findings establish dynamical constraints as a distinct class of inductive bias, suggesting that robust AI development requires not only scaling and removing limitations, but computationally mastering the temporal characteristics that naturally promote generalization.
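A toy numeric illustration of the phase-space asymmetry described above: a contraction factor below one keeps injected noise bounded, while an expansive factor amplifies it. This is a hedged stand-in for the paper's analysis, not its experimental setup.

```python
import numpy as np

def propagate(signal, lam, noise_scale=0.01, seed=0):
    """Iterate x_{t+1} = lam * x_t + u_t + noise. |lam| < 1 is dissipative
    (phase-space contraction); |lam| > 1 is expansive and amplifies noise."""
    rng = np.random.default_rng(seed)
    x = 0.0
    for u in signal:
        x = lam * x + u + rng.normal(scale=noise_scale)
    return x

u = np.sin(np.linspace(0, 8 * np.pi, 200))
print(propagate(u, lam=0.9), propagate(u, lam=1.05))   # bounded vs. blown up
```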
[388] Interactive Machine Learning: From Theory to Scale
Yinglun Zhu
Main category: cs.LG
TL;DR: This dissertation advances interactive machine learning theory with efficient algorithms for active learning, contextual bandits with large action spaces, and model selection under partial feedback.
Details
Motivation: Traditional ML methods require large labeled datasets or extensive trial-and-error, which is expensive, time-consuming, or risky in large-scale/high-stakes settings. Interactive learning offers a solution by actively guiding information collection.Method: Develops new algorithmic principles for interactive learning across three dimensions: 1) active learning with noisy data and rich model classes, 2) sequential decision making with large action spaces, and 3) model selection under partial feedback.
Result: First computationally efficient active learning algorithms achieving exponential label savings without low-noise assumptions; first efficient general-purpose contextual bandit algorithms with guarantees independent of action space size; first tight characterizations of fundamental cost of model selection in sequential decision making.
Conclusion: Advances theoretical foundations of interactive learning with statistically optimal and computationally efficient algorithms, providing principled guidance for deploying interactive methods in large-scale real-world settings.
Abstract: Machine learning has achieved remarkable success across a wide range of applications, yet many of its most effective methods rely on access to large amounts of labeled data or extensive online interaction. In practice, acquiring high-quality labels and making decisions through trial-and-error can be expensive, time-consuming, or risky, particularly in large-scale or high-stakes settings. This dissertation studies interactive machine learning, in which the learner actively influences how information is collected or which actions are taken, using past observations to guide future interactions. We develop new algorithmic principles and establish fundamental limits for interactive learning along three dimensions: active learning with noisy data and rich model classes, sequential decision making with large action spaces, and model selection under partial feedback. Our results include the first computationally efficient active learning algorithms achieving exponential label savings without low-noise assumptions; the first efficient, general-purpose contextual bandit algorithms whose guarantees are independent of the size of the action space; and the first tight characterizations of the fundamental cost of model selection in sequential decision making. Overall, this dissertation advances the theoretical foundations of interactive learning by developing algorithms that are statistically optimal and computationally efficient, while also providing principled guidance for deploying interactive learning methods in large-scale, real-world settings.
[389] Improved Balanced Classification with Theoretically Grounded Loss Functions
Corinna Cortes, Mehryar Mohri, Yutao Zhong
Main category: cs.LG
TL;DR: The paper introduces two advanced surrogate loss families (GLA and GCA) for imbalanced classification, showing GCA has stronger theoretical guarantees with better scaling in highly imbalanced settings.
Details
Motivation: Balanced loss is important for fairness in imbalanced classification but is intractable to minimize directly, creating a need for effective surrogate losses with strong theoretical guarantees.Method: Proposes Generalized Logit-Adjusted (GLA) losses that generalize Logit-Adjusted losses to the general cross-entropy family, and Generalized Class-Aware weighted (GCA) losses that extend class-weighted losses with class-dependent confidence margins.
Result: GLA losses are Bayes-consistent but only H-consistent for unbounded hypothesis sets with bounds scaling as 1/p_min. GCA losses are H-consistent for any hypothesis set with bounds scaling as 1/√p_min, offering stronger guarantees in imbalanced settings.
Conclusion: Both GCA and GLA outperform standard approaches, with GLA performing slightly better in common benchmarks and GCA showing advantage in highly imbalanced settings due to its stronger theoretical scaling properties.
Abstract: The balanced loss is a widely adopted objective for multi-class classification under class imbalance. By assigning equal importance to all classes, regardless of their frequency, it promotes fairness and ensures that minority classes are not overlooked. However, directly minimizing the balanced classification loss is typically intractable, which makes the design of effective surrogate losses a central question. This paper introduces and studies two advanced surrogate loss families: Generalized Logit-Adjusted (GLA) loss functions and Generalized Class-Aware weighted (GCA) losses. GLA losses generalize Logit-Adjusted losses, which shift logits based on class priors, to the broader general cross-entropy loss family. GCA loss functions extend the standard class-weighted losses, which scale losses inversely by class frequency, by incorporating class-dependent confidence margins and extending them to the general cross-entropy family. We present a comprehensive theoretical analysis of consistency for both loss families. We show that GLA losses are Bayes-consistent, but only $H$-consistent for complete (i.e., unbounded) hypothesis sets. Moreover, their $H$-consistency bounds depend inversely on the minimum class probability, scaling at least as $1/\mathsf p_{\min}$. In contrast, GCA losses are $H$-consistent for any hypothesis set that is bounded or complete, with $H$-consistency bounds that scale more favorably as $1/\sqrt{\mathsf p_{\min}}$, offering significantly stronger theoretical guarantees in imbalanced settings. We report the results of experiments demonstrating that, empirically, both the GCA losses with calibrated class-dependent confidence margins and GLA losses can greatly outperform straightforward class-weighted losses as well as the LA losses. GLA generally performs slightly better in common benchmarks, whereas GCA exhibits a slight edge in highly imbalanced settings.
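For orientation, the two baseline families that GLA and GCA respectively generalize fit in a few lines. The generalized versions (general cross-entropy family, class-dependent confidence margins) are not fully specified in the abstract, so this sketch shows only the standard starting points.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_priors, tau=1.0):
    """Plain logit-adjusted cross-entropy (the special case GLA generalizes):
    shift logits by log priors so the minimizer targets the balanced error."""
    adjusted = logits + tau * torch.log(class_priors).unsqueeze(0)
    return F.cross_entropy(adjusted, targets)

def class_weighted_loss(logits, targets, class_priors):
    """Inverse-frequency weighting (the baseline GCA extends with
    class-dependent confidence margins)."""
    weights = 1.0 / class_priors
    return F.cross_entropy(logits, targets, weight=weights / weights.sum())

priors = torch.tensor([0.90, 0.09, 0.01])        # highly imbalanced toy prior
logits, targets = torch.randn(16, 3), torch.randint(0, 3, (16,))
print(logit_adjusted_loss(logits, targets, priors).item())
print(class_weighted_loss(logits, targets, priors).item())
```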
[390] DivQAT: Enhancing Robustness of Quantized Convolutional Neural Networks against Model Extraction Attacks
Kacem Khaled, Felipe Gohring de Magalhães, Gabriela Nicolescu
Main category: cs.LG
TL;DR: DivQAT is a novel Quantization Aware Training algorithm that enhances quantized CNN robustness against model extraction attacks by modifying the quantization process itself.
Details
Motivation: Quantized CNNs are vulnerable to extraction attacks (IP theft), but existing defenses are limited: they're added post-training, computationally expensive, have unrealistic assumptions for edge devices, and don't work well with quantized models.Method: DivQAT modifies the quantization process during Quantization Aware Training to integrate extraction defense directly into model design, rather than adding it as an afterthought.
Result: Empirical validation on benchmark vision datasets shows DivQAT effectively defends against model extraction attacks without compromising model accuracy. It also improves other defense mechanisms when combined with traditional QAT.
Conclusion: DivQAT is the first technique to modify quantization for extraction defense integration during training, offering practical protection for quantized CNNs on edge devices without accuracy loss.
Abstract: Convolutional Neural Networks (CNNs) and their quantized counterparts are vulnerable to extraction attacks, posing a significant threat of IP theft. Yet, the robustness of quantized models against these attacks is little studied compared to large models. Previous defenses propose to inject calculated noise into the prediction probabilities. However, these defenses are limited since they are not incorporated during the model design and are only added as an afterthought after training. Additionally, most defense techniques are computationally expensive and often have unrealistic assumptions about the victim model that are not feasible in edge device implementations and do not apply to quantized models. In this paper, we propose DivQAT, a novel algorithm to train quantized CNNs based on Quantization Aware Training (QAT) aiming to enhance their robustness against extraction attacks. To the best of our knowledge, our technique is the first to modify the quantization process to integrate a model extraction defense into the training process. Through empirical validation on benchmark vision datasets, we demonstrate the efficacy of our technique in defending against model extraction attacks without compromising model accuracy. Furthermore, combining our quantization technique with other defense mechanisms improves their effectiveness compared to traditional QAT.
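DivQAT's modification of quantization is not described in the abstract, but it builds on standard QAT, whose core is fake quantization with a straight-through gradient. A minimal background sketch, with the bit width and scaling scheme assumed:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Standard QAT building block: quantize in the forward pass, pass
    gradients straight through in the backward pass. Background only;
    how DivQAT alters this step is not specified in the abstract."""
    @staticmethod
    def forward(ctx, w, n_bits=8):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax              # per-tensor scale (assumed)
        return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                     # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)
FakeQuant.apply(w).pow(2).sum().backward()
print(w.grad.shape)                               # gradients reach the weights
```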
[391] Physics-informed Graph Neural Networks for Operational Flood Modeling
Carlo Malapad Acosta, Herath Mudiyanselage Viraj Vidura Herath, Jia Yu Lim, Abhishek Saha, Sanka Rasnayaka, Lucy Marshall
Main category: cs.LG
TL;DR: DUALFloodGNN: A novel graph neural network architecture for flood modeling that embeds physical constraints at global and local scales, jointly predicts water volume and flow, and uses multi-step training with curriculum learning for improved autoregressive inference.
Details
Motivation: Physics-based flood models are accurate but computationally expensive, limiting their use in operational settings where rapid predictions are needed. While GNNs offer speed and accuracy with unstructured spatial processing, there's a need to better incorporate physical constraints and improve interpretability in flood prediction models.Method: DUALFloodGNN embeds physical constraints through explicit loss terms at both global and local scales. It uses a shared message-passing framework to jointly predict water volume at nodes and flow along edges. Training employs multi-step loss enhanced with dynamic curriculum learning to improve autoregressive inference performance.
Result: DUALFloodGNN achieves substantial improvements over standard GNN architectures and state-of-the-art GNN flood models in predicting multiple hydrologic variables while maintaining high computational efficiency.
Conclusion: The proposed DUALFloodGNN architecture successfully combines physics-informed techniques with GNNs for flood modeling, providing both speed and accuracy with improved interpretability through embedded physical constraints. The model is open-sourced for community use.
Abstract: Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given its flexible input and architecture, GNNs can be leveraged alongside physics-informed techniques with ease, significantly improving interpretability. This study introduces a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables while maintaining high computational efficiency. The model is open-sourced at https://github.com/acostacos/dual_flood_gnn.
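The abstract names explicit physics loss terms at local and global scales without giving their form; one plausible instantiation for a model that predicts node volumes and edge flows is a mass-balance residual, sketched below. The constraint choice, signs, and boundary handling are assumptions.

```python
import torch

def local_mass_balance_loss(v_next, v_prev, flow, edge_index, dt=1.0):
    """Penalize violations of per-node mass balance: the change in stored
    volume at node i should equal the net signed flow into i over dt."""
    n = v_prev.shape[0]
    src, dst = edge_index                   # flow is signed along src -> dst
    net_in = torch.zeros(n).index_add(0, dst, flow * dt)
    net_out = torch.zeros(n).index_add(0, src, flow * dt)
    return ((v_next - v_prev) - (net_in - net_out)).pow(2).mean()

def global_mass_loss(v_next, v_prev):
    """Global counterpart: total stored volume is conserved (boundary
    inflows/outflows ignored in this toy)."""
    return (v_next.sum() - v_prev.sum()).pow(2)

edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])   # toy 3-node cycle
v_prev, v_next, flow = torch.rand(3), torch.rand(3), torch.randn(3)
print(local_mass_balance_loss(v_next, v_prev, flow, edge_index),
      global_mass_loss(v_next, v_prev))
```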
[392] Causify DataFlow: A Framework For High-performance Machine Learning Stream Computing
Giacinto Paolo Saggese, Paul Smith
Main category: cs.LG
TL;DR: DataFlow is a computational framework for building ML systems on unbounded time-series data that ensures identical execution between batch development and streaming production without code changes.
Details
Motivation: Traditional data science workflows assume finite datasets and require substantial reimplementation when moving from batch prototypes to streaming production, leading to causality violations, batch boundary artifacts, and poor reproducibility of real-time failures.Method: Uses a unified execution model based on directed acyclic graphs (DAGs) with point-in-time idempotency, where outputs at time t depend only on a fixed-length context window preceding t. Enforces strict causality by automatically tracking knowledge time across transformations and supports flexible tiling across temporal and feature dimensions.
Result: Demonstrated effectiveness across domains including financial trading, IoT, fraud detection, and real-time analytics. Provides identical execution between batch and streaming modes without code changes.
Conclusion: DataFlow resolves the gap between batch ML development and streaming production by providing a unified framework that eliminates causality violations and ensures reproducibility while maintaining compatibility with the Python data science stack.
Abstract: We present DataFlow, a computational framework for building, testing, and deploying high-performance machine learning systems on unbounded time-series data. Traditional data science workflows assume finite datasets and require substantial reimplementation when moving from batch prototypes to streaming production systems. This gap introduces causality violations, batch boundary artifacts, and poor reproducibility of real-time failures. DataFlow resolves these issues through a unified execution model based on directed acyclic graphs (DAGs) with point-in-time idempotency: outputs at any time t depend only on a fixed-length context window preceding t. This guarantee ensures that models developed in batch mode execute identically in streaming production without code changes. The framework enforces strict causality by automatically tracking knowledge time across all transformations, eliminating future-peeking bugs. DataFlow supports flexible tiling across temporal and feature dimensions, allowing the same model to operate at different frequencies and memory profiles via configuration alone. It integrates natively with the Python data science stack and provides fit/predict semantics for online learning, caching and incremental computation, and automatic parallelization through DAG-based scheduling. We demonstrate its effectiveness across domains including financial trading, IoT, fraud detection, and real-time analytics.
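The framework's API is not shown in the paper, but its central guarantee (outputs at time t depend only on a fixed-length window strictly before t, so batch and streaming runs coincide) is easy to illustrate with a toy node. The class name and pandas-based implementation are invented for illustration.

```python
import pandas as pd

class CausalSMA:
    """A toy DataFlow-style node: the output at time t uses only the
    fixed-length window strictly before t, so a one-shot batch run and an
    incremental streaming run agree exactly. Illustrative API only."""
    def __init__(self, window):
        self.window = window

    def fit_predict(self, series: pd.Series) -> pd.Series:
        # shift(1) excludes the value at t itself: strictly causal.
        return series.shift(1).rolling(self.window).mean()

idx = pd.date_range("2024-01-01", periods=6, freq="h")
x = pd.Series(range(6), index=idx, dtype=float)

node = CausalSMA(window=3)
batch = node.fit_predict(x)                       # one-shot "research" run
stream = pd.concat(
    [node.fit_predict(x.iloc[: t + 1]).iloc[[-1]] for t in range(len(x))]
)
assert batch.equals(stream)                       # batch == streaming, by design
```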
[393] Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems
Tinglong Dai, David Simchi-Levi, Michelle Xiao Wu, Yao Xie
Main category: cs.LG
TL;DR: GenAI is shifting from conversational assistants to autonomous agentic systems, creating an autonomy paradox where greater autonomy requires more formal structure and constraints. The paper proposes an operations research framework for assured autonomy using flow-based generative models and adversarial robustness approaches.
Details
Motivation: The shift of Generative AI from conversational assistants to autonomous agentic systems creates an autonomy paradox: as systems gain more operational autonomy, they paradoxically need more formal structure, explicit constraints, and stronger risk discipline to ensure safety and reliability in operational domains.Method: Develops a conceptual framework for assured autonomy grounded in operations research with two complementary approaches: 1) Flow-based generative models that frame generation as deterministic transport characterized by ordinary differential equations, enabling auditability and constraint-aware generation; 2) Operational safety formulated through adversarial robustness lens where decision rules are evaluated against worst-case perturbations within uncertainty sets.
Result: The framework clarifies how increasing autonomy shifts operations research’s role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. This defines a research agenda for assured autonomy in safety-critical domains.
Conclusion: Stochastic generative models can be fragile in operational domains unless paired with mechanisms providing verifiable feasibility, robustness to distribution shift, and stress testing. The proposed operations research framework addresses these challenges by combining flow-based generative models with adversarial robustness approaches to ensure assured autonomy in safety-critical operational domains.
Abstract: Generative artificial intelligence (GenAI) is shifting from conversational assistants toward agentic systems – autonomous decision-making systems that sense, decide, and act within operational workflows. This shift creates an autonomy paradox: as GenAI systems are granted greater operational autonomy, they should, by design, embody more formal structure, more explicit constraints, and stronger tail-risk discipline. We argue stochastic generative models can be fragile in operational domains unless paired with mechanisms that provide verifiable feasibility, robustness to distribution shift, and stress testing under high-consequence scenarios. To address this challenge, we develop a conceptual framework for assured autonomy grounded in operations research (OR), built on two complementary approaches. First, flow-based generative models frame generation as deterministic transport characterized by an ordinary differential equation, enabling auditability, constraint-aware generation, and connections to optimal transport, robust optimization, and sequential decision control. Second, operational safety is formulated through an adversarial robustness lens: decision rules are evaluated against worst-case perturbations within uncertainty or ambiguity sets, making unmodeled risks part of the design. This framework clarifies how increasing autonomy shifts OR’s role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.
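The "generation as deterministic transport" view admits a compact sketch: integrate dx/dt = v(x, t) from a seed sample, so the whole trajectory is an auditable, reproducible function of that sample. The velocity field below is a toy; real flow models learn v.

```python
import torch

def integrate_flow(v, x0, steps=100):
    """Explicit-Euler integration of dx/dt = v(x, t) from t=0 to t=1:
    generation as deterministic, replayable transport of x0."""
    x, dt = x0.clone(), 1.0 / steps
    for k in range(steps):
        x = x + dt * v(x, k * dt)
    return x

v = lambda x, t: -x + torch.tensor([2.0, 0.0])   # toy velocity field
x0 = torch.randn(5, 2)
print(integrate_flow(v, x0))                      # samples pulled toward (2, 0)
```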
[394] Information-Theoretic Quality Metric of Low-Dimensional Embeddings
Sebastián Gutiérrez-Bernal, Hector Medel Cobaxin, Abiel Galindo González
Main category: cs.LG
TL;DR: The paper introduces ERPM, an information-theoretic metric for evaluating low-dimensional embeddings that measures information preservation through Shannon entropy of neighborhood singular-value spectra, complementing existing distance-based and geometric metrics.
Details
Motivation: Classical embedding evaluation metrics (stress, rank-based criteria, Local Procrustes) only measure distance or geometric distortions, not how much information is preserved when projecting high-dimensional data to lower dimensions. There's a need for metrics that directly assess information preservation.Method: Introduces the Entropy Rank Preservation Measure (ERPM) - a local metric based on the Shannon entropy of the singular-value spectrum of neighborhood matrices and on the stable rank. It quantifies changes in uncertainty between original and projected representations, providing both neighborhood-level indicators and global summary statistics.
Result: Distance-based metrics (MRRE) show very low correlation with geometric (Local Procrustes) and spectral (ERPM) measures. ERPM and Local Procrustes show strong average correlation but significant local discrepancies. ERPM identifies neighborhoods with severe information loss that other metrics miss.
Conclusion: ERPM complements existing embedding evaluation metrics by providing an information-theoretic perspective, enabling more comprehensive assessment of embeddings. It’s particularly valuable for information-sensitive applications like early-warning indicator construction where preserving information is critical.
Abstract: In this work we study the quality of low-dimensional embeddings from an explicitly information-theoretic perspective. We begin by noting that classical evaluation metrics such as stress, rank-based neighborhood criteria, or Local Procrustes quantify distortions in distances or in local geometries, but do not directly assess how much information is preserved when projecting high-dimensional data onto a lower-dimensional space. To address this limitation, we introduce the Entropy Rank Preservation Measure (ERPM), a local metric based on the Shannon entropy of the singular-value spectrum of neighborhood matrices and on the stable rank, which quantifies changes in uncertainty between the original representation and its reduced projection, providing neighborhood-level indicators and a global summary statistic. To validate the results of the metric, we compare its outcomes with the Mean Relative Rank Error (MRRE), which is distance-based, and with Local Procrustes, which is based on geometric properties, using a financial time series and a manifold commonly studied in the literature. We observe that distance-based criteria exhibit very low correlation with geometric and spectral measures, while ERPM and Local Procrustes show strong average correlation but display significant discrepancies in local regimes, leading to the conclusion that ERPM complements existing metrics by identifying neighborhoods with severe information loss, thereby enabling a more comprehensive assessment of embeddings, particularly in information-sensitive applications such as the construction of early-warning indicators.
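The core quantity is specified precisely enough to sketch: the Shannon entropy of the normalized singular-value spectrum of a neighborhood matrix, compared between the original space and the embedding. The neighborhood construction and aggregation below are assumptions.

```python
import numpy as np

def spectral_entropy(neigh):
    """Shannon entropy of the normalized singular-value spectrum of a
    neighborhood matrix (rows = the neighbors of a point)."""
    s = np.linalg.svd(neigh, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def erpm_deltas(X_high, X_low, neighborhoods):
    """Per-neighborhood change in spectral entropy between the original
    space and the embedding; the paper's exact normalization may differ."""
    return np.array([spectral_entropy(X_low[idx]) - spectral_entropy(X_high[idx])
                     for idx in neighborhoods])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = X[:, :2]                                    # crude 2-D "embedding"
hoods = [rng.choice(100, size=8, replace=False) for _ in range(100)]
d = erpm_deltas(X, Y, hoods)
print(d.mean(), d.min())                        # global summary + worst neighborhood
```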
[395] Tracing the Heart’s Pathways: ECG Representation Learning from a Cardiac Conduction Perspective
Tan Pan, Yixuan Sun, Chen Jiang, Qiong Gao, Rui Sun, Xingmeng Zhang, Zhenqi Yang, Limei Han, Yixiu Liang, Yuan Cheng, Kaiyu Guo
Main category: cs.LG
TL;DR: CLEAR-HUG is a two-stage ECG self-supervised learning framework that captures subtle cardiac conduction variations across leads while following clinical diagnostic workflows, achieving 6.84% performance improvement across six tasks.
Details
Motivation: Existing ECG self-supervised learning methods focus on consistent patterns but overlook inherent heartbeat differences from cardiac conduction processes, and fail to align with clinical diagnostic guidelines that progress from individual heartbeats to lead combinations.Method: Two-stage framework: 1) CLEAR (Conduction-LEAd Reconstructor) - an eSSL model using sparse attention to reconstruct signals while treating each heartbeat as distinct entity, capturing specific variations and general commonalities. 2) HUG (Hierarchical lead-Unified Group head) - a diagnostic module mirroring clinical workflow from heartbeats to leads to lead combinations.
Result: Experimental results across six tasks show a 6.84% improvement, validating the framework’s effectiveness in enhancing cardiac conduction representations and aligning patterns with expert diagnostic guidelines.
Conclusion: CLEAR-HUG successfully addresses limitations of previous eSSL methods by capturing subtle cardiac conduction variations and following clinical diagnostic workflows, demonstrating significant performance improvements in ECG analysis tasks.
Abstract: The multi-lead electrocardiogram (ECG) stands as a cornerstone of cardiac diagnosis. Recent strides in electrocardiogram self-supervised learning (eSSL) have brightened prospects for enhancing representation learning without relying on high-quality annotations. Yet earlier eSSL methods suffer a key limitation: they focus on consistent patterns across leads and beats, overlooking the inherent differences in heartbeats rooted in cardiac conduction processes, while subtle but significant variations carry unique physiological signatures. Moreover, representation learning for ECG analysis should align with ECG diagnostic guidelines, which progress from individual heartbeats to single leads and ultimately to lead combinations. This sequential logic, however, is often neglected when applying pre-trained models to downstream tasks. To address these gaps, we propose CLEAR-HUG, a two-stage framework designed to capture subtle variations in cardiac conduction across leads while adhering to ECG diagnostic guidelines. In the first stage, we introduce an eSSL model termed Conduction-LEAd Reconstructor (CLEAR), which captures both specific variations and general commonalities across heartbeats. Treating each heartbeat as a distinct entity, CLEAR employs a simple yet effective sparse attention mechanism to reconstruct signals without interference from other heartbeats. In the second stage, we implement a Hierarchical lead-Unified Group head (HUG) for disease diagnosis, mirroring clinical workflow. Experimental results across six tasks show a 6.84% improvement, validating the effectiveness of CLEAR-HUG. This highlights its ability to enhance representations of cardiac conduction and align patterns with expert diagnostic guidelines.
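"Without interference from other heartbeats" suggests attention restricted to within-beat token groups; one simple reading is a block mask, sketched below. CLEAR's actual sparsity pattern is not given in the abstract.

```python
import torch

def heartbeat_attention_mask(beat_ids):
    """Boolean mask allowing attention only within the same heartbeat;
    one plausible reading of CLEAR's sparse attention (details assumed)."""
    return beat_ids[:, None] == beat_ids[None, :]

beat_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])   # token -> heartbeat index
mask = heartbeat_attention_mask(beat_ids)
scores = torch.randn(8, 8).masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)                        # no cross-beat interference
print(attn[0])                                       # weight only on beat-0 tokens
```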
[396] Hyperspherical Graph Representation Learning via Adaptive Neighbor-Mean Alignment and Uniformity
Rui Chen, Junjun Guo, Hongbin Wang, Yan Xiang, Yantuan Xian, Zhengtao Yu
Main category: cs.LG
TL;DR: HyperGRL is a hyperspherical graph representation learning framework using adaptive neighbor-mean alignment and sampling-free uniformity for stable, high-quality embeddings without complex negative sampling.
Details
Motivation: Existing graph representation learning methods rely on surrogate contrastive objectives or mutual information maximization, requiring complex architectures, negative sampling strategies, and sensitive hyperparameter tuning, which can cause over-smoothing, over-squashing, and training instability.Method: HyperGRL embeds nodes on a unit hypersphere using two adversarially coupled objectives: neighbor-mean alignment (using mean representation of local neighborhoods as stable targets) and sampling-free uniformity (L2-based hyperspherical regularization for uniform distribution). Includes entropy-guided adaptive balancing mechanism for dynamic regulation.
Result: Extensive experiments on node classification, node clustering, and link prediction show HyperGRL delivers superior representation quality and generalization, achieving average improvements of 1.49%, 0.86%, and 0.74% over strongest existing methods respectively.
Conclusion: HyperGRL demonstrates effectiveness of geometrically grounded, sampling-free contrastive objectives for graph representation learning, providing stable training and high-quality embeddings without complex negative sampling strategies.
Abstract: Graph representation learning (GRL) aims to encode structural and semantic dependencies of graph-structured data into low-dimensional embeddings. However, existing GRL methods often rely on surrogate contrastive objectives or mutual information maximization, which typically demand complex architectures, negative sampling strategies, and sensitive hyperparameter tuning. These design choices may induce over-smoothing, over-squashing, and training instability. In this work, we propose HyperGRL, a unified framework for hyperspherical graph representation learning via adaptive neighbor-mean alignment and sampling-free uniformity. HyperGRL embeds nodes on a unit hypersphere through two adversarially coupled objectives: neighbor-mean alignment and sampling-free uniformity. The alignment objective uses the mean representation of each node’s local neighborhood to construct semantically grounded, stable targets that capture shared structural and feature patterns. The uniformity objective formulates dispersion via an L2-based hyperspherical regularization, encouraging globally uniform embedding distributions while preserving discriminative information. To further stabilize training, we introduce an entropy-guided adaptive balancing mechanism that dynamically regulates the interplay between alignment and uniformity without requiring manual tuning. Extensive experiments on node classification, node clustering, and link prediction demonstrate that HyperGRL delivers superior representation quality and generalization across diverse graph structures, achieving average improvements of 1.49%, 0.86%, and 0.74% over the strongest existing methods, respectively. These findings highlight the effectiveness of geometrically grounded, sampling-free contrastive objectives for graph representation learning.
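Both objectives have direct hyperspherical forms; a minimal sketch using a dense adjacency and a simple pairwise-distance surrogate for the L2 uniformity term. The paper's exact regularizer and entropy-guided balancing mechanism are not reproduced here.

```python
import torch
import torch.nn.functional as F

def neighbor_mean_alignment(z, adj):
    """Pull each hyperspherical embedding toward the (renormalized) mean of
    its neighbors. Assumes z is L2-normalized and adj is a dense 0/1 matrix."""
    target = F.normalize(adj @ z / adj.sum(1, keepdim=True).clamp(min=1), dim=1)
    return (1 - (z * target).sum(1)).mean()        # 1 - cosine similarity

def l2_uniformity(z):
    """Encourage globally uniform spread on the sphere via mean pairwise
    squared distance (one simple L2-based surrogate)."""
    return -torch.pdist(z).pow(2).mean()

z = F.normalize(torch.randn(32, 16), dim=1)
adj = (torch.rand(32, 32) < 0.1).float()
loss = neighbor_mean_alignment(z, adj) + 0.1 * l2_uniformity(z)
print(loss.item())
```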
[397] How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns
Haoyue Bai, Yiyou Sun, Wenjie Hu, Shi Qiu, Maggie Ziyu Huan, Peiyang Song, Robert Nowak, Dawn Song
Main category: cs.LG
TL;DR: This paper introduces a novel benchmark to analyze why SFT narrows LLM capabilities while RL preserves them, by decomposing reasoning into atomic skills and tracking behavioral changes during training.
Details
Motivation: LLMs show divergent generalization behaviors: SFT often narrows capabilities while RL tuning tends to preserve them. Prior studies relied on coarse accuracy metrics, leaving the reasons behind this divergence unclear.Method: Introduces a novel benchmark that decomposes reasoning into atomic core skills (calculation, fact retrieval, simulation, enumeration, diagnostic). Uses meta-probing framework to track model behavior at different training stages and analyzes low-level statistical patterns like distributional divergence and parameter statistics.
Result: RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, while SFT models exhibit sharper drift and overfit to surface patterns. The benchmark provides granular insights into how specific cognitive abilities emerge, transfer, and sometimes collapse during post-training.
Conclusion: This work provides new insights into the nature of reasoning in LLMs and points toward principles for designing training strategies that foster broad, robust generalization, moving beyond coarse accuracy metrics to understand fundamental reasoning components.
Abstract: Large Language Models (LLMs) display strikingly different generalization behaviors: supervised fine-tuning (SFT) often narrows capability, whereas reinforcement-learning (RL) tuning tends to preserve it. The reasons behind this divergence remain unclear, as prior studies have largely relied on coarse accuracy metrics. We address this gap by introducing a novel benchmark that decomposes reasoning into atomic core skills such as calculation, fact retrieval, simulation, enumeration, and diagnostic, providing a concrete framework for addressing the fundamental question of what constitutes reasoning in LLMs. By isolating and measuring these core skills, the benchmark offers a more granular view of how specific cognitive abilities emerge, transfer, and sometimes collapse during post-training. Combined with analyses of low-level statistical patterns such as distributional divergence and parameter statistics, it enables a fine-grained study of how generalization evolves under SFT and RL across mathematical, scientific reasoning, and non-reasoning tasks. Our meta-probing framework tracks model behavior at different training stages and reveals that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns. This work provides new insights into the nature of reasoning in LLMs and points toward principles for designing training strategies that foster broad, robust generalization.
[398] Time-varying Mixing Matrix Design for Energy-efficient Decentralized Federated Learning
Xusheng Zhang, Tuan Nguyen, Ting He
Main category: cs.LG
TL;DR: The paper proposes a mixing matrix design framework for decentralized federated learning in wireless networks that minimizes maximum per-node energy consumption by optimizing time-varying communication topologies.
Details
Motivation: Existing mixing matrix designs for decentralized federated learning focus on minimizing communication time but neglect per-node energy consumption, which is critical for energy-constrained wireless devices. There's a gap in optimizing energy efficiency while leveraging the broadcast nature of wireless communications.Method: The authors develop a novel convergence theorem allowing arbitrarily time-varying mixing matrices, then propose a multi-phase design framework that activates time-varying communication topologies under optimized budgets. This trades off per-iteration energy consumption with convergence rate while balancing energy across nodes.
Result: Evaluations based on real data validate that the proposed solution effectively combines the low energy consumption of sparse mixing matrices with the fast convergence of dense mixing matrices, minimizing maximum per-node energy consumption.
Conclusion: The work provides a theoretically-justified mixing matrix design for decentralized federated learning that optimizes energy efficiency in wireless networks, addressing the critical need for energy-constrained devices while maintaining convergence performance.
Abstract: We consider the design of mixing matrices to minimize the operation cost for decentralized federated learning (DFL) in wireless networks, with focus on minimizing the maximum per-node energy consumption. As a critical hyperparameter for DFL, the mixing matrix controls both the convergence rate and the needs of agent-to-agent communications, and has thus been studied extensively. However, existing designs mostly focused on minimizing the communication time, leaving open the minimization of per-node energy consumption that is critical for energy-constrained devices. This work addresses this gap through a theoretically-justified solution for mixing matrix design that aims at minimizing the maximum per-node energy consumption until convergence, while taking into account the broadcast nature of wireless communications. Based on a novel convergence theorem that allows arbitrarily time-varying mixing matrices, we propose a multi-phase design framework that activates time-varying communication topologies under optimized budgets to trade off the per-iteration energy consumption and the convergence rate while balancing the energy consumption across nodes. Our evaluations based on real data have validated the efficacy of the proposed solution in combining the low energy consumption of sparse mixing matrices and the fast convergence of dense mixing matrices.
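To make the design object concrete: each DFL round applies a doubly stochastic mixing matrix W to the stacked node models, and the sparsity of W drives per-node communication energy while its spectral gap drives convergence. Below is a generic round under a fixed ring topology; the paper optimizes time-varying W_t, so this shows only the primitive being scheduled.

```python
import numpy as np

def dfl_round(models, W, grads, lr=0.1):
    """One decentralized FL round: local gradient step, then mixing x <- W x.
    Sparser W costs less per-node energy; denser W mixes (converges) faster."""
    models = models - lr * grads          # local update at every node
    return W @ models                     # consensus step via mixing matrix

n = 4
ring = np.array([[0.50, 0.25, 0.00, 0.25],
                 [0.25, 0.50, 0.25, 0.00],
                 [0.00, 0.25, 0.50, 0.25],
                 [0.25, 0.00, 0.25, 0.50]])   # doubly stochastic, sparse
x = np.random.default_rng(0).normal(size=(n, 3))
print(x.mean(axis=0))
x = dfl_round(x, ring, grads=np.zeros_like(x))
print(x.mean(axis=0))                          # unchanged: W is doubly stochastic
```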
[399] Multi-Scenario Highway Lane-Change Intention Prediction: A Temporal Physics-Informed Multi-Modal Framework
Jiazhao Shi, Ziyu Wang, Yichen Lin, Shoufeng Lu
Main category: cs.LG
TL;DR: TPI-AI: A hybrid framework combining deep temporal representations with physics-inspired interaction features for robust lane-change intention prediction across heterogeneous highway scenarios, achieving state-of-the-art performance on drone-based datasets.
Details
Motivation: Lane-change intention prediction is safety-critical for autonomous driving but remains challenging due to noisy kinematics, severe class imbalance, and limited generalization across different highway scenarios (straight highways vs. ramp-rich environments).Method: Temporal Physics-Informed AI (TPI-AI) fuses a two-layer bidirectional LSTM encoder (for temporal trajectory embeddings) with physics-inspired interaction features (headway, TTC, safe-gap indicators). These combined features train a LightGBM classifier for three-class intention recognition (No-LC, Left-LC, Right-LC), with imbalance-aware optimization including resampling/weighting and fold-wise threshold calibration.
Result: Outperforms standalone LightGBM and Bi-LSTM baselines on highD (straight highways) and exiD (ramp-rich environments) datasets. Achieves macro-F1 scores: highD - 0.9562, 0.9124, 0.8345; exiD - 0.9247, 0.8197, 0.7605 at T = 1, 2, 3 seconds prediction horizons respectively.
Conclusion: Combining physics-informed interaction features with learned temporal embeddings yields robust multi-scenario lane-change intention prediction, demonstrating strong generalization across heterogeneous highway environments.
Abstract: Lane-change intention prediction is safety-critical for autonomous driving and ADAS, but remains difficult in naturalistic traffic due to noisy kinematics, severe class imbalance, and limited generalization across heterogeneous highway scenarios. We propose Temporal Physics-Informed AI (TPI-AI), a hybrid framework that fuses deep temporal representations with physics-inspired interaction cues. A two-layer bidirectional LSTM (Bi-LSTM) encoder learns compact embeddings from multi-step trajectory histories; we concatenate these embeddings with kinematics-, safety-, and interaction-aware features (e.g., headway, TTC, and safe-gap indicators) and train a LightGBM classifier for three-class intention recognition (No-LC, Left-LC, Right-LC). To improve minority-class reliability, we apply imbalance-aware optimization including resampling/weighting and fold-wise threshold calibration. Experiments on two large-scale drone-based datasets, highD (straight highways) and exiD (ramp-rich environments), use location-based splits and evaluate prediction horizons T = 1, 2, 3 s. TPI-AI outperforms standalone LightGBM and Bi-LSTM baselines, achieving macro-F1 of 0.9562, 0.9124, 0.8345 on highD and 0.9247, 0.8197, 0.7605 on exiD at T = 1, 2, 3 s, respectively. These results show that combining physics-informed interaction features with learned temporal embeddings yields robust multi-scenario lane-change intention prediction.
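The fusion step is concrete enough to sketch end to end: a Bi-LSTM embedding concatenated with physics features feeding a LightGBM classifier. Shapes, feature choices, and the untrained encoder are placeholders, and the paper's resampling and threshold calibration are omitted.

```python
import numpy as np
import torch
import torch.nn as nn
from lightgbm import LGBMClassifier

class TrajEncoder(nn.Module):
    """Two-layer Bi-LSTM trajectory encoder (untrained here, for shape only)."""
    def __init__(self, in_dim=6, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):                        # x: (batch, steps, channels)
        out, _ = self.lstm(x)
        return out[:, -1, :]                     # final step, both directions

encoder = TrajEncoder().eval()
traj = torch.randn(256, 30, 6)                   # 30 past steps of kinematics
with torch.no_grad():
    emb = encoder(traj).numpy()                  # learned temporal embedding

physics = np.random.rand(256, 3)                 # e.g., headway, TTC, safe gap
features = np.hstack([emb, physics])             # fuse learned + physics cues
labels = np.random.randint(0, 3, size=256)       # No-LC / Left-LC / Right-LC

clf = LGBMClassifier(class_weight="balanced")    # one imbalance-aware option
clf.fit(features, labels)
print(clf.predict(features[:5]))
```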
[400] Autoregressivity in the Latent Space of a GP-VAE Language Model: An Empirical Ablation Study
Yves Ruffenach
Main category: cs.LG
TL;DR: Ablation study shows latent autoregression in GP-VAE improves latent trajectory compatibility with Gaussian process prior and long-horizon stability compared to non-autoregressive variants.
Details
Motivation: To systematically analyze the role of latent autoregression in GP-VAE models, comparing it against non-autoregressive latent variables and standard token-level autoregressive Transformers.Method: Conducted ablation study comparing three models: (1) full GP-VAE with autoregressive latent dynamics, (2) non-autoregressive ablation with independent latent variables, and (3) standard token-level autoregressive Transformer.
Result: Latent autoregression induces latent trajectories more compatible with Gaussian-process prior and exhibits greater long-horizon stability. Removing autoregression degrades latent structure and causes unstable long-range behavior.
Conclusion: Latent autoregression effectively organizes long-range structure and complements token-level autoregressive modeling, but this is an empirical analysis rather than a new architecture proposal.
Abstract: This paper provides an ablation-based analysis of latent autoregression in GP-VAE models, building upon our previous work introducing the architecture. Language models typically rely on an autoregressive factorization over tokens. In contrast, our prior work proposed shifting sequential structure to the latent space through a causal Gaussian process, while using a non-autoregressive decoder. Here, we conduct a systematic ablation study of the role played by latent autoregression. We compare (i) a full GP-VAE model with autoregressive latent dynamics, (ii) a non-autoregressive ablation in which latent variables are independent, and (iii) a standard token-level autoregressive Transformer. Our results show that, within the considered regime (medium-scale corpora and short training contexts), latent autoregression induces latent trajectories that are significantly more compatible with the Gaussian-process prior and exhibit greater long-horizon stability. In contrast, removing autoregression leads to degraded latent structure and unstable long-range behavior. These findings highlight the role of latent autoregression as an effective mechanism for organizing long-range structure, while remaining complementary to token-level autoregressive modeling. They should be interpreted as an empirical analysis of representational structure rather than as a proposal for a new architecture.
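Measuring "compatibility with the Gaussian-process prior" amounts to scoring latent trajectories under a temporal GP; a minimal sketch with an assumed RBF kernel follows. The paper's causal GP parameterization is not reproduced.

```python
import numpy as np

def gp_prior(T, lengthscale=5.0):
    """Temporal GP covariance over T steps with an assumed RBF kernel."""
    t = np.arange(T, dtype=float)[:, None]
    return np.exp(-0.5 * ((t - t.T) / lengthscale) ** 2) + 1e-6 * np.eye(T)

def prior_log_density(z, K):
    """log N(z | 0, K) summed over latent dims, for trajectories z: (T, dim);
    the kind of prior-compatibility score the ablation compares."""
    T, dim = z.shape
    _, logdet = np.linalg.slogdet(K)
    quad = np.einsum('td,tu,ud->', z, np.linalg.inv(K), z)
    return -0.5 * (quad + dim * (logdet + T * np.log(2 * np.pi)))

K = gp_prior(T=50)
L = np.linalg.cholesky(K)
z_smooth = L @ np.random.default_rng(0).normal(size=(50, 8))   # prior-like path
z_iid = np.random.default_rng(1).normal(size=(50, 8))          # ablation-like
print(prior_log_density(z_smooth, K), prior_log_density(z_iid, K))
```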
[401] Enhancing LLM Planning Capabilities through Intrinsic Self-Critique
Bernd Bohnet, Pierre-Alexandre Kamienny, Hanie Sedghi, Dilan Gorur, Pranjal Awasthi, Aaron Parisi, Kevin Swersky, Rosanne Liu, Azade Nova, Noah Fiedel
Main category: cs.LG
TL;DR: LLMs can significantly improve planning performance through intrinsic self-critique without external verification, achieving state-of-the-art results on planning benchmarks.
Details
Motivation: Despite previous research questioning the effectiveness of LLM self-critique methods, the authors aim to demonstrate that LLMs can meaningfully critique their own answers to enhance planning performance without relying on external verifiers.Method: The approach uses few-shot learning extended to many-shot as a base method, then employs an iterative process for correction and refinement through intrinsic self-critique. The method focuses on planning domains like Blocksworld, Logistics, and Mini-grid.
Result: The method achieves significant performance gains over established planning benchmarks, exceeding strong baseline accuracies and presenting new state-of-the-art results for LLM model checkpoints from October 2024.
Conclusion: Self-critique can significantly boost LLM planning performance, demonstrating intrinsic self-improvement capabilities that are model-version agnostic, with potential for even better results when applied to more complex search techniques and more capable models.
Abstract: We demonstrate an approach for LLMs to critique their own answers with the goal of enhancing their performance, leading to significant improvements over established planning benchmarks. Despite the findings of earlier research that has cast doubt on the effectiveness of LLMs leveraging self-critique methods, we show significant performance gains on planning datasets in the Blocksworld domain through intrinsic self-critique, without an external source such as a verifier. We also demonstrate similar improvements on Logistics and Mini-grid datasets, exceeding strong baseline accuracies. We employ a few-shot learning technique and progressively extend it to a many-shot approach as our base method and demonstrate that it is possible to gain substantial improvement on top of this already competitive approach by employing an iterative process for correction and refinement. We illustrate how self-critique can significantly boost planning performance. Our empirical results present new state-of-the-art on the class of models considered, namely LLM model checkpoints from October 2024. Our primary focus lies on the method itself, demonstrating intrinsic self-improvement capabilities that are applicable regardless of the specific model version, and we believe that applying our method to more complex search techniques and more capable models will lead to even better performance.
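The method is a prompting loop rather than a training change; below is a schematic of intrinsic critique-and-refine, where `llm` is a hypothetical text-completion callable and all prompt strings are invented for illustration.

```python
def self_critique_plan(llm, task, shots, max_rounds=3):
    """Intrinsic critique-and-refine: the same model drafts, critiques, and
    revises its plan; no external verifier is consulted."""
    plan = llm(f"{shots}\nTask: {task}\nPlan:")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nPlan: {plan}\n"
                       "List any steps that violate the task's constraints, "
                       "or say 'no issues'.")
        if "no issues" in critique.lower():
            break
        plan = llm(f"Task: {task}\nFlawed plan: {plan}\n"
                   f"Critique: {critique}\nRevised plan:")
    return plan

# Stub model so the sketch runs; a real setup would call an actual LLM.
canned = iter(["move A to B", "no issues"])
print(self_critique_plan(lambda prompt: next(canned), "stack A on B", shots=""))
```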
[402] OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization
Advait Gadhikar, Riccardo Grazzi, James Hensman
Main category: cs.LG
TL;DR: OptRot and OptRot+ are rotation-based methods that reduce weight outliers in LLMs to improve quantization performance, with OptRot+ incorporating activation covariance for better results.
Details
Motivation: Large Language Models have outliers in weights and activations that make quantization difficult, and existing rotation methods need improvement in both effectiveness and efficiency.Method: Proposes two methods: OptRot minimizes element-wise fourth power of rotated weights to reduce outliers, and OptRot+ adds activation covariance information. Both use GPTQ as quantization method.
Result: OptRot outperforms Hadamard rotations and more expensive methods like SpinQuant and OSTQuant for weight quantization, and improves W4A8 activation quantization. OptRot+ further improves performance but both methods perform worse in W4A4 setting.
Conclusion: Rotation-based methods can effectively reduce quantization errors, but there’s a trade-off between weight and activation quantization, with different methods optimal for different precision settings.
Abstract: The presence of outliers in Large Language Models (LLMs) weights and activations makes them difficult to quantize. Recent work has leveraged rotations to mitigate these outliers. In this work, we propose methods that learn fusible rotations by minimizing principled and cheap proxy objectives to the weight quantization error. We primarily focus on GPTQ as the quantization method. Our main method is OptRot, which reduces weight outliers simply by minimizing the element-wise fourth power of the rotated weights. We show that OptRot outperforms both Hadamard rotations and more expensive, data-dependent methods like SpinQuant and OSTQuant for weight quantization. It also improves activation quantization in the W4A8 setting. We also propose a data-dependent method, OptRot$^{+}$, that further improves performance by incorporating information on the activation covariance. In the W4A4 setting, we see that both OptRot and OptRot$^{+}$ perform worse, highlighting a trade-off between weight and activation quantization.
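The stated proxy is simple enough to write down: minimize the element-wise fourth power of the rotated weights over orthogonal rotations. A sketch using a skew-symmetric matrix-exponential parameterization to stay on the orthogonal manifold; this is one standard choice, and the paper's optimizer and parameterization may differ.

```python
import torch

def optrot_objective(W, R):
    """Proxy for weight-quantization error: element-wise fourth power of
    the rotated weights (lower kurtosis means fewer outliers)."""
    return (W @ R).pow(4).sum()

d = 16
W = torch.randn(64, d) * torch.rand(d)     # weights with per-column outlier scale
A = torch.zeros(d, d, requires_grad=True)  # skew-symmetric generator
opt = torch.optim.Adam([A], lr=1e-2)

for _ in range(200):
    R = torch.linalg.matrix_exp(A - A.T)   # orthogonal by construction
    loss = optrot_objective(W, R)
    opt.zero_grad()
    loss.backward()
    opt.step()

R = torch.linalg.matrix_exp(A - A.T).detach()
print(W.pow(4).sum().item(), (W @ R).pow(4).sum().item())   # outlier proxy drops
```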
[403] GARDO: Reinforcing Diffusion Models without Reward Hacking
Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
Main category: cs.LG
TL;DR: GARDO is a reinforcement learning framework for fine-tuning diffusion models that addresses reward hacking, exploration challenges, and mode collapse through selective regularization, adaptive reference updates, and diversity-aware optimization.
Details
Motivation: Current RL fine-tuning of diffusion models suffers from reward hacking (proxy scores increase while real quality deteriorates), poor exploration due to sub-optimal reference policies, and mode collapse (loss of generation diversity).Method: GARDO introduces three key components: 1) Selective regularization that penalizes only high-uncertainty samples, 2) Adaptive reference model updates to match online policy capabilities, and 3) Diversity-aware reward amplification for high-quality, diverse samples.
Result: Extensive experiments show GARDO effectively mitigates reward hacking, enhances generation diversity, maintains sample efficiency, and enables better exploration across diverse proxy rewards and unseen evaluation metrics.
Conclusion: GARDO provides a versatile RL framework that successfully addresses the competing demands of preventing reward hacking, enabling effective exploration, and maintaining generation diversity in diffusion model fine-tuning.
Abstract: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.
[404] Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction
Qianyi Chen, Bo Li
Main category: cs.LG
TL;DR: The paper proposes a method to improve conditional coverage in conformal prediction by refining quantile regression with a density-weighted pinball loss, achieving better conditional coverage than standard approaches.
Details
Motivation: While conformal prediction provides marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Exact distribution-free conditional coverage is impossible with finite samples, so the paper aims to improve conditional coverage of standard conformal procedures.Method: The authors derive a density-weighted pinball loss for quantile regression using Taylor expansion, then propose a three-headed quantile network that estimates weights via finite differences using auxiliary quantile levels at 1-α±δ, and fine-tunes the central quantile by optimizing the weighted loss.
Result: Theoretical analysis provides exact non-asymptotic guarantees characterizing the excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.
Conclusion: The proposed method effectively improves conditional coverage in conformal prediction by directly minimizing the mean squared error of conditional coverage through refined quantile regression with density-weighted loss, outperforming standard approaches.
Abstract: While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distribution-free conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at $(1-\alpha \pm \delta)$, subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.
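The surrogate itself is compact: weight the pinball loss at level 1−α by a finite-difference density estimate obtained from the auxiliary heads at 1−α±δ. A sketch with constant quantile "heads" standing in for network outputs; detaching the weights is one natural reading of fine-tuning only the central quantile.

```python
import torch

def pinball(residual, level):
    """Standard pinball (quantile) loss at the given level."""
    return torch.maximum(level * residual, (level - 1) * residual)

def density_weighted_pinball(q_lo, q_mid, q_hi, y, alpha, delta):
    """Central pinball term weighted by a finite-difference density estimate,
    roughly 2*delta / (q_{1-a+d} - q_{1-a-d}); weights are detached so only
    the central head receives gradient (one reading of the paper's setup)."""
    w = (2 * delta) / (q_hi - q_lo).clamp(min=1e-6)
    return (w.detach() * pinball(y - q_mid, 1 - alpha)).mean()

alpha, delta = 0.1, 0.05
y = torch.randn(128)
q_lo = torch.full_like(y, y.quantile(1 - alpha - delta).item())
q_mid = torch.full_like(y, y.quantile(1 - alpha).item())
q_hi = torch.full_like(y, y.quantile(1 - alpha + delta).item())
print(density_weighted_pinball(q_lo, q_mid, q_hi, y, alpha, delta).item())
```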
[405] Paired Seed Evaluation: Statistical Reliability for Learning-Based Simulators
Udit Sharma
Main category: cs.LG
TL;DR: Paired seed evaluation design improves statistical efficiency in ML system comparisons by using identical random seeds across alternatives, reducing variance through seed-level correlation.
Details
Motivation: Machine learning systems appear stochastic but are deterministically random due to seeded pseudorandom number generators. Current evaluation methods for comparing algorithms/designs/interventions suffer from high variance from random initialization and learning stochasticity, and standard independent evaluation fails to exploit shared randomness across alternatives.Method: Formalizes a paired seed evaluation design where competing systems are evaluated under identical random seeds, inducing matched realisations of stochastic components. This creates strict variance reduction when outcomes are positively correlated at the seed level.
Result: Paired seed evaluation yields tighter confidence intervals, higher statistical power, and effective sample size gains at fixed computational budgets. Empirical results show seed-level correlations are typically large and positive, producing order-of-magnitude efficiency gains.
Conclusion: Paired seed evaluation is weakly dominant in practice - it improves statistical reliability when correlation is present and reduces to independent evaluation without loss of validity when correlation is absent.
Abstract: Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across executions. Learning-based simulators are widely used to compare algorithms, design choices, and interventions under such dynamics, yet evaluation outcomes often exhibit high variance due to random initialisation and learning stochasticity. We analyse the statistical structure of comparative evaluation in these settings and show that standard independent evaluation designs fail to exploit shared sources of randomness across alternatives. We formalise a paired seed evaluation design in which competing systems are evaluated under identical random seeds, inducing matched realisations of stochastic components and strict variance reduction whenever outcomes are positively correlated at the seed level. This yields tighter confidence intervals, higher statistical power, and effective sample size gains at fixed computational budgets. Empirically, seed-level correlations are typically large and positive, producing order-of-magnitude efficiency gains. Paired seed evaluation is weakly dominant in practice, improving statistical reliability when correlation is present and reducing to independent evaluation without loss of validity when it is not.
[406] Micro-Macro Tensor Neural Surrogates for Uncertainty Quantification in Collisional Plasma
Wei Chen, Giacomo Dimarco, Lorenzo Pareschi
Main category: cs.LG
TL;DR: A variance-reduced Monte Carlo framework using neural network surrogates for uncertainty quantification in plasma kinetic equations, specifically the Vlasov-Poisson-Landau system.
Details
Motivation: Plasma kinetic equations are highly sensitive to microscopic perturbations, making reliable uncertainty quantification essential. Traditional methods face challenges from sampling costs, high-dimensional phase space, multiscale stiffness, and collision term complexities.
Method: Couples high-fidelity VPL solver with inexpensive neural network surrogates (VPFP and EP models). Uses generalized separable physics-informed neural networks (SPINN) with anisotropic micro-macro decomposition to reduce dimensionality. Calibrates VPFP model and designs asymptotic-preserving SPINN to improve correlation with VPL.
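The core variance-reduction trick can be shown with a generic control-variate estimator; the two functions below merely stand in for a costly high-fidelity solve and a cheap correlated surrogate (they are not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_high(z):        # stand-in for an expensive high-fidelity VPL evaluation
    return np.sin(z) + 0.05 * z**2

def f_surrogate(z):   # stand-in for a cheap, strongly correlated surrogate
    return np.sin(z)

z_few = rng.normal(size=50)        # budget-limited high-fidelity samples
z_many = rng.normal(size=50_000)   # abundant surrogate-only samples

# E[f_high] = E[f_high - f_surrogate] + E[f_surrogate]; the first term has
# small variance when surrogate and solver are strongly correlated.
estimate = (f_high(z_few) - f_surrogate(z_few)).mean() + f_surrogate(z_many).mean()
```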
Result: Substantial variance reduction over standard Monte Carlo, accurate statistics with far fewer high-fidelity samples, lower wall-clock time, and robustness to stochastic dimension.
Conclusion: The framework successfully addresses computational challenges in plasma kinetic UQ by combining high-fidelity solvers with efficient neural network surrogates, enabling practical uncertainty quantification for complex plasma systems.
Abstract: Plasma kinetic equations exhibit pronounced sensitivity to microscopic perturbations in model parameters and data, making reliable and efficient uncertainty quantification (UQ) essential for predictive simulations. However, the cost of uncertainty sampling, the high-dimensional phase space, and multiscale stiffness pose severe challenges to both computational efficiency and error control in traditional numerical methods. These aspects are further emphasized in the presence of collisions, where the high-dimensional nonlocal collision integrations and conservation properties pose severe constraints. To overcome this, we present a variance-reduced Monte Carlo framework for UQ in the Vlasov–Poisson–Landau (VPL) system, in which neural network surrogates replace the multiple costly evaluations of the Landau collision term. The method couples a high-fidelity, asymptotic-preserving VPL solver with inexpensive, strongly correlated surrogates based on the Vlasov–Poisson–Fokker–Planck (VPFP) and Euler–Poisson (EP) equations. For the surrogate models, we introduce a generalization of the separable physics-informed neural network (SPINN), developing a class of tensor neural networks based on an anisotropic micro-macro decomposition, to reduce velocity-moment costs, model complexity, and the curse of dimensionality. To further increase correlation with VPL, we calibrate the VPFP model and design an asymptotic-preserving SPINN whose small- and large-Knudsen limits recover the EP and VP systems, respectively. Numerical experiments show substantial variance reduction over standard Monte Carlo, accurate statistics with far fewer high-fidelity samples, and lower wall-clock time, while maintaining robustness to stochastic dimension.
[407] Early Prediction of Sepsis using Heart Rate Signals and Genetic Optimized LSTM Algorithm
Alireza Rafiei, Farshid Hajati, Alireza Rezaee, Amirhossien Panahi, Shahadat Uddin
Main category: cs.LG
TL;DR: Researchers developed four machine learning models optimized for wearable devices to predict sepsis onset using heart rate data, achieving promising results for early detection outside ICU settings.
Details
Motivation: Sepsis causes high mortality and healthcare costs, but existing prediction models focus mainly on ICU patients. There's a significant gap in early sepsis detection methods for non-ward settings where wearable devices could provide continuous monitoring.
Method: Developed four novel machine learning algorithms specifically designed for wearable devices using heart rate data. Used genetic algorithm to optimize model architecture for performance, computational complexity, and memory requirements. Applied transfer learning to extend prediction window from 1 hour to 4 hours.
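A skeletal genetic-algorithm loop over LSTM hyperparameters looks like the following; the gene layout, rates, and the stubbed fitness (which in the paper's setting would train a model and penalize complexity and memory) are all illustrative:

```python
import random

random.seed(0)
SEARCH = {"units": [16, 32, 64, 128], "layers": [1, 2, 3], "dropout": [0.0, 0.2, 0.4]}

def fitness(genes):
    # Placeholder: in practice, train/evaluate an LSTM with these genes and
    # subtract penalties for computational complexity and memory footprint.
    return -abs(genes["units"] - 64) - 5 * genes["layers"] * genes["dropout"]

def random_genes():
    return {k: random.choice(v) for k, v in SEARCH.items()}

def crossover(p1, p2):
    return {k: random.choice([p1[k], p2[k]]) for k in SEARCH}

def mutate(genes, rate=0.2):
    return {k: (random.choice(v) if random.random() < rate else genes[k])
            for k, v in SEARCH.items()}

population = [random_genes() for _ in range(20)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                        # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children
best = max(population, key=fitness)
```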
Result: The models demonstrated encouraging performance metrics suitable for implementation on wearable devices with accurate heart rate monitoring capabilities, showing feasibility for early sepsis detection outside traditional healthcare settings.
Conclusion: Wearable technology shows promising potential for facilitating early sepsis detection in non-ICU and non-ward environments, which could help reduce adverse outcomes through timely intervention.
Abstract: Sepsis, characterized by a dysregulated immune response to infection, results in significant mortality, morbidity, and healthcare costs. The timely prediction of sepsis progression is crucial for reducing adverse outcomes through early intervention. Despite the development of numerous models for Intensive Care Unit (ICU) patients, there remains a notable gap in approaches for the early detection of sepsis in non-ward settings. This research introduces and evaluates four novel machine learning algorithms designed for predicting the onset of sepsis on wearable devices by analyzing heart rate data. The architecture of these models was refined through a genetic algorithm, optimizing for performance, computational complexity, and memory requirements. Performance metrics were subsequently extracted for each model to evaluate their feasibility for implementation on wearable devices capable of accurate heart rate monitoring. The models were initially tailored for a prediction window of one hour, later extended to four hours through transfer learning. The encouraging outcomes of this study suggest the potential for wearable technology to facilitate early sepsis detection outside ICU and ward environments.
[408] Empower Low-Altitude Economy: A Reliability-Aware Dynamic Weighting Allocation for Multi-modal UAV Beam Prediction
Haojin Li, Anbang Zhang, Chen Sun, Chenyuan Feng, Kaiqian Qu, Tony Q. S. Quek, Haijun Zhang
Main category: cs.LG
TL;DR: SaM2B: A semantic-aware multi-modal beam prediction framework with reliability-aware dynamic weighting and cross-modal contrastive learning for UAV communications in low-altitude economy.
Details
Motivation: Current multi-modal beam prediction methods use fixed/empirical weights assuming equal modality reliability, but modality importance fluctuates dramatically with UAV motion scenarios. Static weighting amplifies negative impacts of degraded modalities, and modal mismatch/weak alignment undermine cross-scenario generalization.
Method: Proposes SaM2B framework with: 1) Reliability-aware dynamic weighting scheme that adaptively allocates contributions across modalities (environmental visual, flight posture, geospatial data) based on real-time reliability; 2) Cross-modal contrastive learning to align “multi-source representation beam semantics” to a shared semantic space, enhancing discriminative power and robustness under modal noise and distribution shifts.
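One plausible reading of the dynamic-weighting step, as a toy NumPy sketch: per-modality reliability scores become softmax weights that gate each embedding before fusion (the reliability estimator itself, the paper's key component, is stubbed as given numbers):

```python
import numpy as np

def reliability_weighted_fusion(embeddings, reliabilities, temperature=1.0):
    """embeddings: (num_modalities, dim); reliabilities: (num_modalities,).
    A degraded modality (low reliability) is adaptively down-weighted."""
    logits = np.asarray(reliabilities, dtype=float) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ embeddings, weights

rng = np.random.default_rng(0)
vision, posture, geo = rng.normal(size=(3, 64))   # toy modality embeddings
fused, w = reliability_weighted_fusion(
    np.stack([vision, posture, geo]),
    reliabilities=[2.1, 0.3, 1.5],                # e.g. blurred camera scores low
)
```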
Result: Experiments on real-world low-altitude UAV datasets show SaM2B achieves more satisfactory results than baseline methods.
Conclusion: SaM2B effectively addresses modality reliability fluctuations and alignment issues in multi-modal beam prediction for UAV communications, improving performance over static weighting approaches through adaptive dynamic weighting and semantic alignment.
Abstract: The low-altitude economy (LAE) is rapidly expanding driven by urban air mobility, logistics drones, and aerial sensing, while fast and accurate beam prediction in uncrewed aerial vehicle (UAV) communications is crucial for achieving reliable connectivity. Current research is shifting from single-signal to multi-modal collaborative approaches. However, existing multi-modal methods mostly employ fixed or empirical weights, assuming equal reliability across modalities at any given moment. Indeed, the importance of different modalities fluctuates dramatically with UAV motion scenarios, and static weighting amplifies the negative impact of degraded modalities. Furthermore, modal mismatch and weak alignment further undermine cross-scenario generalization. To this end, we propose a reliability-aware dynamic weighting scheme applied to a semantic-aware multi-modal beam prediction framework, named SaM2B. Specifically, SaM2B leverages lightweight cues such as environmental visual, flight posture, and geospatial data to adaptively allocate contributions across modalities at different time points through reliability-aware dynamic weight updates. Moreover, by utilizing cross-modal contrastive learning, we align the “multi-source representation beam semantics” associated with specific beam information to a shared semantic space, thereby enhancing discriminative power and robustness under modal noise and distribution shifts. Experiments on real-world low-altitude UAV datasets show that SaM2B achieves more satisfactory results than baseline methods.
[409] Tubular Riemannian Laplace Approximations for Bayesian Neural Networks
Rodrigo Pereira David
Main category: cs.LG
TL;DR: TRL is a new Bayesian approximation method that models neural network posteriors as probabilistic tubes along low-loss valleys, using Riemannian geometry to separate uncertainty types and achieving ensemble-grade calibration at 1/5 the training cost.
Details
Motivation: Traditional Laplace approximations struggle with the anisotropic, curved loss surfaces and large symmetry groups in modern deep neural networks, failing to adapt to their complex geometric structure.Method: TRL models the posterior as a probabilistic tube following low-loss valleys induced by functional symmetries. It uses Fisher/Gauss-Newton metrics to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty, operating as a scalable reparametrized Gaussian approximation with implicit curvature estimates.
Result: TRL achieves excellent calibration on ResNet-18 (CIFAR-10 and CIFAR-100), matching or exceeding Deep Ensembles in terms of Expected Calibration Error (ECE) while requiring only 1/5 of the training cost.
Conclusion: TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability, offering a practical Bayesian approximation method for modern deep neural networks.
Abstract: Laplace approximations are among the simplest and most practical methods for approximate Bayesian inference in neural networks, yet their Euclidean formulation struggles with the highly anisotropic, curved loss surfaces and large symmetry groups that characterize modern deep models. Recent work has proposed Riemannian and geometric Gaussian approximations to adapt to this structure. Building on these ideas, we introduce the Tubular Riemannian Laplace (TRL) approximation. TRL explicitly models the posterior as a probabilistic tube that follows a low-loss valley induced by functional symmetries, using a Fisher/Gauss-Newton metric to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty. We interpret TRL as a scalable reparametrised Gaussian approximation that utilizes implicit curvature estimates to operate in high-dimensional parameter spaces. Our empirical evaluation on ResNet-18 (CIFAR-10 and CIFAR-100) demonstrates that TRL achieves excellent calibration, matching or exceeding the reliability of Deep Ensembles (in terms of ECE) while requiring only a fraction (1/5) of the training cost. TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability.
[410] Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning
Soham Pahari, M. Srinivas
Main category: cs.LG
TL;DR: ViReLoc is a visual reasoning framework for navigation and localization that uses only visual representations instead of text, learning spatial dependencies and geometric relations through reinforcement learning and contrastive learning.
Details
Motivation: Current multimodal reasoning systems rely too heavily on textual information, limiting their effectiveness in spatial tasks like visual navigation and geo-localization where visual understanding of spatial relationships is crucial.
Method: Proposes Geo-Consistent Visual Planning framework (ViReLoc) that performs planning and localization using visual representations only. Uses reinforcement learning objectives for step-by-step visual inference, integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences.
Result: Experiments show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance across diverse navigation and localization scenarios. The system can perform tasks without real-time GPS data.
Conclusion: Visual reasoning is a strong complementary approach for navigation and localization, enabling secure navigation solutions without reliance on GPS data, addressing limitations of text-based reasoning in spatial tasks.
Abstract: Multimodal intelligence has recently shown strong progress in visual understanding and high-level reasoning. However, most reasoning systems still rely on textual information as the main medium for inference. This limits their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discusses the potential scope of this field and proposes a visual reasoning paradigm, Geo-Consistent Visual Planning, realized in a framework called Visual Reasoning for Localization, or ViReLoC, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often fails to capture. By encoding step-by-step inference in the visual domain and optimizing with reinforcement-based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real-time global positioning system (GPS) data, leading to more secure navigation solutions.
[411] Efficient Inference for Inverse Reinforcement Learning and Dynamic Discrete Choice Models
Lars van der Laan, Aurelien Bibaut, Nathan Kallus
Main category: cs.LG
TL;DR: A semiparametric framework for debiased inverse reinforcement learning that enables statistically efficient inference for reward-dependent functionals in IRL and DDC models, combining flexible ML estimation with valid statistical guarantees.
Details
Motivation: Existing IRL methods lack statistical guarantees while classical DDC approaches are too restrictive and computationally intensive. There's a need for a framework that combines flexible machine learning with valid inference for sequential decision-making models.
Method: Develops a semiparametric framework showing log-behavior policy acts as pseudo-reward, formalizes targets as smooth functionals, derives efficient influence functions, and constructs debiased machine-learning estimators that allow nonparametric nuisance estimation.
Result: Achieves √n-consistency, asymptotic normality, and semiparametric efficiency for policy values and normalized rewards. Extends classical DDC inference to nonparametric rewards with modern ML tools.
Conclusion: Provides a unified, computationally tractable approach to statistical inference in IRL that bridges the gap between flexible ML methods and classical parametric DDC models.
Abstract: Inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models explain sequential decision-making by recovering reward functions that rationalize observed behavior. Flexible IRL methods typically rely on machine learning but provide no guarantees for valid inference, while classical DDC approaches impose restrictive parametric specifications and often require repeated dynamic programming. We develop a semiparametric framework for debiased inverse reinforcement learning that yields statistically efficient inference for a broad class of reward-dependent functionals in maximum entropy IRL and Gumbel-shock DDC models. We show that the log-behavior policy acts as a pseudo-reward that point-identifies policy value differences and, under a simple normalization, the reward itself. We then formalize these targets, including policy values under known and counterfactual softmax policies and functionals of the normalized reward, as smooth functionals of the behavior policy and transition kernel, establish pathwise differentiability, and derive their efficient influence functions. Building on this characterization, we construct automatic debiased machine-learning estimators that allow flexible nonparametric estimation of nuisance components while achieving $\sqrt{n}$-consistency, asymptotic normality, and semiparametric efficiency. Our framework extends classical inference for DDC models to nonparametric rewards and modern machine-learning tools, providing a unified and computationally tractable approach to statistical inference in IRL.
[412] Sparse classification with positive-confidence data in high dimensions
The Tien Mai, Mai Anh Nguyen, Trung Nghia Nguyen
Main category: cs.LG
TL;DR: Proposes sparse regularization methods for high-dimensional Positive-Confidence (Pconf) classification, using Lasso, SCAD, and MCP penalties to enable effective prediction and variable selection with only positive samples and confidence scores.
Details
Motivation: Sparse regularization techniques are well-established for fully supervised high-dimensional learning but remain underexplored for weak-supervision settings like Pconf classification, which uses only positive samples with confidence scores. Existing Pconf methods are ill-suited for high-dimensional regimes where feature dimension exceeds sample size.
Method: Introduces a novel sparse-penalization framework for high-dimensional Pconf classification using convex (Lasso) and non-convex (SCAD, MCP) penalties to address shrinkage bias and improve feature recovery. Develops an efficient proximal gradient algorithm to solve the composite objective function.
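A minimal proximal-gradient (ISTA) sketch for the Lasso-penalized case: the smooth term is a generic Pconf-style logistic risk built only from positive samples and their confidences r_i, following the standard Pconf risk rewriting; the paper's exact objective, penalties (SCAD/MCP), and step-size rules may differ:

```python
import numpy as np

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def pconf_grad(w, X, r):
    """Gradient of a Pconf-style logistic risk on positives X with
    confidences r: loss(z) + ((1 - r) / r) * loss(-z), where z = X @ w."""
    z = X @ w
    sig = 1.0 / (1.0 + np.exp(-z))
    g = (sig - 1.0) + ((1.0 - r) / r) * sig     # d/dz of the two loss terms
    return X.T @ g / len(r)

def ista(X, r, lam=0.05, step=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = soft_threshold(w - step * pconf_grad(w, X, r), step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                 # p > n, the target regime
r = rng.uniform(0.6, 0.99, size=200)            # confidences of positive samples
w_hat = ista(X, r)                              # sparse coefficient estimate
```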
Result: Theoretically establishes estimation and prediction error bounds for the L1-regularized Pconf estimator, proving it achieves near minimax-optimal sparse recovery rates under Restricted Strong Convexity condition. Simulations show proposed methods achieve predictive performance and variable selection accuracy comparable to fully supervised approaches.
Conclusion: The proposed framework effectively bridges the gap between weak supervision and high-dimensional statistics, enabling effective sparse regularization for Pconf classification in high-dimensional settings where only positive samples with confidence scores are available.
Abstract: High-dimensional learning problems, where the number of features exceeds the sample size, often require sparse regularization for effective prediction and variable selection. While established for fully supervised data, these techniques remain underexplored in weak-supervision settings such as Positive-Confidence (Pconf) classification. Pconf learning utilizes only positive samples equipped with confidence scores, thereby avoiding the need for negative data. However, existing Pconf methods are ill-suited for high-dimensional regimes. This paper proposes a novel sparse-penalization framework for high-dimensional Pconf classification. We introduce estimators using convex (Lasso) and non-convex (SCAD, MCP) penalties to address shrinkage bias and improve feature recovery. Theoretically, we establish estimation and prediction error bounds for the L1-regularized Pconf estimator, proving it achieves near minimax-optimal sparse recovery rates under a Restricted Strong Convexity condition. To solve the resulting composite objective, we develop an efficient proximal gradient algorithm. Extensive simulations demonstrate that our proposed methods achieve predictive performance and variable selection accuracy comparable to fully supervised approaches, effectively bridging the gap between weak supervision and high-dimensional statistics.
[413] Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics
Akash Samanta, Sheldon Williamson
Main category: cs.LG
TL;DR: A diagnostic-driven adaptive learning framework that decomposes error evolution into bias, noise, and alignment components to provide stable adaptation in dynamic environments across supervised optimization, RL, and learned optimizers.
Details
Motivation: Existing learning methods in nonstationary, safety-critical environments suffer from instability, slow convergence, or brittle adaptation because they ignore the temporal structure of error signals while focusing only on gradient statistics.
Method: Proposes a diagnostic-driven framework that models error evolution through decomposition into: bias (persistent drift), noise (stochastic variability), and alignment (repeated directional excitation causing overshoot). Computes these diagnostics online from lightweight statistics of loss or temporal-difference error trajectories, independent of model architecture or task domain.
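One plausible online instantiation of the three diagnostics, computed from a scalar loss or TD-error stream with exponential moving averages (the paper's exact statistics and how they gate adaptation are not reproduced here):

```python
class ErrorDiagnostics:
    """Track bias (persistent drift), noise (stochastic variability) and
    alignment (repeated directional excitation) of an error stream online."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.bias = 0.0          # EMA of the error itself
        self.noise = 0.0         # EMA of squared deviation from the bias
        self.align = 0.0         # EMA of direction agreement of increments
        self._prev_err = None
        self._prev_delta = 0.0

    def update(self, err):
        b = self.beta
        self.bias = b * self.bias + (1 - b) * err
        self.noise = b * self.noise + (1 - b) * (err - self.bias) ** 2
        if self._prev_err is not None:
            delta = err - self._prev_err
            same_direction = 1.0 if delta * self._prev_delta > 0 else 0.0
            self.align = b * self.align + (1 - b) * same_direction
            self._prev_delta = delta
        self._prev_err = err
        return self.bias, self.noise, self.align

diag = ErrorDiagnostics()
for err in [0.9, 0.8, 0.85, 0.7, 0.65, 0.6]:     # toy TD-error trajectory
    bias, noise, align = diag.update(err)
```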
Result: The bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic RL, and learned optimizers. Derived diagnostic-driven instantiations include: stabilized supervised optimizer, diagnostic-regulated actor-critic scheme, and diagnostic-conditioned learned optimizer. Established bounded effective updates and stability properties under standard smoothness assumptions.
Conclusion: Elevates error evolution to a first-class object in adaptive learning, providing an interpretable, lightweight foundation for reliable learning in dynamic environments. The framework offers diagnostic illustrations showing how signals modulate adaptation in response to temporal-difference error structure.
Abstract: Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Building on this framework, we derive diagnostic-driven instantiations including a stabilized supervised optimizer, a diagnostic-regulated actor-critic scheme, and a diagnostic-conditioned learned optimizer. Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to temporal-difference error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
[414] Generative forecasting with joint probability models
Patrick Wyrod, Ashesh Chattopadhyay, Daniele Venturi
Main category: cs.LG
TL;DR: Joint generative forecasting learns probability distributions over lagged system states for chaotic systems, enabling better uncertainty quantification and long-range statistical accuracy than conventional next-step models.
Details
Motivation: Chaotic systems have fundamental forecasting limitations due to sensitivity to initial conditions and unresolved multiscale processes. Existing generative models focus on next-step prediction rather than capturing underlying dynamic structure.
Method: Reframe forecasting as learning joint probability distributions over lagged system states in short temporal windows, with forecasts obtained through marginalization. Introduce model-agnostic training/inference framework with three uncertainty metrics: ensemble variance, short-horizon autocorrelation, and cumulative Wasserstein drift.
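The three ground-truth-free metrics are straightforward to compute from an ensemble of sampled trajectories; a NumPy/SciPy sketch with illustrative shapes:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def uq_metrics(ensemble):
    """ensemble: (n_members, n_steps) array of sampled forecast trajectories."""
    # 1) Ensemble variance per step: spread of plausible evolutions.
    variance = ensemble.var(axis=0)
    # 2) Short-horizon (lag-1) autocorrelation, averaged over members.
    autocorr = np.mean([np.corrcoef(m[:-1], m[1:])[0, 1] for m in ensemble])
    # 3) Cumulative Wasserstein drift of the step-t marginal vs. step 0.
    drift = np.cumsum([wasserstein_distance(ensemble[:, 0], ensemble[:, t])
                       for t in range(ensemble.shape[1])])
    return variance, autocorr, drift

rng = np.random.default_rng(0)
ensemble = np.cumsum(rng.normal(size=(64, 100)), axis=1)   # toy trajectories
var, ac, drift = uq_metrics(ensemble)
```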
Result: Joint generative models show improved short-term predictive skill, preserve attractor geometry, and achieve substantially more accurate long-range statistical behavior on Lorenz-63 and Kuramoto-Sivashinsky systems compared to conventional conditional next-step models.
Conclusion: Learning joint distributions over system states provides a more comprehensive generative approach to chaotic system forecasting, enabling better uncertainty quantification and capturing of underlying dynamic structure beyond next-step prediction.
Abstract: Chaotic dynamical systems exhibit strong sensitivity to initial conditions and often contain unresolved multiscale processes, making deterministic forecasting fundamentally limited. Generative models offer an appealing alternative by learning distributions over plausible system evolutions; yet, most existing approaches focus on next-step conditional prediction rather than the structure of the underlying dynamics. In this work, we reframe forecasting as a fully generative problem by learning the joint probability distribution of lagged system states over short temporal windows and obtaining forecasts through marginalization. This new perspective allows the model to capture nonlinear temporal dependencies, represent multistep trajectory segments, and produce next-step predictions consistent with the learned joint distribution. We also introduce a general, model-agnostic training and inference framework for joint generative forecasting and show how it enables assessment of forecast robustness and reliability using three complementary uncertainty quantification metrics (ensemble variance, short-horizon autocorrelation, and cumulative Wasserstein drift), without access to ground truth. We evaluate the performance of the proposed method on two canonical chaotic dynamical systems, the Lorenz-63 system and the Kuramoto-Sivashinsky equation, and show that joint generative models yield improved short-term predictive skill, preserve attractor geometry, and achieve substantially more accurate long-range statistical behaviour than conventional conditional next-step models.
[415] HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors
Hyunjun Kim
Main category: cs.LG
TL;DR: HOLOGRAPH is a framework that formalizes LLM-guided causal discovery using sheaf theory, representing local causal beliefs as sections over variable subsets and identifying global causal structure through sheaf cohomology.
Details
Motivation: Causal discovery from observational data faces fundamental identifiability constraints. While recent work uses LLMs as sources of prior causal knowledge, existing approaches rely on heuristic integration without theoretical grounding.
Method: Uses sheaf theory to represent local causal beliefs as sections of a presheaf over variable subsets. Introduces Algebraic Latent Projection for hidden confounders and Natural Gradient Descent on the belief manifold for optimization. Coherent global causal structure corresponds to existence of a global section, with topological obstructions manifesting as non-vanishing sheaf cohomology.
Result: Experiments on synthetic and real-world benchmarks show competitive performance on causal discovery tasks with 50-100 variables. Sheaf-theoretic analysis reveals Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), but Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections.
Conclusion: HOLOGRAPH provides rigorous mathematical foundations for LLM-guided causal discovery through sheaf theory, achieving competitive performance while revealing fundamental limitations in locality assumptions for larger graphs with hidden confounders.
Abstract: Causal discovery from observational data remains fundamentally limited by identifiability constraints. Recent work has explored leveraging Large Language Models (LLMs) as sources of prior causal knowledge, but existing approaches rely on heuristic integration that lacks theoretical grounding. We introduce HOLOGRAPH, a framework that formalizes LLM-guided causal discovery through sheaf theory, representing local causal beliefs as sections of a presheaf over variable subsets. Our key insight is that coherent global causal structure corresponds to the existence of a global section, while topological obstructions manifest as non-vanishing sheaf cohomology. We propose the Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations while achieving competitive performance on causal discovery tasks with 50-100 variables. Our sheaf-theoretic analysis reveals that while Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections. Code is available at https://github.com/hyunjun1121/holograph.
[416] Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal
Main category: cs.LG
TL;DR: Small proxy models used for data recipe evaluation give unreliable results because fixed training configurations ignore data-dependent optimal hyperparameters. Using reduced learning rates improves correlation with full-scale LLM performance.
Details
Motivation: Data teams use small proxy models to evaluate pretraining data recipes, but there's limited understanding of whether conclusions from small-scale experiments reliably transfer to full-scale training. The standard protocol uses identical training configurations for all data recipes, which may not reflect how data quality affects optimal training hyperparameters.
Method: The paper identifies a flaw in standard evaluation protocols and proposes a simple patch: using reduced learning rates for proxy model training. Theoretically, they prove this preserves dataset ordering for random-feature models. Empirically, they validate across 23 data recipes covering four critical dimensions of data curation.
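Schematically, the patched protocol only changes one knob; the sketch below assumes a hypothetical train_proxy routine and made-up scores, and checks rank agreement between proxy and full-scale runs:

```python
from scipy.stats import spearmanr

def evaluate_recipes(recipes, train_proxy, lr_scale=0.1):
    """Train one proxy per data recipe at a REDUCED learning rate
    (lr_scale * base LR) instead of one fixed 'fair' configuration."""
    return {name: train_proxy(data, lr_scale=lr_scale)
            for name, data in recipes.items()}

# Hypothetical benchmark scores for five recipes (illustrative numbers only):
proxy_scores = {"A": 0.61, "B": 0.64, "C": 0.58, "D": 0.66, "E": 0.60}
full_scores = {"A": 0.71, "B": 0.74, "C": 0.69, "D": 0.77, "E": 0.70}
rho, _ = spearmanr([proxy_scores[k] for k in "ABCDE"],
                   [full_scores[k] for k in "ABCDE"])
# A high rank correlation rho means the proxy ordering transfers to scale.
```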
Result: The reduced learning rate approach yields relative performance that strongly correlates with fully tuned large-scale LLM pretraining runs. This dramatically improves reliability of small-scale experiments, as conclusions about data quality can flip with minor hyperparameter adjustments in the standard fixed-configuration protocol.
Conclusion: Data recipe assessment should aim to identify recipes that yield best performance under data-specific tuning, not under fixed configurations. The reduced learning rate patch provides a practical, low-cost solution that improves correlation with full-scale training outcomes while preserving theoretical guarantees.
Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of “fair” comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a standard step. Consequently, we posit that the objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning. To mitigate the high cost of hyperparameter tuning, we introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training. We show that this approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Theoretically, we prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation, demonstrating dramatic improvements in the reliability of small-scale experiments.
[417] Generalising E-prop to Deep Networks
Beren Millidge
Main category: cs.LG
TL;DR: Extends E-prop framework to deep recurrent networks, enabling online learning across both time and depth without backpropagation through time.
Details
Motivation: BPTT is biologically implausible due to backward replay requirements, while RTRL/E-prop are limited to single-layer recurrent networks. Real brain learning involves hierarchical dynamics across both depth and time.
Method: Extends E-prop framework to arbitrarily deep networks by deriving novel recursion relationships across depth, extending eligibility traces to deeper layers while maintaining online forward updates.
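For orientation, here is the basic single-layer eligibility-trace mechanic that the paper generalizes across depth, sketched for a leaky linear unit (the depth recursion itself is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_rec = 50, 8, 4
alpha = 0.9                                   # leak factor of the hidden state
W_in = rng.normal(scale=0.1, size=(n_rec, n_in))

x = rng.normal(size=(T, n_in))
h = np.zeros(n_rec)
trace = np.zeros((n_rec, n_in))               # eligibility trace per synapse
grad = np.zeros_like(W_in)

for t in range(T):
    h = alpha * h + W_in @ x[t]               # simple leaky linear dynamics
    # dh_t/dW propagated FORWARD in time: e_t = alpha * e_{t-1} + x_t.
    trace = alpha * trace + np.outer(np.ones(n_rec), x[t])
    learning_signal = h - 1.0                 # stand-in for dL_t/dh_t
    grad += learning_signal[:, None] * trace  # online gradient, no replay
```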
Result: Demonstrates an online learning algorithm that can perform accurate credit assignment across both time and depth simultaneously, enabling training of deep recurrent networks without BPTT.
Conclusion: Provides biologically plausible alternative to BPTT for training deep recurrent networks by extending E-prop’s eligibility trace mechanism to handle hierarchical dynamics across both temporal and spatial dimensions.
Abstract: Recurrent networks are typically trained with backpropagation through time (BPTT). However, BPTT requires storing the history of all states in the network and then replaying them sequentially backwards in time. This computation appears extremely implausible for the brain to implement. Real Time Recurrent Learning (RTRL) proposes a mathematically equivalent alternative where gradient information is propagated forwards in time locally alongside the regular forward pass; however, it has significantly greater computational complexity than BPTT, which renders it impractical for large networks. E-prop proposes an approximation of RTRL which reduces its complexity to the level of BPTT while maintaining a purely online forward update which can be implemented by an eligibility trace at each synapse. However, works on RTRL and E-prop ubiquitously investigate learning in a single layer with recurrent dynamics, whereas learning in the brain spans multiple layers and consists of hierarchical dynamics in depth as well as time. In this mathematical note, we extend the E-prop framework to handle arbitrarily deep networks, deriving a novel recursion relationship across depth which extends the eligibility traces of E-prop to deeper layers. Our results thus demonstrate that an online learning algorithm can perform accurate credit assignment across both time and depth simultaneously, allowing the training of deep recurrent networks without backpropagation through time.
[418] More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization
Yuma Ichikawa, Yoshihiko Fujisawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa
Main category: cs.LG
TL;DR: MDBF improves extreme low-bit LLM quantization by replacing DBF’s restrictive single envelope with rank-l envelopes while keeping binary sign bases, boosting accuracy without changing inference efficiency.
Details
Motivation: Double Binary Factorization (DBF) is attractive for extreme low-bit LLM quantization due to efficient inference, but its scaling parameters are too restrictive: all rank components share the same magnitude profile, causing performance saturation.
Method: Propose Multi-envelope DBF (MDBF) that retains shared 1-bit sign bases but replaces the single envelope with a rank-l envelope. Shares sign matrices among envelope components to maintain binary carrier while using memory budget for magnitude expressiveness. Includes closed-form initialization and alternating refinement optimization.
Result: Across LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
Conclusion: MDBF successfully addresses DBF’s limitations by improving magnitude expressiveness while maintaining efficient binary inference, achieving better accuracy-performance trade-offs for extreme low-bit LLM quantization.
Abstract: For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
[419] From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme
Xueyan Li, Yingyi Xue, Mengjie Jiang, Qingzi Zhu, Yazhe Niu
Main category: cs.LG
TL;DR: HUMOR is a framework that enhances VLMs for humorous meme generation through hierarchical reasoning and group-wise preference alignment.
Details
Motivation: Generating humorous memes requires nuanced multimodal reasoning beyond simple image-to-caption supervision, needing to bridge visual perception with subjective humor creation.Method: Uses hierarchical multi-path Chain-of-Thought reasoning with anchoring, plus group-wise pairwise reward modeling and reinforcement learning for preference alignment.
Result: HUMOR empowers VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality in extensive experiments.
Conclusion: Presents a general training paradigm for open-ended, human-aligned multimodal generation using comparative judgment within coherent output groups.
Abstract: Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision. It requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, which traces back from ground-truth captions, enhances reasoning diversity. We further analyze that this multi-path exploration with anchoring maintains a high expected humor quality, under the practical condition that high-quality paths retain significant probability mass. Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template. Following established theory, this approach ensures a consistent and robust proxy for human preference, even with subjective and noisy labels. The reward model then enables group-wise reinforcement learning optimization, providing a theoretical guarantee of monotonic improvement within the trust region. Extensive experiments show that HUMOR empowers various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality. Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output groups.
[420] CPR: Causal Physiological Representation Learning for Robust ECG Analysis under Distribution Shifts
Shunbo Jia, Caizhi Liao
Main category: cs.LG
TL;DR: CPR (Causal Physiological Representation Learning) improves ECG model robustness against adversarial attacks by learning causal representations of pathological features, achieving better performance than existing defenses while maintaining efficient inference.
Details
Motivation: Current ECG deep learning models are vulnerable to adversarial perturbations, especially Smooth Adversarial Perturbations (SAP) that mimic biological morphology. Existing defenses have critical limitations: Adversarial Training is computationally expensive, while certified methods like Randomized Smoothing introduce significant inference latency, making them impractical for real-time clinical monitoring.
Method: CPR incorporates a Physiological Structural Prior within a causal disentanglement framework. It models ECG generation via a Structural Causal Model (SCM) and enforces a structural intervention that strictly separates invariant pathological morphology (P-QRS-T complex) from non-causal artifacts, unlike standard denoising approaches that operate without semantic constraints.
Result: On PTB-XL dataset, CPR significantly outperforms standard clinical preprocessing methods. Under SAP attacks, CPR achieves F1 score of 0.632, surpassing Median Smoothing (0.541 F1) by 9.1%. CPR matches the certified robustness of Randomized Smoothing while maintaining single-pass inference efficiency.
Conclusion: CPR offers a superior trade-off between robustness, efficiency, and clinical interpretability by addressing the root cause of vulnerability - models’ reliance on non-robust spurious correlations rather than invariant pathological features.
Abstract: Deep learning models for Electrocardiogram (ECG) diagnosis have achieved remarkable accuracy but exhibit fragility against adversarial perturbations, particularly Smooth Adversarial Perturbations (SAP) that mimic biological morphology. Existing defenses face a critical dilemma: Adversarial Training (AT) provides robustness but incurs a prohibitive computational burden, while certified methods like Randomized Smoothing (RS) introduce significant inference latency, rendering them impractical for real-time clinical monitoring. We posit that this vulnerability stems from the models’ reliance on non-robust spurious correlations rather than invariant pathological features. To address this, we propose Causal Physiological Representation Learning (CPR). Unlike standard denoising approaches that operate without semantic constraints, CPR incorporates a Physiological Structural Prior within a causal disentanglement framework. By modeling ECG generation via a Structural Causal Model (SCM), CPR enforces a structural intervention that strictly separates invariant pathological morphology (P-QRS-T complex) from non-causal artifacts. Empirical results on PTB-XL demonstrate that CPR significantly outperforms standard clinical preprocessing methods. Specifically, under SAP attacks, CPR achieves an F1 score of 0.632, surpassing Median Smoothing (0.541 F1) by 9.1%. Crucially, CPR matches the certified robustness of Randomized Smoothing while maintaining single-pass inference efficiency, offering a superior trade-off between robustness, efficiency, and clinical interpretability.
[421] Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang
Main category: cs.LG
TL;DR: DLCM introduces hierarchical compression in LLMs, shifting computation from tokens to concepts for more efficient reasoning, achieving +2.69% improvement on benchmarks.
Details
Motivation: Current LLMs waste capacity on predictable tokens while under-allocating computation to semantically critical transitions due to uniform token processing, despite language's non-uniform information density.
Method: Dynamic Large Concept Models (DLCM) learn semantic boundaries from latent representations, compress tokens into variable-length concepts end-to-end, and use a decoupled μP parametrization for stable training across compression regimes.
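A toy sketch of the compression step: given per-token boundary decisions, contiguous tokens are mean-pooled into variable-length concept vectors at roughly the stated ratio R (the learned boundary predictor and end-to-end training are not shown):

```python
import numpy as np

def pool_concepts(tokens, boundaries):
    """tokens: (T, d) embeddings; boundaries: bool (T,) marking segment ends.
    Returns mean-pooled variable-length concept vectors."""
    concepts, start = [], 0
    for t, is_end in enumerate(boundaries):
        if is_end:
            concepts.append(tokens[start:t + 1].mean(axis=0))
            start = t + 1
    if start < len(tokens):                   # close a trailing open segment
        concepts.append(tokens[start:].mean(axis=0))
    return np.stack(concepts)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))
boundaries = rng.random(16) < 0.25            # roughly R = 4 tokens/concept
concepts = pool_concepts(tokens, boundaries)  # (n_concepts, 32)
```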
Result: DLCM achieves +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs, with compression ratio R=4 (4 tokens per concept), reallocating one-third of compute to higher-capacity reasoning.
Conclusion: Hierarchical compression fundamentally changes LLM scaling behavior, enabling more efficient compute allocation through compression-aware scaling laws and concept-level reasoning.
Abstract: Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first compression-aware scaling law, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a decoupled μP parametrization that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a +2.69% average improvement across 12 zero-shot benchmarks under matched inference FLOPs.
[422] AutoFed: Manual-Free Federated Traffic Prediction via Personalized Prompt
Zijian Zhao, Yitong Shang, Sen Li
Main category: cs.LG
TL;DR: AutoFed is a novel Personalized Federated Learning framework for traffic prediction that eliminates manual hyperparameter tuning using prompt learning techniques.
Details
Motivation: Traffic prediction is crucial for ITS but faces privacy concerns leading to data silos. Standard FL struggles with non-IID data, and current PFL frameworks require impractical hyperparameter optimization unavailable in real-world scenarios.
Method: AutoFed introduces a federated representor with client-aligned adapter to distill local data into a compact, globally shared prompt matrix. This prompt conditions a personalized predictor, enabling knowledge sharing while maintaining local specificity without manual tuning.
Result: Extensive experiments on real-world datasets show AutoFed consistently achieves superior performance across diverse scenarios compared to existing methods.
Conclusion: AutoFed provides a practical PFL solution for traffic prediction that overcomes hyperparameter dependency and enables effective privacy-preserving collaborative learning in real-world settings.
Abstract: Accurate traffic prediction is essential for Intelligent Transportation Systems, including ride-hailing, urban road planning, and vehicle fleet management. However, due to significant privacy concerns surrounding traffic data, most existing methods rely on local training, resulting in data silos and limited knowledge sharing. Federated Learning (FL) offers an efficient solution through privacy-preserving collaborative training; however, standard FL struggles with the non-independent and identically distributed (non-IID) problem among clients. This challenge has led to the emergence of Personalized Federated Learning (PFL) as a promising paradigm. Nevertheless, current PFL frameworks require further adaptation for traffic prediction tasks, such as specialized graph feature engineering, data processing, and network architecture design. A notable limitation of many prior studies is their reliance on hyper-parameter optimization across datasets, information that is often unavailable in real-world scenarios, thus impeding practical deployment. To address this challenge, we propose AutoFed, a novel PFL framework for traffic prediction that eliminates the need for manual hyper-parameter tuning. Inspired by prompt learning, AutoFed introduces a federated representor that employs a client-aligned adapter to distill local data into a compact, globally shared prompt matrix. This prompt then conditions a personalized predictor, allowing each client to benefit from cross-client knowledge while maintaining local specificity. Extensive experiments on real-world datasets demonstrate that AutoFed consistently achieves superior performance across diverse scenarios. The code of this paper is provided at https://github.com/RS2002/AutoFed .
[423] A Scalable Framework for logP Prediction: From Terabyte-Scale Data Integration to Interpretable Ensemble Modeling
Malikussaid, Septian Caesar Floresko, Ade Romadhony, Isman Kurniawan, Warih Maharani, Hilal Hudan Nuha
Main category: cs.LG
TL;DR: Large-scale logP prediction framework using 426K compounds from integrated databases, achieving 740x speedup in data processing and optimal performance with stratified ensemble models.
Details
Motivation: To develop a robust, large-scale predictive framework for lipophilicity (logP) prediction using well-curated data from multiple authoritative chemical databases, addressing data integration challenges and establishing competitive baselines with 2D descriptors.
Method: Created computational infrastructure with byte-offset indexing for 740x faster data processing; evaluated multiple modeling approaches including linear models, Random Forest, and XGBoost; implemented stratified modeling with specialized models for drug-like molecules (91%) and extreme cases (9%).
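The byte-offset pattern itself is simple: one linear scan records where each record starts, after which lookups seek directly instead of re-parsing the file (the TSV layout and key column below are hypothetical):

```python
def build_offset_index(path, key_field=0, sep=b"\t"):
    """One pass over a large record file: map record key -> byte offset."""
    index = {}
    with open(path, "rb") as f:
        offset = f.tell()
        for line in iter(f.readline, b""):    # readline keeps tell() accurate
            key = line.rstrip(b"\n").split(sep)[key_field].decode()
            index[key] = offset
            offset = f.tell()
    return index

def lookup(path, index, key):
    """O(1) retrieval: seek to the stored offset and read one line."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode().rstrip("\n")

# index = build_offset_index("compounds.tsv")          # hypothetical file
# row = lookup("compounds.tsv", index, "CHEMBL25")      # hypothetical key
```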
Result: Tree-based ensembles achieved R²=0.765 and RMSE=0.731; stratified strategy achieved optimal performance with RMSE=0.838 for drug-like subset and R²=0.767 for extreme molecules; molecular weight identified as most important predictor via SHAP analysis despite weak bivariate correlation.
Conclusion: Well-curated descriptor-based ensemble models remain competitive with graph neural networks; stratified modeling provides optimal performance; molecular weight is globally important despite weak bivariate correlation; framework provides actionable guidance for molecular design.
Abstract: This study presents a large-scale predictive modeling framework for logP prediction using 426,850 bioactive compounds rigorously curated from the intersection of three authoritative chemical databases: PubChem, ChEMBL, and eMolecules. We developed a novel computational infrastructure to address the data integration challenge, reducing processing time from a projected 100-plus days to 3.2 hours through a byte-offset indexing architecture, a 740-fold improvement. Our comprehensive analysis revealed critical insights into the multivariate nature of lipophilicity: while molecular weight exhibited weak bivariate correlation with logP, SHAP analysis on ensemble models identified it as the single most important predictor globally. We systematically evaluated multiple modeling approaches, discovering that linear models suffered from inherent heteroskedasticity that classical remediation strategies, including weighted least squares and Box-Cox transformation, failed to address. Tree-based ensemble methods, including Random Forest and XGBoost, proved inherently robust to this violation, achieving an R-squared of 0.765 and RMSE of 0.731 logP units on the test set. Furthermore, a stratified modeling strategy, employing specialized models for drug-like molecules (91 percent of dataset) and extreme cases (nine percent), achieved optimal performance: an RMSE of 0.838 for the drug-like subset and an R-squared of 0.767 for extreme molecules, the highest of all evaluated approaches. These findings provide actionable guidance for molecular design, establish robust baselines for lipophilicity prediction using only 2D descriptors, and demonstrate that well-curated, descriptor-based ensemble models remain competitive with state-of-the-art graph neural network architectures.
[424] HeteroHBA: A Generative Structure-Manipulating Backdoor Attack on Heterogeneous Graphs
Honglin Gao, Lan Zhao, Junhao Ren, Xiang Li, Gaoxi Xiao
Main category: cs.LG
TL;DR: HeteroHBA is a generative backdoor attack framework for heterogeneous graph neural networks that injects trigger nodes with diverse features and connections while maintaining stealthiness through distribution alignment.
Details
Motivation: Backdoor attacks on heterogeneous graphs are understudied despite the widespread use of HGNNs in real-world applications. Existing attacks may not effectively handle the heterogeneous nature of graphs with different node/edge types and features.
Method: 1) Saliency-based screening to select influential auxiliary neighbors for trigger attachment; 2) Synthesis of diverse trigger features and connection patterns matching local heterogeneous context; 3) Adaptive Instance Normalization (AdaIN) with MMD loss to align trigger feature distribution with benign statistics; 4) Bilevel optimization to jointly maximize attack success while preserving clean accuracy.
Result: HeteroHBA achieves higher attack success than prior backdoor baselines on multiple real-world heterogeneous graphs with various HGNN architectures, while maintaining comparable or smaller impact on clean accuracy. The attack remains effective even against heterogeneity-aware structural defense (CSD).
Conclusion: The paper demonstrates practical backdoor risks in heterogeneous graph learning and highlights the need for stronger defenses. HeteroHBA’s effectiveness shows that current defenses may be insufficient against sophisticated heterogeneity-aware attacks.
Abstract: Heterogeneous graph neural networks (HGNNs) have achieved strong performance in many real-world applications, yet targeted backdoor poisoning on heterogeneous graphs remains less studied. We consider backdoor attacks for heterogeneous node classification, where an adversary injects a small set of trigger nodes and connections during training to force specific victim nodes to be misclassified into an attacker-chosen label at test time while preserving clean performance. We propose HeteroHBA, a generative backdoor framework that selects influential auxiliary neighbors for trigger attachment via saliency-based screening and synthesizes diverse trigger features and connection patterns to better match the local heterogeneous context. To improve stealthiness, we combine Adaptive Instance Normalization (AdaIN) with a Maximum Mean Discrepancy (MMD) loss to align the trigger feature distribution with benign statistics, thereby reducing detectability, and we optimize the attack with a bilevel objective that jointly promotes attack success and maintains clean accuracy. Experiments on multiple real-world heterogeneous graphs with representative HGNN architectures show that HeteroHBA consistently achieves higher attack success than prior backdoor baselines with comparable or smaller impact on clean accuracy; moreover, the attack remains effective under our heterogeneity-aware structural defense, CSD. These results highlight practical backdoor risks in heterogeneous graph learning and motivate the development of stronger defenses.
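The stealthiness objective pairs AdaIN with an MMD term; the MMD part is standard and easy to sketch. A minimal RBF-kernel estimate of the squared MMD between trigger and benign feature batches (tensor names and the bandwidth are illustrative, not the paper's choices):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between feature batches x (n, d) and y (m, d) under an
    RBF kernel; biased estimator, which is fine as a training-loss sketch."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

trigger_feats, benign_feats = torch.randn(32, 64), torch.randn(128, 64)
loss_stealth = rbf_mmd2(trigger_feats, benign_feats)  # add to the attack objective
```

Driving this term down pulls the trigger feature distribution toward benign statistics, which is exactly the detectability argument the abstract makes.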
[425] Mobility-Assisted Decentralized Federated Learning: Convergence Analysis and A Data-Driven Approach
Reza Jahani, Md Farhamdur Reza, Richeng Jin, Huaiyu Dai
Main category: cs.LG
TL;DR: DFL performance degrades due to limited connectivity and data heterogeneity; user mobility can enhance information flow and improve DFL performance in sparse networks.
Details
Motivation: DFL suffers from performance degradation due to limited connectivity and data heterogeneity in sparse networks. User mobility, increasingly common in next-gen wireless networks, can let clients act as relays/bridges to enhance information flow, but its impact on DFL has been overlooked despite its potential.
Method: 1) Establish convergence of DFL in sparse networks under user mobility, showing that random movement of a fraction of users boosts performance. 2) Propose DFL framework utilizing mobile users with induced mobility patterns, allowing them to exploit knowledge of data distribution to determine trajectories for enhanced information propagation.
Result: Theoretical demonstration that even random movement of a fraction of users significantly boosts DFL performance. Extensive experiments empirically confirm theoretical findings, validate superiority over baselines, and provide comprehensive analysis of how various network parameters influence DFL performance in mobile networks.
Conclusion: User mobility can significantly improve DFL performance in sparse networks. The proposed framework with induced mobility patterns effectively enhances information propagation, addressing connectivity and data heterogeneity challenges in decentralized federated learning.
Abstract: Decentralized Federated Learning (DFL) has emerged as a privacy-preserving machine learning paradigm that enables collaborative training among users without relying on a central server. However, its performance often degrades significantly due to limited connectivity and data heterogeneity. As we move toward the next generation of wireless networks, mobility is increasingly embedded in many real-world applications. User mobility, whether natural or induced, enables clients to act as relays or bridges, thus enhancing information flow in sparse networks; however, its impact on DFL has been largely overlooked despite its potential. In this work, we systematically investigate the role of mobility in improving DFL performance. We first establish the convergence of DFL in sparse networks under user mobility and theoretically demonstrate that even random movement of a fraction of users can significantly boost performance. Building upon this insight, we propose a DFL framework that utilizes mobile users with induced mobility patterns, allowing them to exploit the knowledge of data distribution to determine their trajectories to enhance information propagation through the network. Through extensive experiments, we empirically confirm our theoretical findings, validate the superiority of our approach over baselines, and provide a comprehensive analysis of how various network parameters influence DFL performance in mobile networks.
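The mixing primitive DFL builds on is neighborhood averaging over the current communication graph, and mobility simply changes that graph between rounds. A toy numpy sketch of one mixing round using Metropolis weights (a standard doubly-stochastic choice; the paper's mixing rule is not specified here):

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly-stochastic mixing matrix from a 0/1 adjacency (no self-loops)."""
    n, deg = adj.shape[0], adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] and i != j:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

n, d = 5, 3
params = np.random.randn(n, d)      # one model parameter vector per client
adj = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # sparse path graph
# One round: local SGD step (omitted), then consensus mixing. When users move,
# adj -- and hence W -- changes, carrying information across sparse regions.
params = metropolis_weights(adj) @ params
```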
[426] Nested Learning: The Illusion of Deep Learning Architectures
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
Main category: cs.LG
TL;DR: Nested Learning (NL) is a new paradigm representing ML models as nested optimization problems, enabling higher-order in-context learning and continual learning capabilities through expressive optimizers, self-modifying modules, and continuum memory systems.
Details
Motivation: Current language models lack fundamental capabilities for continual learning, self-improvement, and effective solution finding. The paper aims to address these limitations by proposing a new learning paradigm that can enable models to learn how to learn and modify themselves.
Method: Proposes Nested Learning (NL) paradigm representing models as nested/multi-level optimization problems. Presents three core contributions: 1) Expressive optimizers as associative memory modules, 2) Self-modifying learning modules that learn their own update algorithms, and 3) Continuum memory system generalizing traditional memory concepts. Combines these into the “Hope” continual learning module.
Result: The Hope continual learning module shows promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks, demonstrating the practical effectiveness of the NL paradigm.
Conclusion: Nested Learning provides a coherent framework for designing more expressive learning algorithms with higher-order in-context learning capabilities, potentially unlocking effective continual learning and self-improvement in machine learning models.
Abstract: Despite recent progress, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients’ information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL’s insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.
[427] Many Minds from One Model: Bayesian Transformers for Population Intelligence
Diji Yang, Yi Zhang
Main category: cs.LG
TL;DR: Population Bayesian Transformers (B-Trans) convert standard LLMs into Bayesian models that can sample diverse model instances from a single pre-trained weight set, enabling population-level decision-making for better performance and diversity.
Details
Motivation: Current transformers are trained as single deterministic systems, but intelligence emerges from many minds. The authors want to create models that can represent diverse hypotheses about data while maintaining general competence.
Method: B-Trans treats bias-like offsets in normalization layers as stochastic variables with Gaussian variational approximation, creating a Bayesian posterior proxy. Samples from this proxy yield diverse model instances. To maintain coherence, sampled noise is frozen at sequence level for temporal consistency.
Result: Experiments show B-Trans enhances exploration through population-level decision-making, achieving superior semantic diversity and better task performance in zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels compared to deterministic baselines.
Conclusion: B-Trans successfully transforms standard LLMs into Bayesian transformers that leverage the “wisdom of crowds” through population sampling, demonstrating that diverse model instances from a single weight set can improve both diversity and performance.
Abstract: Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.
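The mechanism itself, a Gaussian offset on normalization biases sampled once and held fixed across the sequence, is compact to sketch. A minimal sketch of the idea as described, with illustrative names, not the authors' implementation:

```python
import torch
import torch.nn as nn

class StochasticBiasLayerNorm(nn.Module):
    """LayerNorm whose bias receives a Gaussian offset sampled once per
    sequence and frozen across tokens, so each draw behaves as one coherent
    'individual' from the same pre-trained weights."""
    def __init__(self, dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.log_sigma = nn.Parameter(torch.zeros(dim))  # variational scale
        self._eps = None                                 # frozen per-sequence noise

    def resample(self):
        self._eps = None                                 # draw a new individual

    def forward(self, x):
        if self._eps is None:
            self._eps = torch.randn_like(self.log_sigma)
        return self.ln(x) + self._eps * self.log_sigma.exp()
```

Population-level decision-making then amounts to calling resample() several times and aggregating the resulting generations.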
[428] Causal Discovery with Mixed Latent Confounding via Precision Decomposition
Amir Asiaee, Samhita Pal, James O’quinn, James P. Long
Main category: cs.LG
TL;DR: DCL-DECOR: A precision-led pipeline for causal discovery in linear Gaussian systems with mixed latent confounding, separating pervasive global confounders from local dependencies to improve directed edge recovery.
Details
Motivation: Existing causal discovery methods struggle with mixed latent confounding where some unobserved factors affect many variables (global) while others affect only small subsets (local). Differentiable DAG learners misinterpret global latent effects as causal edges, while latent-variable graphical models only recover undirected structure.
Method: A modular pipeline that: 1) isolates pervasive latent effects by decomposing the observed precision matrix into structured and low-rank components, 2) applies correlated-noise DAG learning to the deconfounded representation to recover directed edges while modeling remaining structured error correlations, 3) performs a reconciliation step to enforce bow-freeness.
Result: Synthetic experiments show consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to confounded data, particularly when varying the strength and dimensionality of pervasive confounding.
Conclusion: DCL-DECOR effectively addresses mixed latent confounding by separating global and local confounding effects, providing a practical solution for causal discovery in real-world settings where both types of confounding commonly occur.
Abstract: We study causal discovery from observational data in linear Gaussian systems affected by mixed latent confounding, where some unobserved factors act broadly across many variables while others influence only small subsets. This setting is common in practice and poses a challenge for existing methods: differentiable and score-based DAG learners can misinterpret global latent effects as causal edges, while latent-variable graphical models recover only undirected structure. We propose DCL-DECOR, a modular, precision-led pipeline that separates these roles. The method first isolates pervasive latent effects by decomposing the observed precision matrix into a structured component and a low-rank component. The structured component corresponds to the conditional distribution after accounting for pervasive confounders and retains only local dependence induced by the causal graph and localized confounding. A correlated-noise DAG learner is then applied to this deconfounded representation to recover directed edges while modeling remaining structured error correlations, followed by a simple reconciliation step to enforce bow-freeness. We provide identifiability results that characterize the recoverable causal target under mixed confounding and show how the overall problem reduces to well-studied subproblems with modular guarantees. Synthetic experiments that vary the strength and dimensionality of pervasive confounding demonstrate consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to the confounded data.
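The linear-algebra backbone of the first stage is classical: marginalizing latent confounders out of a joint Gaussian leaves an observed precision equal to a structured block minus a low-rank term whose rank is bounded by the number of pervasive confounders (a Schur complement). A numpy check of that identity, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
p, h = 6, 1                                   # observed variables, pervasive latents
A = 0.1 * rng.normal(size=(p + h, p + h))
theta = A @ A.T + np.eye(p + h)               # a valid joint precision matrix
Too, Toh, Thh = theta[:p, :p], theta[:p, p:], theta[p:, p:]

# Marginal precision of the observed block = structured part minus a PSD
# low-rank part (rank <= number of latent confounders):
low_rank = Toh @ np.linalg.inv(Thh) @ Toh.T
theta_obs = Too - low_rank
cov_obs = np.linalg.inv(theta)[:p, :p]        # observed marginal covariance
assert np.allclose(np.linalg.inv(cov_obs), theta_obs)
```

The estimation problem is recovering that split from data alone; the pipeline above does so before handing the structured part to the DAG learner.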
[429] Scaling Open-Ended Reasoning to Predict the Future
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Main category: cs.LG
TL;DR: Researchers train language models for open-ended forecasting using automated news-based question synthesis, creating OpenForesight dataset and OpenForecaster 8B model that matches larger proprietary models in accuracy and calibration.
Details
Motivation: High-stakes decision making requires reasoning under uncertainty about future events, but current language models lack specialized training for open-ended forecasting tasks. The researchers aim to create accessible forecasting capabilities by developing specialized models trained on news-derived forecasting questions.
Method: 1. Automatically synthesize forecasting questions from daily global news events using careful curation recipe; 2. Train Qwen3 thinking models on OpenForesight dataset; 3. Use offline news corpus to prevent future information leakage; 4. Implement retrieval mechanisms and improved reward function for RL; 5. Evaluate on held-out test set from May-August 2025.
Result: OpenForecaster 8B matches performance of much larger proprietary models, with improved accuracy, calibration, and consistency. Calibration improvements from forecasting training generalize across popular benchmarks. All models, code, and data are open-sourced.
Conclusion: Specialized forecasting training on news-derived questions enables language models to achieve competitive forecasting performance while maintaining calibration. The open-source approach makes forecasting research broadly accessible for high-stakes decision making applications.
Abstract: High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
[430] BandiK: Efficient Multi-Task Decomposition Using a Multi-Bandit Framework
András Millinghoffer, András Formanek, András Antos, Péter Antal
Main category: cs.LG
TL;DR: BandiK is a three-stage multi-task auxiliary task selection method using multi-bandits that efficiently identifies beneficial auxiliary task sets for each target task by estimating pairwise transfers, constructing linear candidate sets, and employing a novel multi-bandit framework.
Details
Motivation: The paper addresses the challenge of knowledge transfer in multi-task learning and foundation models, where negative transfer remains a significant obstacle. Current methods face high computational costs for evaluating auxiliary task sets, exponential candidate sets, and varying selection complexity across tasks.
Method: BandiK uses a three-stage approach: 1) Estimates pairwise transfers between tasks to identify beneficial joint learning opportunities, 2) Constructs linear candidate auxiliary sets (instead of exponential) for each target task, 3) Employs a Multi-Armed Bandit framework where arms represent candidate auxiliary sets realized as multiple output neural networks. The method integrates individual task-specific MABs into a multi-bandit structure that exploits semi-overlapping arms across bandits.
Result: The method significantly reduces the computational cost of evaluating auxiliary task sets by constructing linear rather than exponential candidate sets and using an efficient multi-bandit framework that leverages shared neural network realizations across different bandits.
Conclusion: BandiK provides an efficient solution to the auxiliary task selection problem in multi-task learning by combining pairwise transfer estimation, linear candidate set construction, and a novel multi-bandit framework that addresses computational constraints while improving knowledge transfer across tasks.
Abstract: The challenge of effectively transferring knowledge across multiple tasks is of critical importance and is also present in downstream tasks with foundation models. However, the transitive-intransitive nature of transfer is still an open problem, and negative transfer remains a significant obstacle. Selection of beneficial auxiliary task sets in multi-task learning is frequently hindered by the high computational cost of their evaluation, the high number of plausible candidate auxiliary sets, and the varying complexity of selection across target tasks. To address these constraints, we introduce BandiK, a novel three-stage multi-task auxiliary task subset selection method using multi-bandits, where each arm pull evaluates candidate auxiliary sets by training and testing a multiple output neural network on a single random train-test dataset split. Firstly, BandiK estimates the pairwise transfers between tasks, which helps in identifying which tasks are likely to benefit from joint learning. In the second stage, it constructs a linear number of candidate sets of auxiliary tasks (in the number of all tasks) for each target task based on the initial estimations, significantly reducing the exponential number of potential auxiliary task sets. Thirdly, it employs a Multi-Armed Bandit (MAB) framework for each task, where the arms correspond to the performance of candidate auxiliary sets realized as multiple output neural networks over train-test data set splits. To enhance efficiency, BandiK integrates these individual task-specific MABs into a multi-bandit structure. The proposed multi-bandit solution exploits that the same neural network realizes multiple arms of different individual bandits corresponding to a given candidate set. This semi-overlapping arm property defines a novel multi-bandit cost/reward structure utilized in BandiK.
[431] FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference
Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu
Main category: cs.LG
TL;DR: A framework combining N:M structured pruning and 4-bit quantization to compress LLMs, with hardware-software co-design for efficient inference on CPUs, GPUs, and custom FPGA accelerators.
Details
Motivation: LLMs have high computation and memory requirements that hinder deployment in resource-constrained environments, necessitating efficient compression and acceleration techniques.
Method: Unified pipeline applying N:M structured pruning and 4-bit integer quantization, with optimized dequantization and matrix multiplication for multiple hardware platforms (CPUs, GPUs with Dense/Sparse Tensor Cores, custom FPGA accelerator).
Result: Achieves 4× weight storage reduction, 1.71× matrix multiplication speedup, 1.29× end-to-end latency reduction vs dense GPU baselines; 1.36× throughput per token improvement on LLaMA-7B.
Conclusion: Fine-grained N:M sparsity and quantization synergize for efficient LLM inference; FPGA accelerator provides flexible architectural support for diverse sparsity patterns beyond fixed hardware constraints.
Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes their deployment in resource-constrained environments. To address this challenge, this work introduces an automation framework that leverages weight pruning and low-bit quantization, and presents a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform. In particular, we implement a unified pipeline that applies N:M structured pruning and 4-bit integer quantization to reduce the memory footprint, followed by optimized dequantization and matrix multiplication to enhance LLM inference on several hardware platforms, including CPUs, NVIDIA GPUs with Dense and 2:4 Sparse Tensor Cores, and a custom systolic-array-based FPGA accelerator. Utilizing 2:4 sparsity combined with quantization on $4096 \times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines. Scaling analysis on the LLaMA-7B model further shows that structured sparsity enhances the throughput per token by $1.36\times$. These results demonstrate the synergy of fine-grained N:M sparsity and quantization for enabling efficient and deployable LLM inference, while the proposed FPGA accelerator offers a flexible architectural path for supporting a broader class of sparsity patterns beyond the fixed 2:4 hardware constraints.
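N:M structured pruning has a one-line definition: in every group of M consecutive weights, keep only the N largest in magnitude. A numpy sketch of mask generation for the 2:4 pattern used above (the paper pairs this with 4-bit quantization and hardware-aware kernels, which the sketch omits):

```python
import numpy as np

def nm_prune_mask(w, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights along the last axis (N:M structured sparsity)."""
    groups = np.abs(w).reshape(-1, m)
    drop = np.argsort(groups, axis=1)[:, : m - n]     # smallest |w| per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return mask.reshape(w.shape)

w = np.random.randn(4096, 4096).astype(np.float32)
w_sparse = w * nm_prune_mask(w)    # 2:4 pattern accepted by Sparse Tensor Cores
```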
[432] From Trial to Deployment: A SEM Analysis of Traveler Adoptions to Fully Operational Autonomous Taxis
Yutong Cai, Hua Wang
Main category: cs.LG
TL;DR: Study examines real-world autonomous taxi adoption using survey data from actual users of Baidu’s Apollo Robotaxi in Wuhan, China, identifying key psychological factors influencing adoption behavior.
Details
Motivation: Most existing research on autonomous taxi acceptance relies on hypothetical scenarios and stated preferences, lacking investigation of actual user behavior based on operational autonomous vehicle services.
Method: Surveyed 336 actual users of Baidu’s Apollo Robotaxi in Wuhan, designed realistic survey with actual service attributes, used Structural Equation Modeling to analyze six latent psychological constructs influencing adoption behavior measured by selection frequency in ten scenarios.
Result: Cost Sensitivity and Behavioral Intention are strongest positive predictors of autonomous taxi adoption, while Trust & Policy Support, Performance, Lifestyle, and Education play more nuanced roles. Model shows strong goodness-of-fit across multiple indices.
Conclusion: Findings provide empirical evidence to support policymaking, fare design, and public outreach strategies for scaling autonomous taxi deployments in real-world urban settings.
Abstract: Autonomous taxi services represent a transformative advancement in urban mobility, offering safety, efficiency, and round-the-clock operations. While existing literature has explored user acceptance of autonomous taxis through stated preference experiments and hypothetical scenarios, few studies have investigated actual user behavior based on operational AV services. This study addresses that gap by leveraging survey data from Wuhan, China, where Baidu’s Apollo Robotaxi service operates at scale. We design a realistic survey incorporating actual service attributes and collect 336 valid responses from actual users. Using Structural Equation Modeling, we identify six latent psychological constructs, namely Trust & Policy Support, Cost Sensitivity, Performance, Behavioral Intention, Lifestyle, and Education. Their influences on adoption behavior, measured by the selection frequency of autonomous taxis in ten scenarios, are examined and interpreted. Results show that Cost Sensitivity and Behavioral Intention are the strongest positive predictors of adoption, while other latent constructs play more nuanced roles. The model demonstrates strong goodness-of-fit across multiple indices. Our findings offer empirical evidence to support policymaking, fare design, and public outreach strategies for scaling autonomous taxi deployments in real-world urban settings.
[433] Gradient Descent as Implicit EM in Distance-Based Neural Models
Alan Oursland
Main category: cs.LG
TL;DR: Gradient descent on log-sum-exp objectives implicitly performs expectation-maximization, making optimization and inference the same process across unsupervised learning, attention, and classification.
Details
Motivation: To provide a direct mathematical derivation explaining why neural networks trained with standard objectives exhibit probabilistic inference behaviors (soft clustering, prototype specialization, Bayesian uncertainty tracking) across various architectures, rather than relying on loose analogies or post-hoc interpretations.
Method: Mathematical derivation showing that for any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance equals the negative posterior responsibility of the corresponding component: ∂L/∂d_j = -r_j. This algebraic identity reveals that gradient descent implicitly performs expectation-maximization.
Result: Demonstrates that gradient descent on log-sum-exp objectives inherently performs probabilistic inference, unifying unsupervised mixture modeling, attention mechanisms, and cross-entropy classification under a single mechanism. The Bayesian structure in transformers is shown to be a necessary consequence of objective geometry rather than an emergent property.
Conclusion: Optimization and inference are fundamentally the same process in neural networks trained with log-sum-exp objectives. The probabilistic behaviors observed in trained models are not emergent properties but necessary mathematical consequences of the objective function’s geometry.
Abstract: Neural networks trained with standard objectives exhibit behaviors characteristic of probabilistic inference: soft clustering, prototype specialization, and Bayesian uncertainty tracking. These phenomena appear across architectures – in attention mechanisms, classification heads, and energy-based models – yet existing explanations rely on loose analogies to mixture models or post-hoc architectural interpretation. We provide a direct derivation. For any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance is exactly the negative posterior responsibility of the corresponding component: $\partial L / \partial d_j = -r_j$. This is an algebraic identity, not an approximation. The immediate consequence is that gradient descent on such objectives performs expectation-maximization implicitly – responsibilities are not auxiliary variables to be computed but gradients to be applied. No explicit inference algorithm is required because inference is embedded in optimization. This result unifies three regimes of learning under a single mechanism: unsupervised mixture modeling, where responsibilities are fully latent; attention, where responsibilities are conditioned on queries; and cross-entropy classification, where supervision clamps responsibilities to targets. The Bayesian structure recently observed in trained transformers is not an emergent property but a necessary consequence of the objective geometry. Optimization and inference are the same process.
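The claimed identity is easy to verify numerically: with L(d) = log Σ_j exp(−d_j), the derivative with respect to each distance is minus that component's softmax responsibility. A finite-difference check in numpy:

```python
import numpy as np

def L(d):
    return np.log(np.exp(-d).sum())        # log-sum-exp over negative distances

def responsibilities(d):
    e = np.exp(-d)
    return e / e.sum()                     # posterior responsibility r_j

d = np.array([0.3, 1.7, 0.9, 2.4])
eps = 1e-6
grad = np.array([(L(d + eps * e_j) - L(d - eps * e_j)) / (2 * eps)
                 for e_j in np.eye(len(d))])
assert np.allclose(grad, -responsibilities(d), atol=1e-6)   # dL/dd_j = -r_j
```

Minimizing a negative log-likelihood built from this structure therefore pushes each distance with a force proportional to its responsibility, which is the implicit E-step the paper describes.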
[434] Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks
Shota Suzuki, Satoshi Ono
Main category: cs.LG
TL;DR: Proposes a self-supervised learning method for neural architecture search of multimodal DNNs that works with unlabeled data.
Details
Motivation: Multimodal DNNs benefit from NAS but require substantial labeled data for architecture search, creating a data bottleneck.
Method: Uses self-supervised learning comprehensively for both architecture search and model pretraining processes, enabling NAS from unlabeled data.
Result: Experimental results show the method successfully designs architectures for DNNs from unlabeled training data.
Conclusion: SSL enables effective neural architecture search for multimodal DNNs without requiring labeled data, addressing the data bottleneck problem.
Abstract: Neural architecture search (NAS), which automates the architectural design process of deep neural networks (DNN), has attracted increasing attention. Multimodal DNNs that necessitate feature fusion from multiple modalities benefit from NAS due to their structural complexity; however, constructing an architecture for multimodal DNNs through NAS requires a substantial amount of labeled training data. Thus, this paper proposes a self-supervised learning (SSL) method for architecture search of multimodal DNNs. The proposed method applies SSL comprehensively for both the architecture search and model pretraining processes. Experimental results demonstrated that the proposed method successfully designed architectures for DNNs from unlabeled training data.
[435] DTI-GP: Bayesian operations for drug-target interactions using deep kernel Gaussian processes
Bence Bolgár, András Millinghoffer, Péter Antal
Main category: cs.LG
TL;DR: DTI-GP: A deep kernel learning Gaussian process model for drug-target interaction prediction with Bayesian uncertainty quantification, enabling rejection schemes, top-K selection, and ranking operations.
Details
Motivation: Need for precise probabilistic information in DTI predictions to understand limitations and boost performance. Current methods lack proper uncertainty quantification needed for reliable decision-making in drug discovery.
Method: Deep kernel learning-based GP architecture with neural embedding module for compounds and proteins, combined with GP module for Bayesian inference. Uses predictive distribution sampling to estimate Bayesian precedence matrix for selection/ranking operations.
Result: DTI-GP outperforms state-of-the-art solutions. Enables construction of Bayesian accuracy-confidence enrichment score, rejection schemes for improved enrichment, and estimation/search for top-K selections with high expected utility.
Conclusion: Gaussian processes provide scalable framework for integrating DTI representations with Bayesian inference, enabling novel operations like Bayesian classification with rejection, top-K selection, and ranking for improved drug discovery workflows.
Abstract: Precise probabilistic information about drug-target interaction (DTI) predictions is vital for understanding limitations and boosting predictive performance. Gaussian processes (GP) offer a scalable framework to integrate state-of-the-art DTI representations and Bayesian inference, enabling novel operations, such as Bayesian classification with rejection, top-$K$ selection, and ranking. We propose a deep kernel learning-based GP architecture (DTI-GP), which incorporates a combined neural embedding module for chemical compounds and protein targets, and a GP module. The workflow continues with sampling from the predictive distribution to estimate a Bayesian precedence matrix, which is used in fast and accurate selection and ranking operations. DTI-GP outperforms state-of-the-art solutions, and it allows (1) the construction of a Bayesian accuracy-confidence enrichment score, (2) rejection schemes for improved enrichment, and (3) estimation and search for top-$K$ selections and ranking with high expected utility.
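The precedence-matrix construction is the most mechanical part and sketches cleanly: draw S samples of candidate scores from the predictive distribution, then count how often each candidate outranks each other. In numpy (the scores below are simulated; in DTI-GP they come from the GP posterior):

```python
import numpy as np

rng = np.random.default_rng(1)
S, K = 1000, 8                      # posterior draws, candidate drug-target pairs
scores = rng.normal(loc=rng.normal(size=K), scale=0.5, size=(S, K))

# P[i, j] = estimated posterior probability that candidate i outranks candidate j
P = (scores[:, :, None] > scores[:, None, :]).mean(axis=0)

top3 = np.argsort(-P.mean(axis=1))[:3]   # rank by average precedence, take top-K
```

Selection and ranking operations then reduce to fast lookups in P rather than repeated posterior computation.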
[436] Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du
Main category: cs.LG
TL;DR: OMWU achieves last-iterate linear convergence to Nash equilibrium in NLHF without requiring NE uniqueness, with exponentially better dependence on instance-dependent constants.
Details
Motivation: Standard preference modeling assumes transitivity, overlooking complex human preferences. NLHF addresses non-transitive preferences via game theory, but existing algorithms use regularization causing bias in duality gap computation.
Method: Use Optimistic Multiplicative Weights Update (OMWU) in Nash learning from human feedback framework, analyzing convergence without regularization or NE uniqueness assumption.
Result: OMWU achieves last-iterate linear convergence after burn-in phase when full-support NE exists, with instance-dependent linear convergence rate to original NE measured by duality gaps.
Conclusion: OMWU provides superior theoretical guarantees for NLHF alignment without regularization bias, with practical potential for LLM applications as demonstrated in experiments.
Abstract: Aligning large language models (LLMs) with human preferences has proven effective for enhancing model capabilities, yet standard preference modeling using the Bradley-Terry model assumes transitivity, overlooking the inherent complexity of human population preferences. Nash learning from human feedback (NLHF) addresses this by framing non-transitive preferences as a two-player zero-sum game, where alignment reduces to finding the Nash equilibrium (NE). However, existing algorithms typically rely on regularization, incurring unavoidable bias when computing the duality gap in the original game. In this work, we provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF, showing that it achieves last-iterate linear convergence after a burn-in phase whenever an NE with full support exists, with an instance-dependent linear convergence rate to the original NE, measured by duality gaps. Compared to prior results in Wei et al. (2020), we do not require the assumption of NE uniqueness. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values, enabling exponentially better dependence on instance-dependent constants than prior results. Experiments corroborate the theoretical strengths of $\mathtt{OMWU}$ in both tabular and neural policy classes, demonstrating its potential for LLM applications.
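For readers unfamiliar with the update, here is a minimal OMWU loop for a zero-sum matrix game, in the common two-sequence form that reuses the last gradient as the optimistic prediction (step size and game chosen for illustration; this is not the paper's tabular setup):

```python
import numpy as np

def omwu(A, eta=0.1, T=5000):
    """Optimistic MWU for max_x min_y x^T A y over the probability simplices."""
    n, m = A.shape
    x = xh = np.ones(n) / n                   # played iterate, secondary sequence
    y = yh = np.ones(m) / m
    for _ in range(T):
        gx, gy = A @ y, A.T @ x               # gradients at the played iterates
        xh = xh * np.exp(eta * gx);  xh /= xh.sum()
        yh = yh * np.exp(-eta * gy); yh /= yh.sum()
        x = xh * np.exp(eta * gx);   x /= x.sum()   # optimistic half-step
        y = yh * np.exp(-eta * gy);  y /= y.sum()
    return x, y

A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])  # rock-paper-scissors
x, y = omwu(A)   # last iterates approach the full-support NE (1/3, 1/3, 1/3)
```

On this cyclic game, plain MWU orbits the equilibrium while the optimistic correction makes the last iterate converge, which is the behavior the paper's analysis quantifies.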
[437] Discovering Coordinated Joint Options via Inter-Agent Relative Dynamics
Raul D. Steleac, Mohan Sridharan, David Abel
Main category: cs.LG
TL;DR: Proposes a novel multi-agent option discovery method using joint-state abstraction and neural graph Laplacian to discover strongly coordinated behaviors by capturing state synchronization patterns between agents.
Details
Motivation: Multi-agent settings face exponential growth of joint state space, making coordinated behaviors valuable but challenging to design. Existing methods sacrifice coordination by producing loosely coupled or independent behaviors.
Method: 1) Proposes joint-state abstraction that compresses state space while preserving coordination information; 2) Uses inductive bias that synchronization over agent states provides foundation for coordination; 3) Approximates Fermat state (maximal alignment with team) to define spreadness measure; 4) Employs neural graph Laplacian estimator to derive options capturing state synchronization patterns.
Result: Evaluated across multiple scenarios in two multi-agent domains, showing that the discovered options yield stronger downstream coordination capabilities compared to alternative option discovery methods.
Conclusion: The approach successfully addresses limitations of existing multi-agent option discovery methods by producing strongly coordinated behaviors through state synchronization patterns, overcoming the exponential complexity of joint state spaces.
Abstract: Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the Fermat state, and use it to define a measure of spreadness, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
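The "Fermat state" reads as the point minimizing total distance to all agents' states, i.e., the geometric median, which Weiszfeld's iteration approximates; per-dimension deviation from it then gives a spreadness measure. A sketch under that reading (the paper's exact construction may differ):

```python
import numpy as np

def fermat_state(states, iters=100):
    """Geometric median of agent states via Weiszfeld's iteration."""
    z = states.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(states - z, axis=1), 1e-9)
        w = 1.0 / d
        z = (w[:, None] * states).sum(axis=0) / w.sum()
    return z

states = np.random.randn(4, 2)                 # 4 agents with 2-D states
z = fermat_state(states)
spreadness = np.abs(states - z).mean(axis=0)   # per-dimension team misalignment
```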
[438] AODDiff: Probabilistic Reconstruction of Aerosol Optical Depth via Diffusion-based Bayesian Inference
Linhao Fan, Hongqiang Fang, Jingyang Dai, Yong Jiang, Qixing Zhang
Main category: cs.LG
TL;DR: AODDiff: A diffusion-based probabilistic framework for reconstructing Aerosol Optical Depth fields with uncertainty quantification, using learned spatiotemporal priors from incomplete data and flexible adaptation to various reconstruction tasks.
Details
Motivation: Current AOD reconstruction models are limited by scarce complete training data and lack uncertainty quantification, which is critical for atmospheric monitoring applications.
Method: Proposes AODDiff with two key components: 1) corruption-aware training to learn spatiotemporal AOD priors from naturally incomplete data, and 2) decoupled annealing posterior sampling to integrate heterogeneous observations as constraints for guided generation.
Result: Validated on reanalysis data, AODDiff demonstrates efficacy and robustness across downscaling and inpainting tasks, maintaining high spatial spectral fidelity and enabling uncertainty quantification through multiple sampling.
Conclusion: AODDiff provides a flexible probabilistic framework for AOD reconstruction that addresses data scarcity issues, enables uncertainty quantification, and can be adapted to various reconstruction tasks without retraining.
Abstract: High-quality reconstruction of Aerosol Optical Depth (AOD) fields is critical for atmospheric monitoring, yet current models remain constrained by the scarcity of complete training data and a lack of uncertainty quantification. To address these limitations, we propose AODDiff, a probabilistic reconstruction framework based on diffusion-based Bayesian inference. By leveraging the learned spatiotemporal probability distribution of the AOD field as a generative prior, this framework can be flexibly adapted to various reconstruction tasks without requiring task-specific retraining. We first introduce a corruption-aware training strategy to learn a spatiotemporal AOD prior solely from naturally incomplete data. Subsequently, we employ a decoupled annealing posterior sampling strategy that enables more effective integration of heterogeneous observations as constraints to guide the generation process. We validate the proposed framework through extensive experiments on reanalysis data. Results across downscaling and inpainting tasks confirm the efficacy and robustness of AODDiff, specifically demonstrating its advantage in maintaining high spatial spectral fidelity. Furthermore, as a generative model, AODDiff inherently enables uncertainty quantification via multiple sampling, offering critical confidence metrics for downstream applications.
[439] Characterization of Transfer Using Multi-task Learning Curves
András Millinghoffer, Bence Bolgár, Péter Antal
Main category: cs.LG
TL;DR: The paper proposes using multi-task learning curves over varying sample sizes to model transfer effects, showing this data perturbation approach is more fundamental than model gradient updates and offers better power with lower compute costs.
Details
Motivation: Current approaches to studying transfer effects focus on perturbing models through gradient updates during training. The authors hypothesize that perturbing the dataset by varying sample sizes provides a more fundamental characterization of transfer effects in both training and inductive inference.
Method: Developed multi-task learning curves to approximate inductive performance over varying sample sizes. Created an efficient method analogous to Task Affinity Grouping to approximate these curves. Compared statistical (data perturbation) vs computational (model perturbation) approaches to transfer.
Result: Learning curves better capture multi-task learning effects than previous methods. Multi-task extensions of learning curves can delineate pairwise and contextual transfer effects in foundation models. Statistical approach has lower compute costs but better power and broader applicability than computational approach.
Conclusion: Modeling transfer effects through data perturbation via multi-task learning curves provides a more fundamental characterization than model perturbation approaches, offering better analytical power with reduced computational costs for understanding transfer in foundation models.
Abstract: Transfer effects manifest themselves both during training using a fixed data set and in inductive inference using accumulating data. We hypothesize that perturbing the data set by including more samples, instead of perturbing the model by gradient updates, provides a complementary and more fundamental characterization of transfer effects. To capture this phenomenon, we quantitatively model transfer effects using multi-task learning curves approximating the inductive performance over varying sample sizes. We describe an efficient method to approximate multi-task learning curves analogous to the Task Affinity Grouping method applied during training. We compare the statistical and computational approaches to transfer; the comparison indicates considerably higher compute costs for the computational approach, while the statistical approach offers better power and broader applicability. Evaluations are performed using a benchmark drug-target interaction data set. Our results show that learning curves can better capture the effects of multi-task learning and their multi-task extensions can delineate pairwise and contextual transfer effects in foundation models.
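A learning curve here is just inductive error as a function of training-set size, and the usual way to summarize one is a parametric fit. A scipy sketch with a power-law form (a common convention, not necessarily the paper's; the data points are made up):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c      # error(n) = a * n^(-b) + c

# (training-set size, validation error) pairs from retraining at several splits
n = np.array([100, 200, 400, 800, 1600, 3200])
err = np.array([0.41, 0.34, 0.29, 0.26, 0.24, 0.23])
(a, b, c), _ = curve_fit(power_law, n, err, p0=(1.0, 0.5, 0.2))

# Refitting with an auxiliary task's data added and comparing curves at matched
# n quantifies transfer: a consistent downward shift indicates positive transfer.
```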
[440] PRISM: A hierarchical multiscale approach for time series forecasting
Zihao Chen, Alexandre Andre, Wenrui Ma, Ian Knight, Sergey Shuvaev, Eva Dyer
Main category: cs.LG
TL;DR: PRISM introduces a hierarchical tree-based partitioning method for time series forecasting that captures multi-scale features through time-frequency analysis, outperforming state-of-the-art methods.
Details
Motivation: Real-world time series contain complex multi-scale patterns (global trends, local structure, intermediate features) that make accurate forecasting challenging. Existing methods struggle to capture this hierarchical complexity effectively.
Method: PRISM uses learnable tree-based partitioning where the root captures global trends and recursive splits reveal localized views. At each level, data is projected onto time-frequency bases (wavelets/exponential moving averages) to extract scale-specific features, which are aggregated across the hierarchy.
Result: Experiments across benchmark datasets show PRISM outperforms state-of-the-art forecasting methods, demonstrating superior accuracy in capturing both global structure and local dynamics.
Conclusion: The hierarchical tree-based approach provides a lightweight and flexible framework for multivariate time series forecasting that effectively handles multi-scale patterns in real-world data.
Abstract: Forecasting is critical in areas such as finance, biology, and healthcare. Despite the progress in the field, making accurate forecasts remains challenging because real-world time series contain global trends, local fine-grained structure, and features on multiple scales in between. Here, we present a new forecasting method, PRISM (Partitioned Representation for Iterative Sequence Modeling), that addresses this challenge through a learnable tree-based partitioning of the signal. At the root of the tree, a global representation captures coarse trends in the signal, while recursive splits reveal increasingly localized views of the signal. At each level of the tree, data are projected onto a time-frequency basis (e.g., wavelets or exponential moving averages) to extract scale-specific features, which are then aggregated across the hierarchy. This design allows the model to jointly capture global structure and local dynamics of the signal, enabling accurate forecasting. Experiments across benchmark datasets show that our method outperforms state-of-the-art methods for forecasting. Overall, these results demonstrate that our hierarchical approach provides a lightweight and flexible framework for forecasting multivariate time series. The code is available at https://github.com/nerdslab/prism.
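The hierarchical idea (the root summarizes the whole window coarsely; deeper nodes summarize halves ever more locally, each with a time-frequency feature such as an exponential moving average) can be caricatured recursively. A simplified sketch with fixed midpoint splits and a single EMA feature; PRISM's partitioning is learnable:

```python
import numpy as np

def ema_last(x, alpha=0.3):
    """Final value of an exponential moving average over segment x."""
    s = x[0]
    for v in x[1:]:
        s = alpha * v + (1 - alpha) * s
    return s

def tree_features(x, depth):
    """Recursive binary partition: one scale-specific feature per node,
    coarse at the root and increasingly localized below."""
    feats = [ema_last(x)]
    if depth > 0 and len(x) >= 2:
        mid = len(x) // 2
        feats += tree_features(x[:mid], depth - 1)
        feats += tree_features(x[mid:], depth - 1)
    return feats

signal = np.sin(np.linspace(0, 20, 256)) + 0.1 * np.random.randn(256)
features = tree_features(signal, depth=3)     # 1 + 2 + 4 + 8 = 15 features
```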
[441] Spectral Graph Neural Networks for Cognitive Task Classification in fMRI Connectomes
Debasis Maji, Arghya Banerjee, Debaditya Barman
Main category: cs.LG
TL;DR: SpectralBrainGNN uses spectral graph neural networks with graph Fourier transforms to classify cognitive tasks from fMRI brain connectivity data, achieving 96.25% accuracy on HCP-Task dataset.
Details
Motivation: Current approaches for cognitive task classification from neuroimaging data often miss complex topological dependencies and multi-scale interactions in brain connectivity patterns. There's a need for better methods that can capture these intricate relationships in functional brain networks.
Method: Proposed SpectralBrainGNN model uses spectral convolution framework based on graph Fourier transforms computed via normalized Laplacian eigendecomposition. Models brain regions as nodes and functional connections as edges in graph neural networks to capture topological dependencies.
Result: Achieved 96.25% classification accuracy on Human Connectome Project-Task (HCP-Task) dataset, demonstrating superior performance in decoding cognitive states from fMRI connectomes.
Conclusion: SpectralBrainGNN effectively captures complex brain connectivity patterns for cognitive task classification, outperforming conventional approaches. The publicly available implementation supports reproducibility and future research in brain network analysis.
Abstract: Cognitive task classification using machine learning plays a central role in decoding brain states from neuroimaging data. By integrating machine learning with brain network analysis, complex connectivity patterns can be extracted from functional magnetic resonance imaging connectomes. This process transforms raw blood-oxygen-level-dependent (BOLD) signals into interpretable representations of cognitive processes. Graph neural networks (GNNs) further advance this paradigm by modeling brain regions as nodes and functional connections as edges, capturing topological dependencies and multi-scale interactions that are often missed by conventional approaches. We propose the SpectralBrainGNN model, a spectral convolution framework based on graph Fourier transforms (GFT) computed via normalized Laplacian eigendecomposition. Experiments on the Human Connectome Project-Task (HCP-Task) dataset demonstrate the effectiveness of the proposed approach, achieving a classification accuracy of 96.25%. The implementation is publicly available at https://github.com/gnnplayground/SpectralBrainGNN to support reproducibility and future research.
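The spectral machinery is standard and compact: eigendecompose the normalized graph Laplacian, then project node signals onto its eigenbasis, whose eigenvalues play the role of frequencies. A numpy sketch (input names are illustrative):

```python
import numpy as np

def graph_fourier(adj, signal):
    """Graph Fourier transform via normalized Laplacian eigendecomposition.
    adj: (n, n) symmetric adjacency (e.g., thresholded functional connectivity);
    signal: (n,) node features (e.g., regional BOLD statistics)."""
    deg = np.maximum(adj.sum(axis=1), 1e-12)
    d_inv_sqrt = np.diag(deg ** -0.5)
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    freqs, U = np.linalg.eigh(L)            # eigenvalues act as graph frequencies
    return freqs, U.T @ signal              # spectral coefficients of the signal

# A spectral convolution layer reweights these coefficients with a learnable
# filter g(freqs) and maps back with U.
```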
[442] Frequent subgraph-based persistent homology for graph classification
Xinyang Chen, Amaël Broustet, Guoting Chen
Main category: cs.LG
TL;DR: Proposes Frequent Subgraph Filtration (FSF) for persistent homology on graphs, creating frequency-based persistent homology features, with two classification frameworks: FPH-ML and FPH-GNNs that integrate with graph neural networks.
Details
Motivation: Current persistent homology methods on graphs use limited filtrations (degree/weight-based) that miss richer features like recurring patterns across datasets, restricting expressive power. Need more informative topology-aware features.
Method: Introduces Frequent Subgraph Filtration (FSF) derived from frequent subgraphs, producing stable frequency-based persistent homology (FPH) features. Two approaches: 1) FPH-ML (machine learning model using FPH features), 2) FPH-GNNs (hybrid framework integrating FPH with graph neural networks).
Result: FPH-ML achieves competitive/superior accuracy vs kernel/degree-based filtration methods. FPH-GNNs yield 0.4-21% relative performance gains, with up to 8.2 percentage point improvements over GCN/GIN backbones across benchmarks.
Conclusion: FSF bridges frequent subgraph mining and topological data analysis, offering new perspective on topology-aware feature extraction. The approach enhances graph representation learning by incorporating richer topological information.
Abstract: Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.
[443] MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
Yongwei Zhang, Yuanzhe Xing, Quan Quan, Zhikun She
Main category: cs.LG
TL;DR: MSACL is a model-free RL framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning, achieving provable stability without complex reward engineering.
Details
Motivation: Achieving provable stability in model-free RL is challenging, especially in balancing exploration with rigorous safety. Existing methods often rely on complex reward engineering, and there's a need for frameworks that can ensure stability while maintaining learning efficiency.
Method: MSACL integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. It uses off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions, introduces Exponential Stability Labels (ESL) and a λ-weighted aggregation mechanism to balance bias-variance trade-off, and guides policy optimization with a stability-aware advantage function.
Result: MSACL demonstrates superiority over state-of-the-art Lyapunov-based RL algorithms across six benchmarks (stabilization and nonlinear tracking tasks). It achieves exponential stability and rapid convergence under simple rewards, exhibits significant robustness to uncertainties, and generalizes well to unseen trajectories. Sensitivity analysis establishes multi-step horizon n=20 as a robust default.
Conclusion: MSACL provides a foundation for verifiably safe learning-based control by linking Lyapunov theory with off-policy actor-critic frameworks. The framework enables provable stability without complex reward engineering and offers practical implementation with publicly available source code.
Abstract: Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $\lambda$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
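The λ-weighted aggregation is in the spirit of TD(λ): blend n-step targets with geometrically decaying weights so that short horizons (low variance, higher bias) and long horizons (higher variance, low bias) are traded off by a single knob. A sketch under that reading, not the paper's exact formula:

```python
import numpy as np

def lambda_weighted(step_targets, lam=0.9):
    """Blend n-step targets G^(1..N) with weights (1 - lam) * lam^(n - 1),
    assigning residual mass to the longest horizon (TD(lambda)-style)."""
    N = len(step_targets)
    w = (1 - lam) * lam ** np.arange(N)
    w[-1] += 1.0 - w.sum()                 # tail mass onto the N-step target
    return float(w @ step_targets)

targets = np.array([1.00, 0.92, 0.86, 0.83])   # e.g. multi-step Lyapunov targets
blended = lambda_weighted(targets, lam=0.9)
```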
[444] Semi-overlapping Multi-bandit Best Arm Identification for Sequential Support Network Learning
András Antos, András Millinghoffer, Péter Antal
Main category: cs.LG
TL;DR: SSNL framework unifies sequential partner selection problems; SOMMAB model enables efficient learning of support networks from sparse candidates with shared evaluations; new GapE algorithm with improved exponential error bounds.
Details
Motivation: Many AI/ML problems require evaluating partners' contributions through asymmetric, computationally intensive processes while selecting the most beneficial candidates. Sequential approaches to these problems need a unified framework for efficient partner selection.
Method: Proposes Sequential Support Network Learning (SSNL) framework and semi-overlapping multi-(multi-armed) bandit (SOMMAB) model where single evaluations provide distinct feedback to multiple bandits due to structural overlap. Develops generalized GapE algorithm for SOMMABs.
Result: Derives new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification. Bounds scale linearly with overlap degree, showing significant sample-complexity gains from shared evaluations.
Conclusion: Provides theoretical foundation and improved performance guarantees for sequential learning tools for identifying support networks from sparse candidates in multi-task learning, auxiliary task learning, federated learning, and multi-agent systems.
Abstract: Many modern AI and ML problems require evaluating partners’ contributions through shared yet asymmetric, computationally intensive processes and the simultaneous selection of the most beneficial candidates. Sequential approaches to these problems can be unified under a new framework, Sequential Support Network Learning (SSNL), in which the goal is to select the most beneficial candidate set of partners for all participants using trials; that is, to learn a directed graph that represents the highest-performing contributions. We demonstrate that a new pure-exploration model, the semi-overlapping multi-(multi-armed) bandit (SOMMAB), in which a single evaluation provides distinct feedback to multiple bandits due to structural overlap among their arms, can be used to learn a support network from sparse candidate lists efficiently. We develop a generalized GapE algorithm for SOMMABs and derive new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification. The bounds scale linearly with the degree of overlap, revealing significant sample-complexity gains arising from shared evaluations. From an application point of view, this work provides a theoretical foundation and improved performance guarantees for sequential learning tools for identifying support networks from sparse candidates in multiple learning problems, such as in multi-task learning (MTL), auxiliary task learning (ATL), federated learning (FL), and in multi-agent systems (MAS).
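For context, the single-bandit GapE selection rule that the paper generalizes looks roughly as follows (a simplified sketch of the gap-based index after Gabillon et al.; the exact constants may differ, and the paper's SOMMAB version additionally routes one evaluation's feedback to several overlapping bandits):

```python
import numpy as np

def gape_step(means, counts, a):
    """One arm-selection step of GapE for a single bandit. means/counts:
    empirical means and pull counts per arm; a: exploration parameter."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(means)
    best, second = order[-1], order[-2]
    gaps = means[best] - means                 # gap to the empirical best arm,
    gaps[best] = means[best] - means[second]   # and to the runner-up for it
    index = -gaps + a / np.sqrt(np.maximum(counts, 1.0))
    return int(np.argmax(index))               # pull the largest-index arm
```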
[445] Attribution-Guided Distillation of Matryoshka Sparse Autoencoders
Cristina P. Martin-Linares, Jonathan P. Ling
Main category: cs.LG
TL;DR: DMSAEs distill a compact core of consistently useful features from SAEs and reuse it to train new SAEs, improving feature consistency and transferability across sparsity levels.
Details
Motivation: Sparse autoencoders (SAEs) often produce redundant features that vary across training runs and sparsity levels, making interpretations difficult to transfer and reuse.
Method: Iterative distillation cycle: train Matryoshka SAE with shared core, use gradient × activation to measure feature contributions, keep smallest subset explaining fixed fraction of attribution, transfer only core encoder weights across cycles.
Result: On Gemma-2-2B, seven distillation cycles yielded a distilled core of 197 consistently selected features, improving SAEBench metrics and demonstrating feature transferability across sparsity levels.
Conclusion: DMSAEs successfully distill reusable feature cores that improve SAE consistency and enable feature transfer across different training conditions.
Abstract: Sparse autoencoders (SAEs) aim to disentangle model activations into monosemantic, human-interpretable features. In practice, learned features are often redundant and vary across training runs and sparsity levels, which makes interpretations difficult to transfer and reuse. We introduce Distilled Matryoshka Sparse Autoencoders (DMSAEs), a training pipeline that distills a compact core of consistently useful features and reuses it to train new SAEs. DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient × activation to measure each feature's contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution. Only the core encoder weight vectors are transferred across cycles; the core decoder and all non-core latents are reinitialized each time. On Gemma-2-2B layer 12 residual stream activations, seven cycles of distillation (500M tokens, 65k width) yielded a distilled core of 197 features that were repeatedly selected. Training using this distilled core improves several SAEBench metrics and demonstrates that consistent sets of latent features can be transferred across sparsity levels.
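A minimal sketch of the attribution-based core selection (the fraction threshold and the absolute-value aggregation over tokens are assumptions; the paper applies this to the most nested Matryoshka reconstruction):

```python
import torch

def distill_core(latents, grads, frac=0.9):
    """Select the smallest feature subset explaining `frac` of the total
    gradient-times-activation attribution. latents, grads: [tokens, features]
    activations and their gradients w.r.t. next-token loss."""
    attr = (latents * grads).abs().sum(dim=0)       # per-feature attribution
    order = torch.argsort(attr, descending=True)
    csum = torch.cumsum(attr[order], dim=0)
    k = int((csum < frac * attr.sum()).sum().item()) + 1
    return order[:k]                                # indices of the core
```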
[446] Efficiently Estimating Data Efficiency for Language Model Fine-tuning
Gyung Hyun Je, Colin Raffel
Main category: cs.LG
TL;DR: The paper proposes a method to predict how much fine-tuning data a task needs (data efficiency) using gradient cosine similarity of low-confidence examples, eliminating costly annotation cycles.
Details
Motivation: LLMs often need fine-tuning for specialized tasks, but it's hard to know how much annotation data is needed, leading to expensive cycles of incremental annotation and retraining. The paper shows that performant LLMs may struggle zero-shot but improve with fine-tuning, creating a need to predict data efficiency upfront.
Method: Introduces a metric to quantify task data efficiency, then proposes using gradient cosine similarity of low-confidence examples to predict how much fine-tuning data a task requires.
Result: Validated on 30 diverse tasks, achieving 8.6% error in data efficiency prediction, typically eliminating hundreds of unnecessary annotations per task.
Conclusion: The proposed method effectively predicts task data efficiency using minimal labeled data, reducing annotation costs and eliminating wasteful incremental annotation cycles.
Abstract: While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task’s data efficiency–i.e., the number of fine-tuning examples needed to achieve a desired level of performance–is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task’s data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task’s data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. We validate our approach on a diverse set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and typically eliminating hundreds of unnecessary annotations on each task. Our experiment results and implementation code are available on GitHub.
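A hedged sketch of the predictor's core quantity, the mean pairwise gradient cosine similarity over low-confidence examples (the confidence cutoff and the aggregation into a single score are assumptions here, not the paper's exact recipe):

```python
import torch

def grad_cosine_score(model, loss_fn, batch, conf_max=0.5):
    """Mean pairwise cosine similarity of per-example gradients among
    low-confidence examples. batch: iterable of (input, label) pairs;
    assumes at least two low-confidence examples are found."""
    grads = []
    for x, y in batch:
        logits = model(x)
        if logits.softmax(-1).max().item() >= conf_max:
            continue                      # keep only low-confidence examples
        model.zero_grad()
        loss_fn(logits, y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g / (g.norm() + 1e-12))
    G = torch.stack(grads)                # [m, num_params], unit-norm rows
    sims = G @ G.T                        # pairwise cosine similarities
    m = G.shape[0]
    return ((sims.sum() - m) / (m * (m - 1))).item()  # mean off-diagonal
```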
[447] Diffusion Language Models are Provably Optimal Parallel Samplers
Haozhe Jiang, Nika Haghtalab, Lijie Chen
Main category: cs.LG
TL;DR: DLMs with chain-of-thought can achieve optimal parallel sampling steps, but need revision/remasking for optimal space efficiency.
Details
Motivation: To provide a rigorous theoretical foundation for diffusion language models as efficient parallel samplers compared to autoregressive models, and to understand their computational advantages and limitations.
Method: Formalize parallel sampling model, analyze DLMs with chain-of-thought, prove optimal step complexity, examine space complexity with remasking/revision operations, establish expressivity hierarchy.
Result: DLMs with CoT can simulate any parallel sampling algorithm with optimal sequential steps; remasking/revision enables optimal space complexity; revision/remasking DLMs are strictly more expressive than those without.
Conclusion: DLMs are theoretically justified as the most efficient parallel samplers, and enabling revision in DLMs is advocated for optimal space efficiency and expressivity.
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more expressive than those without. Our results not only provide a theoretical justification for the promise of DLMs as the most efficient parallel sampler, but also advocate for enabling revision in DLMs.
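To make the remasking operation concrete, here is a toy parallel decoding loop (the model interface, mask token, and confidence heuristic are all illustrative; the paper's analysis concerns what such loops can simulate, not a specific implementation):

```python
import torch

MASK_ID = 0  # hypothetical mask token id

def decode_with_remasking(model, seq, steps, k=2):
    """Fill all masked positions in parallel each step, then remask the k
    least-confident positions so earlier commitments can be revised, which
    is the operation the paper analyzes. seq: 1-D tensor of token ids."""
    for _ in range(steps):
        probs = model(seq).softmax(-1)              # [T, vocab]
        conf, pred = probs.max(-1)
        seq = torch.where(seq == MASK_ID, pred, seq)  # parallel unmasking
        low = conf.topk(k, largest=False).indices     # least confident slots
        seq[low] = MASK_ID                            # remask for revision
    return seq
```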
[448] ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier
Main category: cs.LG
TL;DR: ResponseRank: A method that learns preference strength from noisy proxy signals (like response times or annotator agreement) by using relative differences within local strata to rank responses by inferred preference strength.
Details
Motivation: Binary choices in RLHF only convey direction, not strength of preference. Strength is crucial for decision-making under uncertainty and preference model generalization, but hard to measure reliably. Existing proxies (response times, inter-annotator agreement) are noisy and confounded.
Method: ResponseRank uses relative differences in proxy signals to rank responses to pairwise comparisons by inferred preference strength. To control for systemic variation, it compares signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings with minimal assumptions about the strength signal.
Result: Empirical evidence shows improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns).
Conclusion: ResponseRank provides a novel approach to robustly learn preference strength from noisy signals, with demonstrated effectiveness across multiple domains. The paper also introduces Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
Abstract: Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
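A minimal sketch of the stratified ranking step (treating a smaller proxy value, such as a faster response time, as a stronger preference is an assumption made here for illustration):

```python
from collections import defaultdict

def strength_rankings(comparisons):
    """Rank pairwise comparisons by inferred preference strength within each
    stratum; only relative differences within a stratum are used, never raw
    signals across strata. comparisons: (stratum_id, pair_id, proxy) triples."""
    strata = defaultdict(list)
    for stratum, pair, proxy in comparisons:
        strata[stratum].append((proxy, pair))
    return {s: [pair for _, pair in sorted(items, key=lambda t: t[0])]
            for s, items in strata.items()}   # strongest preference first
```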
[449] Generative Classifiers Avoid Shortcut Solutions
Alexander C. Li, Ananya Kumar, Deepak Pathak
Main category: cs.LG
TL;DR: Generative classifiers using class-conditional generative models outperform discriminative classifiers on distribution shift by modeling all features instead of relying on spurious correlations.
Details
Motivation: Discriminative classifiers often learn shortcuts and spurious correlations that fail under distribution shift. The paper aims to address this fundamental limitation by exploring generative approaches that model all features comprehensively.
Method: Use class-conditional generative models (diffusion-based and autoregressive) as generative classifiers. These models are simple to train without specialized augmentations, strong regularization, extra hyperparameters, or prior knowledge of spurious correlations.
Result: Generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks. They effectively reduce the impact of spurious correlations in practical applications like medical and satellite datasets. Analysis of a Gaussian toy setting reveals the inductive biases and data properties that determine when generative classifiers outperform discriminative ones.
Conclusion: Generative classifiers provide a robust alternative to discriminative approaches by modeling all features rather than relying on spurious correlations, making them more resilient to distribution shifts across diverse domains.
Abstract: Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.
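The underlying decision rule is just Bayes' rule over class-conditional likelihoods; a minimal sketch (sourcing the likelihoods from, e.g., per-class diffusion ELBOs or autoregressive log-probabilities is an assumption about the plumbing, not shown here):

```python
import torch

def generative_classify(log_px_given_y, log_prior):
    """Bayes-rule prediction with class-conditional generative models:
    argmax_y [log p(x|y) + log p(y)]. log_px_given_y: [num_classes] tensor
    of per-class log-likelihoods of input x; log_prior: [num_classes]."""
    return int(torch.argmax(log_px_given_y + log_prior))
```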
[450] On the geometry and topology of representations: the manifolds of modular addition
Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, Vincent Létourneau, Colin Daniels, Doina Precup, Jonathan Love
Main category: cs.LG
TL;DR: The paper shows that uniform attention (Clock) and trainable attention (Pizza) architectures implement the same modular addition algorithm through topologically equivalent representations, contrary to previous interpretations.
Details
Motivation: To challenge the Clock and Pizza interpretations which argued that different architectural designs yield distinct circuits for modular addition, and to demonstrate that both architectures actually implement the same underlying algorithm.
Method: The authors go beyond individual neuron interpretation by identifying all neurons corresponding to each learned representation and studying them collectively as manifolds using topological tools. They statistically analyze learned representations across hundreds of circuits.
Result: Both uniform attention and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations, showing similarity between learned modular addition circuits from common deep learning paradigms.
Conclusion: The Clock and Pizza interpretations are not fundamentally different - both architectures converge on the same computational solution for modular addition, revealing deeper structural similarities in how neural networks learn mathematical operations.
Abstract: The Clock and Pizza interpretations, associated with architectures differing in either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition. In this work, we show that this is not the case, and that both uniform attention and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond the interpretation of individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity. This method reveals that each learned representation is a manifold that we can study utilizing tools from topology. Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.
[451] Maxwell’s Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
Simon Dufort-Labbé, Pierluca D’Oro, Evgenii Nikishin, Razvan Pascanu, Pierre-Luc Bacon, Aristide Baratin
Main category: cs.LG
TL;DR: DemP (Demon Pruning) leverages dying neurons as a resource for structured pruning, using noise injection and one-cycle scheduling to achieve better accuracy-sparsity tradeoffs and faster training.
Details
Motivation: The traditional view sees dying neurons (inactive/saturated units) as harmful, but this paper explores their potential as a resource for efficient model compression and optimization.
Method: DemP controls dead neuron proliferation through noise injection on active units and a one-cycle schedule regularization strategy, dynamically creating network sparsity.
Result: Outperforms existing dense-to-sparse structured pruning methods on CIFAR-10 and ImageNet, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56×.
Conclusion: Dying neurons can be leveraged as a resource for efficient model compression, providing a novel perspective on this traditionally problematic phenomenon.
Abstract: When training neural networks, dying neurons – units becoming inactive or saturated – are traditionally seen as harmful. This paper sheds new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56$\times$. These findings provide a novel perspective on dying neurons as a resource for efficient model compression and optimization.
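A minimal sketch of the two ingredients (the one-cycle shape and the active-unit mask are assumptions; DemP's exact schedule and noise placement may differ):

```python
import torch

def one_cycle(step, total, peak):
    """One-cycle schedule: linear ramp to `peak` at mid-training, then
    linear decay back to zero (the exact shape is an assumption)."""
    half = total / 2
    return peak * (step / half if step < half else (total - step) / half)

def demp_noise(activations, sigma):
    """Inject Gaussian noise on active (post-ReLU positive) units only,
    nudging redundant neurons toward saturation so they can be pruned."""
    active = (activations > 0).float()
    return activations + sigma * active * torch.randn_like(activations)
```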
[452] Transfer learning of state-based potential games for process optimization in decentralized manufacturing systems
Steve Yuwono, Dorothea Schwung, Andreas Schwung
Main category: cs.LG
TL;DR: Novel online transfer learning approach for state-based potential games (TL-SbPGs) that enables manufacturing system players to share learned policies, improving efficiency and reducing power consumption.
Details
Motivation: Address practical industrial scenarios where knowledge sharing among similar players in large-scale, decentralized manufacturing systems can enhance learning outcomes and accelerate convergence.
Method: Develop transfer learning concepts and similarity criteria for players with two settings: predefined similarities and dynamically inferred similarities during training. Formalize SbPG framework applicability to transfer learning and optimize timing/weighting of knowledge transfer.
Result: Experimental results from laboratory-scale testbed show TL-SbPGs improve production efficiency and reduce power consumption compared to vanilla SbPGs.
Conclusion: TL-SbPGs provide an effective approach for distributed self-optimization in manufacturing systems by enabling knowledge transfer between similar players, leading to better performance and faster convergence.
Abstract: This paper presents a novel online transfer learning approach in state-based potential games (TL-SbPGs) for distributed self-optimization in manufacturing systems. The approach targets practical industrial scenarios where knowledge sharing among similar players enhances learning in large-scale and decentralized environments. TL-SbPGs enable players to reuse learned policies from others, which improves learning outcomes and accelerates convergence. To accomplish this goal, we develop transfer learning concepts and similarity criteria for players, which offer two distinct settings: (a) predefined similarities between players and (b) dynamically inferred similarities between players during training. The applicability of the SbPG framework to transfer learning is formally established. Furthermore, we present a method to optimize the timing and weighting of knowledge transfer. Experimental results from a laboratory-scale testbed show that TL-SbPGs improve production efficiency and reduce power consumption compared to vanilla SbPGs.
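A hedged sketch of similarity-weighted policy transfer between players (entirely illustrative; the paper's contribution also covers how similarities are defined or inferred and how the timing and weighting of transfer are optimized):

```python
import numpy as np

def transfer_blend(own, peers, sims, beta=0.5):
    """Blend a player's policy parameters with a similarity-weighted average
    of its peers'. own: parameter array; peers: list of peer parameter
    arrays; sims: nonnegative similarity weight per peer; beta: transfer
    strength (all names and the blend rule are assumptions of this sketch)."""
    w = np.asarray(sims, dtype=float)
    if w.sum() == 0.0:
        return own                      # no similar peers: keep own policy
    peer_avg = sum(wi * p for wi, p in zip(w, peers)) / w.sum()
    return (1.0 - beta) * own + beta * peer_avg
```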
[453] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, Dacheng Tao
Main category: cs.LG
TL;DR: A comprehensive survey paper on model merging techniques, providing taxonomy, applications across domains, and future research directions.
Details
Motivation: Model merging is an efficient technique that doesn't require raw training data or expensive computation, but there's a significant gap in systematic literature review of these techniques.
Method: Proposes a new taxonomic approach to exhaustively discuss existing model merging methods, and surveys applications across various domains including LLMs, multimodal LLMs, and 10+ ML subfields.
Result: Provides comprehensive overview of model merging methods/theories, their applications, and maintains a GitHub repository with extensive list of papers on model merging.
Conclusion: Highlights remaining challenges of model merging and discusses future research directions, addressing the literature gap with systematic review.
Abstract: Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and more than ten machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications.
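As a concrete example of the kind of method such a survey covers, here is task arithmetic, one common data-free merging recipe (a generic sketch, not the survey's own algorithm):

```python
def merge_task_arithmetic(base, finetuned, alpha=0.3):
    """Task arithmetic: add scaled task vectors (finetuned minus base
    weights) to the base model. base: state dict of tensors; finetuned:
    list of state dicts with identical keys; alpha: scaling coefficient."""
    merged = {}
    for key in base:
        task_vector = sum(ft[key] - base[key] for ft in finetuned)
        merged[key] = base[key] + alpha * task_vector
    return merged
```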
[454] A Systematic Survey on Large Language Models for Algorithm Design
Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Kun Mao, Zhichao Lu, Zhenkun Wang, Mingxuan Yuan, Qingfu Zhang
Main category: cs.LG
TL;DR: This paper provides a systematic review of algorithm design with Large Language Models, categorizing LLM roles and synthesizing literature across the algorithm design pipeline and applications.
Details
Motivation: While LLMs have significantly advanced algorithm design automation and innovation across domains like combinatorial optimization and scientific discovery, there's a lack of systematic review. Existing surveys are either too narrow or have different objectives, hindering holistic understanding of the field.
Method: The paper introduces a taxonomy categorizing LLM roles as optimizers, predictors, extractors, and designers. It systematically reviews literature across three phases of the algorithm design pipeline and diverse algorithmic applications.
Result: The review analyzes progress, advantages, and limitations within each LLM role category and synthesizes the current landscape of algorithm design with LLMs across various applications.
Conclusion: The paper outlines key open challenges and opportunities to guide future research in algorithm design with LLMs, providing a comprehensive framework for understanding this rapidly expanding field.
Abstract: Algorithm design is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and promising solutions. In just a few years, this integration has yielded remarkable progress in areas ranging from combinatorial optimization to scientific discovery. Despite this rapid expansion, a holistic understanding of the field is hindered by the lack of a systematic review, as existing surveys either remain limited to narrow sub-fields or with different objectives. This paper seeks to provide a systematic review of algorithm design with LLMs. We introduce a taxonomy that categorises the roles of LLMs as optimizers, predictors, extractors and designers, analyzing the progress, advantages, and limitations within each category. We further synthesize literature across the three phases of the algorithm design pipeline and across diverse algorithmic applications that define the current landscape. Finally, we outline key open challenges and opportunities to guide future research.
[455] SoundnessBench: A Soundness Benchmark for Neural Network Verifiers
Xingjian Zhou, Keyi Shen, Andy Xu, Hongji Xu, Cho-Jui Hsieh, Huan Zhang, Zhouxing Shi
Main category: cs.LG
TL;DR: SoundnessBench is a new benchmark for testing neural network verifier soundness by creating instances with deliberately hidden counterexamples that existing adversarial attacks can’t find.
Details
Motivation: Existing NN verification benchmarks lack ground-truth for hard instances where no current verifier can verify properties and no counterexample can be found, making it difficult to validate verifier soundness when they claim verification on such challenging instances.
Method: Developed a training method to produce NNs with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples, systematically constructing SoundnessBench with instances across various model architectures, activation functions, and input data.
Result: The training effectively produces hidden counterexamples and SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers, demonstrating its effectiveness in testing verifier soundness.
Conclusion: SoundnessBench provides a valuable benchmark for testing NN verifier soundness, addressing a critical gap in existing verification benchmarks and helping ensure the reliability of NN verification tools in safety-critical applications.
Abstract: Neural network (NN) verification aims to formally verify properties of NNs, which is crucial for ensuring the behavior of NN-based models in safety-critical applications. In recent years, the community has developed many NN verifiers and benchmarks to evaluate them. However, existing benchmarks typically lack ground-truth for hard instances where no current verifier can verify the property and no counterexample can be found. This makes it difficult to validate the soundness of a verifier, when it claims verification on such challenging instances that no other verifier can handle. In this work, we develop a new benchmark for NN verification, named SoundnessBench, specifically for testing the soundness of NN verifiers. SoundnessBench consists of instances with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples. Thereby, it can identify false verification claims when hidden counterexamples are known to exist. We design a training method to produce NNs with hidden counterexamples and systematically construct our SoundnessBench with instances across various model architectures, activation functions, and input data. We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers. Our code is available at https://github.com/mvp-harry/SoundnessBench and our dataset is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench.
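A hedged sketch of how a counterexample might be planted during training (the loss form and weighting are assumptions; the paper's training method additionally hides the counterexample from adversarial attacks):

```python
import torch.nn.functional as F

def hidden_cex_loss(model, x, y, x_cex, y_wrong, mu=1.0):
    """Fit the clean data while forcing a specific in-ball input x_cex to a
    wrong label y_wrong, so any sound verifier must report the robustness
    property falsified. All argument names here are illustrative."""
    return F.cross_entropy(model(x), y) + mu * F.cross_entropy(model(x_cex), y_wrong)
```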
[456] ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
Main category: cs.LG
TL;DR: ParetoHqD improves multiobjective LLM alignment by representing preferences as directions and using Pareto-front data in a two-stage fine-tuning process.
Details
Motivation: Current offline multiobjective alignment algorithms have limitations: inappropriate preference representations and imbalanced reward scores hinder performance. There's a need to better align LLMs with multiple human expectations and values to serve diverse user needs.
Method: ParetoHqD represents human preferences as preference directions in objective space and treats data near the Pareto front as “high-quality” data. It uses a two-stage supervised fine-tuning process where each stage uses an individual Pareto high-quality training set that best matches its preference direction.
Result: Experimental results show ParetoHqD outperforms five baselines on two multiobjective alignment tasks, demonstrating its superiority.
Conclusion: ParetoHqD effectively addresses limitations of existing multiobjective alignment methods by using preference direction representation and Pareto-front data selection, leading to improved LLM alignment with multiple human values.
Abstract: Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD that addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as “high-quality” data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. The experimental results have demonstrated the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
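A minimal sketch of the data selection step (the strict-dominance Pareto check and the cosine alignment score are simplifications; the two-stage fine-tuning itself is not shown):

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated points under elementwise maximization
    (strict-dominance check; a simplification for this sketch).
    scores: [n, num_objectives] reward matrix."""
    keep = []
    for i in range(len(scores)):
        if not np.any(np.all(scores > scores[i], axis=1)):
            keep.append(i)
    return np.array(keep)

def select_high_quality(scores, direction, k):
    """Keep the k Pareto-front samples whose reward vectors best align
    (cosine similarity) with a given preference direction."""
    front = pareto_front(scores)
    d = direction / np.linalg.norm(direction)
    s = scores[front]
    align = (s @ d) / np.linalg.norm(s, axis=1)
    return front[np.argsort(-align)[:k]]
```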
[457] Tazza: Shuffling Neural Network Parameters for Secure and Private Federated Learning
Kichang Lee, Jaeho Jin, JaeYeon Park, Songkuk Kim, JeongGil Ko
Main category: cs.LG
TL;DR: Tazza is a secure federated learning framework that simultaneously defends against gradient inversion and model poisoning attacks using weight shuffling and shuffled model validation, achieving robust security with 6.7x better efficiency.
Details
Motivation: Federated learning preserves data privacy but remains vulnerable to security threats like gradient inversion and model poisoning by malicious clients. Existing solutions address these issues separately, often sacrificing either system robustness or model accuracy.
Method: Tazza leverages permutation equivariance and invariance properties of neural networks through weight shuffling and shuffled model validation. This approach enhances resilience against diverse poisoning attacks while ensuring data confidentiality and high model accuracy.
Result: Comprehensive evaluations on various datasets and embedded platforms show Tazza achieves robust defense with up to 6.7x improved computational efficiency compared to alternative schemes, without compromising performance.
Conclusion: Tazza provides a secure and efficient federated learning framework that simultaneously addresses both gradient inversion and model poisoning attacks, offering improved computational efficiency while maintaining data privacy and model accuracy.
Abstract: Federated learning enables decentralized model training without sharing raw data, preserving data privacy. However, its vulnerability to critical security threats, such as gradient inversion and model poisoning by malicious clients, remains unresolved. Existing solutions often address these issues separately, sacrificing either system robustness or model accuracy. This work introduces Tazza, a secure and efficient federated learning framework that simultaneously addresses both challenges. By leveraging the permutation equivariance and invariance properties of neural networks via weight shuffling and shuffled model validation, Tazza enhances resilience against diverse poisoning attacks, while ensuring data confidentiality and high model accuracy. Comprehensive evaluations on various datasets and embedded platforms show that Tazza achieves robust defense with up to 6.7x improved computational efficiency compared to alternative schemes, without compromising performance.
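The permutation property Tazza builds on is easy to verify on a two-layer MLP; the following sketch shuffles hidden units without changing the network's function (the full framework's shuffled model validation is not shown):

```python
import torch

def shuffle_hidden(W1, b1, W2, perm):
    """For y = W2 @ relu(W1 @ x + b1), permuting hidden units (rows of W1
    and b1, columns of W2) with the same permutation leaves the function
    unchanged, which is the equivariance Tazza exploits."""
    return W1[perm], b1[perm], W2[:, perm]

# quick check of functional equivalence
W1, b1, W2 = torch.randn(8, 4), torch.randn(8), torch.randn(3, 8)
x, perm = torch.randn(4), torch.randperm(8)
W1s, b1s, W2s = shuffle_hidden(W1, b1, W2, perm)
assert torch.allclose(W2 @ torch.relu(W1 @ x + b1),
                      W2s @ torch.relu(W1s @ x + b1s), atol=1e-6)
```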
[458] Lagrangian Index Policy for Restless Bandits with Average Reward
Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah
Main category: cs.LG
TL;DR: The paper analyzes Lagrangian Index Policy (LIP) for restless bandits, showing it outperforms Whittle Index Policy (WIP) in problematic cases, proposes memory-efficient RL algorithms for LIP, and provides analytical results for specific applications.
Details
Motivation: To develop a more robust alternative to Whittle Index Policy for restless multi-armed bandits that maintains good performance even when WIP fails, while also providing memory-efficient learning algorithms for practical implementation.
Method: Theoretical analysis comparing LIP and WIP performance, development of both tabular and neural network-based reinforcement learning algorithms for LIP in model-free settings, analytical calculation of Lagrangian index for restart model, and new proof of asymptotic optimality using exchangeability and de Finetti's theorem.
Result: LIP performs similarly to WIP in most cases but significantly outperforms WIP when WIP shows bad performance. The proposed RL algorithms for LIP require much less memory than analogous WIP schemes. Analytical results are provided for restart model applications including web crawling and age of information minimization.
Conclusion: LIP is a superior alternative to WIP for restless bandits, offering better robustness in problematic cases, memory-efficient learning algorithms, and analytical tractability for important applications, while maintaining asymptotic optimality guarantees.
Abstract: We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrangian index for the restart model, which applies to the optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in case of homogeneous arms as the number of arms goes to infinity, based on exchangeability and de Finetti’s theorem.
[459] Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
Main category: cs.LG
TL;DR: Open-sci-ref is a family of dense transformer models (0.13B-1.7B parameters) trained on 8 open reference datasets up to 1T tokens, providing standardized baselines for comparing training approaches across scales.
Details
Motivation: To establish reference baselines that enable researchers to assess the sanity and quality of alternative training approaches across different model scales and datasets, and to facilitate standardized comparison of training procedures.
Method: Train dense transformer models across multiple parameter scales (0.13B to 1.7B) and token scales (up to 1T tokens) on 8 recent open reference datasets, with intermediate checkpoints, logs, and evaluation code.
Result: Training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. The baselines enable comparison of training procedures through scaling trends on a common compute axis.
Conclusion: Open-sci-ref provides standardized reference baselines, intermediate checkpoints, and comprehensive evaluation tools to simplify reproduction, standardize comparison, and facilitate future research in model training approaches.
Abstract: We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluated on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and study of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
[460] Deep sequence models tend to memorize geometrically; it is unclear why
Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
Main category: cs.LG
TL;DR: Deep sequence models develop geometric memory that encodes global entity relationships, not just local co-occurrences, enabling efficient reasoning through navigation rather than brute-force lookup.
Details
Motivation: To understand how neural networks store and process atomic facts beyond simple associative memory, and to explore the emergence of geometric representations that enable more efficient reasoning.
Method: Identified geometric memory phenomenon, analyzed neural embedding geometries, connected to Node2Vec, and demonstrated how spectral bias naturally leads to geometric representations despite lacking typical pressures.
Result: Found that models synthesize geometric embeddings encoding global relationships between all entities, transforming hard reasoning tasks into easy navigation tasks, with geometry emerging even when more complex than brute-force lookup.
Conclusion: Geometric memory represents a fundamental alternative to associative lookup, emerging naturally from spectral bias, offering practitioners ways to enhance Transformer memory and suggesting revisiting intuitions about knowledge acquisition, capacity, discovery, and unlearning.
Abstract: Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that – in contrast to prevailing theories – indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.
[461] Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Main category: cs.LG
TL;DR: GRAPE is a unified framework for positional encoding using group actions, with two main variants: Multiplicative GRAPE (rotations in SO(d)) that generalizes RoPE, and Additive GRAPE (unipotent actions in GL) that generalizes ALiBi and FoX.
Details
Motivation: To create a principled, unified framework for positional encoding that brings together different families of positional mechanisms under a single group-theoretic foundation, allowing systematic exploration of positional geometry in long-context models.
Method: GRAPE uses group actions for positional encoding: (1) Multiplicative GRAPE uses rotations in SO(d) with position acting as G(n)=exp(nωL) with rank-2 skew generator L, (2) Additive GRAPE uses unipotent actions in GL that produce additive logit biases. The framework allows learned commuting subspaces and non-commuting mixtures for cross-subspace feature coupling.
Result: GRAPE recovers RoPE exactly when using canonical coordinate pairs with log-uniform spectrum, and recovers ALiBi and Forgetting Transformer as exact special cases. It provides a principled design space that subsumes existing methods while enabling new extensions with O(d) and O(rd) computational cost.
Conclusion: GRAPE offers a unified group-theoretic framework for positional encoding that systematically generalizes existing methods (RoPE, ALiBi, FoX) and provides a principled design space for exploring positional geometry in long-context models.
Abstract: We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
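A small sketch of the Multiplicative GRAPE position map for one rank-2 generator, using the closed-form exponential (the orthonormality of u and v is assumed; per-head spectra and learned subspaces are omitted):

```python
import math
import torch

def grape_rotation(n, omega, u, v):
    """Closed form of G(n) = exp(n * omega * L) for the rank-2 skew generator
    L = u v^T - v u^T with u, v orthonormal: since L^2 = -P, where P projects
    onto span{u, v}, the exponential series collapses to a rotation by
    n * omega in that plane (a Rodrigues-style identity)."""
    theta = n * omega
    P = torch.outer(u, u) + torch.outer(v, v)   # projector onto the plane
    L = torch.outer(u, v) - torch.outer(v, u)   # skew generator
    I = torch.eye(u.shape[0])
    return I + (math.cos(theta) - 1.0) * P + math.sin(theta) * L
```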
[462] Lattice: Learning to Efficiently Compress the Memory
Mahdi Karami, Razvan Pascanu, Vahab Mirrokni
Main category: cs.LG
TL;DR: Lattice is a novel RNN mechanism that compresses attention KV cache into fixed memory slots using low-rank structure, achieving sub-quadratic complexity through orthogonal updates that only incorporate novel information.
Details
Motivation: Attention mechanisms suffer from quadratic computational complexity, making them inefficient for long sequences. There's a need for more efficient sequence learning methods that maintain performance while reducing computational overhead.
Method: Lattice formulates KV cache compression as an online optimization problem, using dynamic memory update rules based on gradient descent. It employs orthogonal updates where each memory slot only incorporates information orthogonal to its current state, minimizing interference. The method includes efficient computation and chunk-wise parallelization for training scalability.
Result: Lattice outperforms strong baselines on language modeling and associative recall tasks across diverse context lengths and model sizes, achieving superior memory efficiency with significantly reduced memory sizes.
Conclusion: Lattice provides an efficient, interpretable RNN mechanism that leverages low-rank structure of attention matrices to achieve sub-quadratic complexity while maintaining strong performance on sequence learning tasks.
Abstract: Attention mechanisms have revolutionized sequence learning but suffer from quadratic computational complexity. This paper introduces Lattice, a novel recurrent neural network (RNN) mechanism that leverages the inherent low-rank structure of K-V matrices to efficiently compress the cache into a fixed number of memory slots, achieving sub-quadratic complexity. We formulate this compression as an online optimization problem and derive a dynamic memory update rule based on a single gradient descent step. The resulting recurrence features a state- and input-dependent gating mechanism, offering an interpretable memory update process. The core innovation is the orthogonal update: each memory slot is updated exclusively with information orthogonal to its current state, hence incorporating only novel, non-redundant data to minimize interference with previously stored information. We derive an efficient computation for this orthogonal update rule and further approximate it with chunk-wise parallelization to ensure training scalability. Empirically, Lattice outperforms strong baselines on language modeling and associative recall tasks across diverse context lengths and model sizes, achieving superior memory efficiency with significantly reduced memory sizes.
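A heavily hedged sketch of the orthogonal-update idea (the slot-addressing gate here is an invention for illustration; the paper derives its exact recurrence from a single gradient step on an online objective):

```python
import torch

def orthogonal_memory_update(M, k, v, lr=0.1):
    """Write into each memory slot only the component of the incoming signal
    orthogonal to the slot's current state, so only novel, non-redundant
    information is stored. M: [slots, d] memory; k, v: [d] key and value."""
    gates = torch.softmax(M @ k, dim=0)                # per-slot write gate
    for i in range(M.shape[0]):
        m = M[i]
        signal = gates[i] * v
        parallel = (signal @ m) / (m @ m + 1e-8) * m   # component along m
        M[i] = m + lr * (signal - parallel)            # keep orthogonal part
    return M
```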
[463] Probabilistically Tightened Linear Relaxation-based Perturbation Analysis for Neural Network Verification
Luca Marzari, Ferdinando Cicalese, Alessandro Farinelli
Main category: cs.LG
TL;DR: PT-LiRPA combines LiRPA over-approximation with sampling to compute tight neural network reachable sets, improving verification efficiency and robustness certificates with probabilistic guarantees.
Details
Motivation: Existing formal verification methods for neural networks can be computationally expensive and sometimes fail on challenging verification problems. There's a need for approaches that can provide tighter bounds while maintaining efficiency and offering probabilistic soundness guarantees.
Method: PT-LiRPA integrates LiRPA-based linear relaxation over-approximation techniques with a sampling-based method to estimate tight intermediate reachable sets. This combination allows for significant tightening of neural network output bounds with minimal computational overhead.
Result: The method improves robustness certificates by up to 3.31X and 2.26X compared to related work on standard benchmarks including the International Verification of Neural Networks Competition. It successfully solves challenging competition entries where state-of-the-art methods fail, providing answers with at least 99% confidence.
Conclusion: PT-LiRPA offers an effective framework that balances computational efficiency with tight bound estimation, providing probabilistic soundness guarantees for neural network verification while significantly improving robustness certification.
Abstract: We present $\textbf{P}$robabilistically $\textbf{T}$ightened $\textbf{Li}$near $\textbf{R}$elaxation-based $\textbf{P}$erturbation $\textbf{A}$nalysis ($\texttt{PT-LiRPA}$), a novel framework that combines over-approximation techniques from LiRPA-based approaches with a sampling-based method to compute tight intermediate reachable sets. In detail, we show that, with negligible computational overhead, $\texttt{PT-LiRPA}$, by exploiting the estimated reachable sets, significantly tightens the lower and upper linear bounds of a neural network’s output, reducing the computational cost of formal verification tools while providing probabilistic guarantees on verification soundness. Extensive experiments on standard formal verification benchmarks, including the International Verification of Neural Networks Competition, show that our $\texttt{PT-LiRPA}$-based verifier improves robustness certificates, i.e., the certified lower bound of $\varepsilon$ perturbation tolerated by the models, by up to 3.31X and 2.26X compared to related work. Importantly, our probabilistic approach results in a valuable solution for challenging competition entries where state-of-the-art formal verification methods fail, allowing us to provide answers with high confidence (i.e., at least 99%).
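A minimal sketch of the sampling half of the approach (the probabilistic soundness correction is abstracted into a single margin term, which is a simplification of the paper's guarantee):

```python
import torch

def sampled_layer_bounds(act_fn, x0, eps, n=10_000, margin=0.0):
    """Estimate an intermediate layer's reachable set by sampling the l-inf
    input ball around x0. act_fn: maps a batch of inputs to the activations
    of interest; the returned interval can then tighten LiRPA relaxations."""
    noise = (2.0 * torch.rand(n, *x0.shape) - 1.0) * eps
    acts = act_fn(x0.unsqueeze(0) + noise)             # [n, width]
    return acts.min(0).values - margin, acts.max(0).values + margin
```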
[464] Fast weight programming and linear transformers: from machine learning to neurobiology
Kazuki Irie, Samuel J. Gershman
Main category: cs.LG
TL;DR: This primer reviews Fast Weight Programmers (FWPs) - recurrent neural networks with 2D matrix-form hidden states that dynamically modify their own synaptic weights as short-term memory, connecting them to transformers, state space models, and biological synaptic plasticity.
Details
Motivation: To review and explain the technical foundations of Fast Weight Programmers (FWPs), a novel class of recurrent neural networks with 2D matrix-form hidden states that can dynamically modify their own synaptic weights, serving as short-term memory storage. This approach bridges artificial neural networks with biological models of synaptic plasticity.
Method: FWPs use recurrent neural network architectures with two-dimensional matrix-form hidden states instead of conventional vector-form states. These networks dynamically change their synaptic weights (fast weights) over time based on input observations, with weight modifications controlled by another network (the programmer) whose parameters are trained via gradient descent.
Result: The paper establishes FWPs as a distinct family of RNN architectures with computational characteristics that connect to transformers and state space models, while also demonstrating connections between FWPs and models of synaptic plasticity in the brain.
Conclusion: FWPs represent an important development in neural network architectures that bridges artificial and natural intelligence, offering insights into both machine learning models and biological neural computation through their dynamic weight modification mechanisms.
Abstract: Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
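A minimal fast-weight step in the linear-attention style the Primer connects FWPs to (a generic outer-product write, not any one published variant):

```python
import torch

def fwp_step(W_fast, x, Wk, Wv, Wq):
    """The slow network (Wk, Wv, Wq) emits key/value/query from input x;
    the fast weight matrix is programmed with an outer-product write and
    then queried, acting as short-term memory."""
    k, v, q = Wk @ x, Wv @ x, Wq @ x
    W_fast = W_fast + torch.outer(v, k)   # synaptic write: fast weights
    y = W_fast @ q                        # read-out from fast memory
    return W_fast, y
```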
[465] Generalising Traffic Forecasting to Regions without Traffic Observations
Xinyu Su, Majid Sarvi, Feng Liu, Egemen Tanin, Jianzhong Qi
Main category: cs.LG
TL;DR: GenCast: A traffic forecasting model for regions without sensors that uses external knowledge and physics-informed neural networks to compensate for missing observations and improve generalizability.
Details
Motivation: Traffic forecasting typically requires continuous sensor observations, but many regions lack sensors due to high costs. Existing models struggle to generalize to these sensorless regions where historical traffic data is unavailable.
Method: Proposes GenCast with three key components: 1) Physics-informed neural networks to regularize learning with physical principles, 2) External signal learning module to explore correlations between traffic and external signals like weather, 3) Spatial grouping module to filter localized features that hinder generalization.
Result: Extensive experiments on multiple real-world datasets show that GenCast consistently reduces forecasting errors compared to existing methods.
Conclusion: GenCast effectively addresses the challenge of forecasting traffic in regions without sensors by leveraging external knowledge and physical principles, demonstrating improved generalization capabilities over existing approaches.
Abstract: Traffic forecasting is essential for intelligent transportation systems. Accurate forecasting relies on continuous observations collected by traffic sensors. However, due to high deployment and maintenance costs, not all regions are equipped with such sensors. This paper aims to forecast for regions without traffic sensors, where the lack of historical traffic observations challenges the generalisability of existing models. We propose a model named GenCast, the core idea of which is to exploit external knowledge to compensate for the missing observations and to enhance generalisation. We integrate physics-informed neural networks into GenCast, enabling physical principles to regularise the learning process. We introduce an external signal learning module to explore correlations between traffic states and external signals such as weather conditions, further improving model generalisability. Additionally, we design a spatial grouping module to filter localised features that hinder model generalisability. Extensive experiments show that GenCast consistently reduces forecasting errors on multiple real-world datasets.
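A hedged sketch of what a physics-informed term could look like here (using an LWR-style conservation law, rho_t + q_x = 0, as the physical principle is an assumption; the paper does not specify its principles at this level of detail):

```python
def physics_informed_loss(data_loss, density, flow, dx, dt, mu=0.1):
    """Add a discretized conservation-law residual to the data-fitting loss.
    density, flow: [T, X] torch tensors on a space-time grid; dx, dt: grid
    spacings; mu: regularization weight (all names are illustrative)."""
    d_rho_dt = (density[1:, :-1] - density[:-1, :-1]) / dt
    d_q_dx = (flow[:-1, 1:] - flow[:-1, :-1]) / dx
    residual = ((d_rho_dt + d_q_dx) ** 2).mean()
    return data_loss + mu * residual
```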
[466] STRelay: A Universal Spatio-Temporal Relaying Framework for Location Prediction over Human Trajectory Data
Bangchao Deng, Lianhua Ji, Chunhua Chen, Xin Jing, Ling Ding, Bingqing QU, Pengyang Wang, Dingqi Yang
Main category: cs.LG
TL;DR: STRelay is a spatiotemporal relaying framework that models future spatiotemporal contexts to boost next location prediction performance across different base models.
Details
Motivation: Existing location prediction methods overlook the importance of future spatiotemporal contexts (like travel time and distance) which provide critical clues for predicting future locations, especially for non-routine activities with higher uncertainty.
Method: STRelay models future spatiotemporal contexts in a relaying manner, integrates them with historical representations from base models, and uses multi-task learning to simultaneously predict next time interval, next moving distance interval, and next location.
Result: STRelay consistently improves prediction performance by 2.49%-11.30% across five state-of-the-art base models on four real-world trajectory datasets, with particular benefits for entertainment-related locations and users traveling longer distances.
Conclusion: Future spatiotemporal contexts are valuable for location prediction, especially complementing base models that excel at regular daily routines but struggle with non-routine activities, making STRelay a universal framework that enhances existing methods.
Abstract: Next location prediction is a critical task in human mobility modeling, enabling applications like travel planning and urban mobility management. Existing methods mainly rely on historical spatiotemporal trajectory data to train sequence models that directly forecast future locations. However, they often overlook the importance of the future spatiotemporal contexts, which are highly informative for the future locations. For example, knowing how much time and distance a user will travel could serve as a critical clue for predicting the user’s next location. Against this background, we propose STRelay, a universal SpatioTemporal Relaying framework explicitly modeling the future spatiotemporal context given a human trajectory, to boost the performance of different location prediction models. Specifically, STRelay models future spatiotemporal contexts in a relaying manner, which is subsequently integrated with the encoded historical representation from a base location prediction model, enabling multi-task learning by simultaneously predicting the next time interval, next moving distance interval, and finally the next location. We evaluate STRelay integrated with five state-of-the-art location prediction base models on four real-world trajectory datasets. Results demonstrate that STRelay consistently improves prediction performance across all cases by 2.49%-11.30%. Additionally, we find that the future spatiotemporal contexts are particularly helpful for entertainment-related locations and also for user groups who prefer traveling longer distances. The performance gain on such non-daily-routine activities, which often suffer from higher uncertainty, is indeed complementary to the base location prediction models that often excel at modeling regular daily routine patterns.
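A minimal sketch of the multi-task head described above (names and sizes are ours): one fused representation feeds three predictors for the next time interval, distance interval, and location.

```python
import torch.nn as nn

class RelayHead(nn.Module):
    """Illustrative multi-task head in the spirit of STRelay: from a
    fused history-plus-future-context vector h, jointly predict the
    next time-interval bin, distance-interval bin, and location."""
    def __init__(self, d, n_time_bins, n_dist_bins, n_locations):
        super().__init__()
        self.time_head = nn.Linear(d, n_time_bins)
        self.dist_head = nn.Linear(d, n_dist_bins)
        self.loc_head = nn.Linear(d, n_locations)

    def forward(self, h):
        # three logits tensors, trained jointly (multi-task learning)
        return self.time_head(h), self.dist_head(h), self.loc_head(h)
```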
[467] RAST: A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction
Weilin Ruan, Xilin Dang, Ziyu Zhou, Sisuo Lyu, Yuxuan Liang
Main category: cs.LG
TL;DR: RAST is a retrieval-augmented framework for traffic prediction that addresses limited contextual capacity and heterogeneous patterns by integrating retrieval mechanisms with spatio-temporal modeling.
Details
Motivation: Despite progress in STGNNs and pre-trained models, traffic prediction still faces two key challenges: (1) limited contextual capacity when modeling complex spatio-temporal dependencies, and (2) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns.
Method: RAST integrates retrieval-augmented mechanisms with spatio-temporal modeling through three key designs: 1) Decoupled Encoder and Query Generator for spatial/temporal feature capture and fusion query construction, 2) Spatio-temporal Retrieval Store and Retrievers for maintaining and retrieving fine-grained patterns, and 3) Universal Backbone Predictor that flexibly accommodates various prediction models.
Result: Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
Conclusion: RAST provides a universal framework that effectively addresses key challenges in traffic prediction by leveraging retrieval-augmented mechanisms to enhance spatio-temporal modeling capabilities.
Abstract: Traffic prediction is a cornerstone of modern intelligent transportation systems and a critical task in spatio-temporal forecasting. Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have achieved significant progress in traffic prediction, two key challenges remain: (i) limited contextual capacity when modeling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. Our framework consists of three key designs: 1) Decoupled Encoder and Query Generator to capture decoupled spatial and temporal features and construct a fusion query via residual fusion; 2) Spatio-temporal Retrieval Store and Retrievers to maintain and retrieve vectorized fine-grained patterns; and 3) Universal Backbone Predictor that flexibly accommodates pre-trained STGNNs or simple MLP predictors. Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
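The retrieval step can be pictured as nearest-neighbour lookup over a store of vectorized fine-grained patterns; a toy NumPy version (ours, with cosine similarity standing in for the paper's retriever):

```python
import numpy as np

def retrieve_patterns(store_keys, store_values, query, k=4):
    """Toy retriever: return the values of the k stored spatio-temporal
    patterns whose keys are most similar (cosine) to the query."""
    sims = store_keys @ query / (
        np.linalg.norm(store_keys, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:k]
    return store_values[top], sims[top]
```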
[468] Adversarial Reinforcement Learning Framework for ESP Cheater Simulation
Inkyu Park, Jeong-Gwan Lee, Taehwan Kwon, Juheon Choi, Seungku Kim, Junsu Kim, Kimin Lee
Main category: cs.LG
TL;DR: A simulation framework for modeling ESP cheaters in games using reinforcement learning agents and adversarial game theory to study adaptive cheating behaviors and develop detectors.
Details
Motivation: ESP cheats are hard to detect because their effects aren't directly observable in player behavior, making labeled data collection difficult. Cheaters also adapt their behavior to evade detection, complicating anti-cheat system development.
Method: Proposed simulation framework with RL agents (cheaters/non-cheaters) with different observability levels. Formulated cheater-detector interaction as adversarial game. Introduced structured cheater model that dynamically switches between cheating/non-cheating based on detection risk.
Result: Framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion. Provides controllable platform for studying cheating behaviors and developing detectors.
Conclusion: The work offers an extensible simulation platform for studying adaptive cheating behaviors and developing effective cheat detectors, addressing challenges in ESP cheat detection.
Abstract: Extra-Sensory Perception (ESP) cheats, which reveal hidden in-game information such as enemy locations, are difficult to detect because their effects are not directly observable in player behavior. The lack of observable evidence makes it difficult to collect reliably labeled data, which is essential for training effective anti-cheat systems. Furthermore, cheaters often adapt their behavior by limiting or disguising their cheat usage, which further complicates detection and detector development. To address these challenges, we propose a simulation framework for controlled modeling of ESP cheaters, non-cheaters, and trajectory-based detectors. We model cheaters and non-cheaters as reinforcement learning agents with different levels of observability, while detectors classify their behavioral trajectories. Next, we formulate the interaction between the cheater and the detector as an adversarial game, allowing both players to co-adapt over time. To reflect realistic cheater strategies, we introduce a structured cheater model that dynamically switches between cheating and non-cheating behaviors based on detection risk. Experiments demonstrate that our framework successfully simulates adaptive cheater behaviors that strategically balance reward optimization and detection evasion. This work provides a controllable and extensible platform for studying adaptive cheating behaviors and developing effective cheat detectors.
[469] Optimal Approximation – Smoothness Tradeoffs for Soft-Max Functions
Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, Manolis Zampetakis
Main category: cs.LG
TL;DR: The paper analyzes optimal tradeoffs between approximation and smoothness in soft-max functions, introducing novel mechanisms with different optimality properties for various applications.
Details
Motivation: To identify optimal approximation-smoothness tradeoffs for soft-max functions, as different applications require different efficiency measures (approximation quality vs. sensitivity to input changes).
Method: Introduces three novel soft-max functions: (1) exponential mechanism (optimal for expected additive approximation with Rényi Divergence smoothness), (2) piecewise linear soft-max (optimal for worst-case additive approximation with ℓq-norm smoothness), and (3) power mechanism (optimal for expected multiplicative approximation with Rényi Divergence smoothness).
Result: The exponential mechanism has optimal tradeoff for expected additive approximation with Rényi Divergence smoothness. The piecewise linear mechanism provides optimal worst-case additive approximation with ℓq-norm smoothness and enforces sparsity. The power mechanism offers optimal expected multiplicative approximation with Rényi Divergence smoothness, improving differentially private submodular optimization.
Conclusion: Different soft-max functions are optimal for different applications based on their approximation-smoothness tradeoffs. The piecewise linear mechanism provides sparsity and ℓq-smoothness for ML and game theory, while the power mechanism improves differentially private optimization.
Abstract: A soft-max function has two main efficiency measures: (1) approximation - which corresponds to how well it approximates the maximum function, (2) smoothness - which shows how sensitive it is to changes of its input. Our goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. This leads to novel soft-max functions, each of which is optimal for a different application. The most commonly used soft-max function, called exponential mechanism, has optimal tradeoff between approximation measured in terms of expected additive approximation and smoothness measured with respect to Rényi Divergence. We introduce a soft-max function, called “piecewise linear soft-max”, with optimal tradeoff between approximation, measured in terms of worst-case additive approximation and smoothness, measured with respect to $\ell_q$-norm. The worst-case approximation guarantee of the piecewise linear mechanism enforces sparsity in the output of our soft-max function, a property that is known to be important in Machine Learning applications [Martins et al. ‘16, Laha et al. ‘18] and is not satisfied by the exponential mechanism. Moreover, the $\ell_q$-smoothness is suitable for applications in Mechanism Design and Game Theory where the piecewise linear mechanism outperforms the exponential mechanism. Finally, we investigate another soft-max function, called power mechanism, with optimal tradeoff between expected multiplicative approximation and smoothness with respect to the Rényi Divergence, which provides improved theoretical and practical results in differentially private submodular optimization.
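For reference, the exponential mechanism named above is the familiar temperature soft-max; a quick sketch in its standard differential-privacy parameterization (sensitivity-1 utilities assumed):

```python
import numpy as np

def exponential_mechanism(utilities, eps, rng=None):
    """Sample index i with probability proportional to exp(eps * u_i / 2).

    Larger eps approximates argmax more closely; smaller eps is
    smoother (less sensitive to perturbations of the utilities), which
    is the approximation-smoothness tradeoff at issue."""
    rng = rng or np.random.default_rng()
    u = np.asarray(utilities, dtype=float)
    p = np.exp(eps * (u - u.max()) / 2)  # shift by the max for stability
    p /= p.sum()
    return rng.choice(len(u), p=p)
```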
[470] On the limitation of evaluating machine unlearning using only a single training seed
Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma
Main category: cs.LG
TL;DR: The paper warns that standard empirical comparisons of machine unlearning algorithms can be misleading because deterministic MU methods are highly sensitive to the random seed used during initial model training, even for the same architecture and dataset.
Details
Motivation: Current empirical evaluations of machine unlearning algorithms often run multiple independent trials starting from the same trained model, but this practice fails to account for variability introduced by different training seeds, potentially leading to non-representative comparisons.
Method: The authors demonstrate through analysis that deterministic machine unlearning methods are particularly sensitive to the random number seed used during the initial model training phase, showing that identical MU algorithms can produce different results depending on the training seed.
Result: The paper shows that empirical comparisons of MU algorithms can be highly non-representative when only considering multiple runs from the same trained model, as the choice of training seed significantly impacts the performance of deterministic MU methods.
Conclusion: Researchers should incorporate variability across different model training seeds when empirically comparing machine unlearning algorithms to ensure more representative and reliable evaluations, rather than just running multiple trials from the same trained model.
Abstract: Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because – even for the same architecture and same dataset – some MU methods can be highly sensitive to the choice of random number seed used for model training. We illustrate that this is particularly relevant for MU methods that are deterministic, i.e., which always produce the same result when started from the same trained model. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
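The recommended protocol amounts to repeating the full train-then-unlearn pipeline across seeds rather than re-running unlearning from a single checkpoint; schematically (function names are placeholders):

```python
def evaluate_unlearning(train_fn, unlearn_fn, metric_fn, seeds):
    """Report the spread of an unlearning metric across TRAINING seeds.

    For deterministic MU methods, repeated runs from one trained model
    all coincide, so per-model repetition understates variability; the
    loop below retrains from scratch for each seed instead."""
    scores = []
    for seed in seeds:
        model = train_fn(seed=seed)    # fresh training run per seed
        scores.append(metric_fn(unlearn_fn(model)))
    return scores                      # summarize with mean and spread
```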
[471] Active Learning with Neural Networks: Insights from Nonparametric Statistics
Yinglun Zhu, Robert Nowak
Main category: cs.LG
TL;DR: Deep active learning with neural networks achieves near-optimal label complexity under low noise conditions, with polylog(1/ε) complexity when equipped with abstention.
Details
Motivation: Deep neural networks require large labeled datasets, creating a gap between empirical successes of deep active learning and theoretical guarantees. This paper aims to bridge this theory-practice gap by providing rigorous label complexity guarantees for deep active learning.
Method: Study deep active learning from nonparametric classification perspective. Under standard low noise conditions, analyze active learning with neural networks. Develop efficient deep active learning algorithm with abstention option that achieves polylog(1/ε) label complexity without low noise assumptions.
Result: First near-optimal label complexity guarantees for deep active learning. Show that active learning with neural networks can achieve minimax label complexity up to disagreement coefficient and logarithmic terms. With abstention, achieve polylog(1/ε) label complexity without low noise assumptions.
Conclusion: The paper bridges theory-practice gap by providing rigorous theoretical guarantees for deep active learning, showing neural networks can achieve near-optimal label complexity, with even better guarantees when equipped with abstention. Extends results beyond Sobolev/Hölder spaces to Radon BV² spaces associated with neural networks.
Abstract: Deep neural networks have great representation power, but typically require large numbers of training examples. This motivates deep active learning methods that can significantly reduce the amount of labeled training data. Empirical successes of deep active learning have been recently reported in the literature; however, rigorous label complexity guarantees of deep active learning have remained elusive. This constitutes a significant gap between theory and practice. This paper tackles this gap by providing the first near-optimal label complexity guarantees for deep active learning. The key insight is to study deep active learning from the nonparametric classification perspective. Under standard low noise conditions, we show that active learning with neural networks can provably achieve the minimax label complexity, up to disagreement coefficient and other logarithmic terms. When equipped with an abstention option, we further develop an efficient deep active learning algorithm that achieves $\mathsf{polylog}(1/\varepsilon)$ label complexity, without any low noise assumptions. We also provide extensions of our results beyond the commonly studied Sobolev/Hölder spaces and develop label complexity guarantees for learning in Radon $\mathsf{BV}^2$ spaces, which have recently been proposed as natural function spaces associated with neural networks.
[472] Can machines think efficiently?
Adam Winchell
Main category: cs.LG
TL;DR: The paper proposes updating the Turing Test by adding an energy efficiency constraint to evaluate intelligence through resource consumption, addressing ethical and environmental concerns of modern AI.
Details
Motivation: The original Turing Test is outdated because modern AI systems can already pass it, leading to serious ethical and environmental concerns. There's an urgent need for a more practical test that connects intelligence evaluation to real-world resource constraints.
Method: The work expands the original Turing Test by adding an energy constraint - measuring the energy spent answering questions. This forces intelligence evaluation through the lens of efficiency, connecting abstract thinking to concrete resource limitations.
Result: The proposed new test provides a measurable, practical finish line that the original test lacks, compelling society to weigh AI’s time savings against its total resource cost.
Conclusion: The energy-constrained Turing Test offers a more relevant framework for evaluating intelligence in the modern era, addressing both ethical concerns and practical resource limitations while maintaining the core concept of distinguishing human and machine intelligence.
Abstract: The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.
[473] The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing
Xingyu Xu, Yandi Shen, Yuejie Chi, Cong Ma
Main category: cs.LG
TL;DR: ScaledGD(λ) is a preconditioned gradient descent method for low-rank matrix sensing that handles unknown rank and ill-conditioning through damped preconditioning, achieving linear convergence with only logarithmic dependency on condition number.
Details
Motivation: The paper addresses two key challenges in low-rank matrix sensing: (1) unknown true rank requiring overparameterized factor representations, and (2) potential ill-conditioning of the target matrix. Vanilla gradient descent suffers from polynomial dependency on condition number and struggles with bad curvatures induced by overparameterization.
Method: ScaledGD(λ) uses overparameterized factor representations with small random initialization. It employs a specific form of damped preconditioning during gradient descent to combat bad curvatures caused by overparameterization and ill-conditioning. The preconditioning adds light computational overhead but significantly improves robustness.
Result: Under Gaussian design assumptions, ScaledGD(λ) converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and problem dimension. This represents a significant improvement over vanilla GD’s polynomial dependency on condition number.
Conclusion: The work demonstrates that preconditioning can accelerate convergence without harming generalization in overparameterized learning, providing evidence for the power of preconditioning techniques in handling ill-conditioned problems with unknown rank structure.
Abstract: We propose $\textsf{ScaledGD}(\lambda)$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD}(\lambda)$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD}(\lambda)$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD}(\lambda)$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
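In factor form, the damped preconditioning amounts to right-multiplying the factor gradient by $(X^\top X + \lambda I)^{-1}$; a one-step NumPy sketch of our reading of the update (step size and damping values are illustrative):

```python
import numpy as np

def scaled_gd_lambda_step(X, grad, eta=0.5, lam=1e-3):
    """One damped, preconditioned step on a factor X of shape (n, r).

    grad is the gradient of the sensing loss with respect to X; the
    damped preconditioner (X^T X + lam*I)^(-1) rescales poorly
    conditioned directions caused by overparameterization."""
    r = X.shape[1]
    precond = np.linalg.inv(X.T @ X + lam * np.eye(r))
    return X - eta * grad @ precond
```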
[474] HiGen: Hierarchical Graph Generative Networks
Mahdi Karami
Main category: cs.LG
TL;DR: HiGen: A hierarchical graph generative network that captures graph hierarchy and generates graphs in coarse-to-fine fashion, achieving state-of-the-art performance.
Details
Motivation: Most real-world graphs have hierarchical structures that existing graph generation methods overlook, limiting their ability to capture complex graph properties.
Method: Proposes a hierarchical generative network that generates communities in parallel at each hierarchy level, then predicts cross-edges between communities using separate neural networks. Models edge distribution with multinomial distribution and uses recursive factorization for integer-valued edge weights in autoregressive generation.
Result: Empirical studies show effectiveness and scalability, achieving state-of-the-art performance in graph quality across various benchmark datasets.
Conclusion: The proposed hierarchical graph generative model successfully captures graph hierarchical structures and enables scalable generation of large, complex graphs with superior quality.
Abstract: Most real-world graphs exhibit a hierarchical structure, which is often overlooked by existing graph generation methods. To address this limitation, we propose a novel graph generative network that captures the hierarchical nature of graphs and successively generates the graph sub-structures in a coarse-to-fine fashion. At each level of hierarchy, this model generates communities in parallel, followed by the prediction of cross-edges between communities using separate neural networks. This modular approach enables scalable graph generation for large and complex graphs. Moreover, we model the output distribution of edges in the hierarchical graph with a multinomial distribution and derive a recursive factorization for this distribution. This enables us to generate community graphs with integer-valued edge weights in an autoregressive manner. Empirical studies demonstrate the effectiveness and scalability of our proposed generative model, achieving state-of-the-art performance in terms of graph quality across various benchmark datasets. The code is available at https://github.com/Karami-m/HiGen_main.
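The multinomial edge model can be pictured as distributing a community's total integer edge weight over candidate edges; a toy version (ours; HiGen factorizes this recursively for autoregressive generation):

```python
import numpy as np

def sample_edge_weights(edge_logits, total_weight, rng=None):
    """Draw integer-valued edge weights for one community: a single
    multinomial draw spreads total_weight units over candidate edges
    with probabilities given by the model's logits."""
    rng = rng or np.random.default_rng()
    p = np.exp(edge_logits - edge_logits.max())
    p /= p.sum()
    return rng.multinomial(total_weight, p)
```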
[475] Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi
Main category: cs.LG
TL;DR: Orchid is a novel architecture that replaces quadratic attention with data-dependent global convolution for efficient long-sequence modeling while maintaining expressivity.
Details
Motivation: Address the computational inefficiency of traditional attention mechanisms (quadratic complexity) while preserving their ability to capture long-range dependencies and in-context learning capabilities.
Method: Introduces a data-dependent global convolution layer with contextually adaptive kernels conditioned on input sequences using dedicated conditioning neural networks that maintain shift equivariance.
Result: Outperforms attention-based architectures (BERT, Vision Transformers) with smaller model sizes, extends feasible sequence length beyond dense attention limitations, and maintains quasilinear scalability.
Conclusion: Orchid represents a significant advancement toward more efficient and scalable deep learning models for sequence modeling by achieving high expressivity with quasilinear complexity.
Abstract: In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling. The code is available at https://github.com/Karami-m/orchid.
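A compressed sketch of the core idea (ours, omitting Orchid's actual conditioning networks): generate a kernel from the input itself and apply a global circular convolution via FFT in O(L log L):

```python
import torch

def data_dependent_global_conv(u, kernel_net):
    """u: (batch, length) sequence; kernel_net maps u to a kernel of
    the same shape. Multiplying spectra implements a global circular
    convolution whose kernel adapts to the input; shift structure is
    preserved if kernel_net itself is shift-invariant (e.g., built
    from global pooling)."""
    k = kernel_net(u)                                  # adaptive kernel
    U = torch.fft.rfft(u, dim=-1)
    K = torch.fft.rfft(k, dim=-1)
    return torch.fft.irfft(U * K, n=u.shape[-1], dim=-1)
```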
[476] Spectral Convolutional Conditional Neural Processes
Peiman Mohseni, Nick Duffield
Main category: cs.LG
TL;DR: SConvCNPs improve ConvCNPs by using global convolution in frequency domain instead of local spatial kernels, enabling better capture of long-range dependencies with lower computational cost.
Details
Motivation: Early Neural Processes had finite-dimensional global representations that mismatched infinite-dimensional stochastic processes. ConvCNPs addressed this with functional embeddings but used local CNN kernels that struggle with long-range dependencies without expensive large kernels.
Method: Propose Spectral ConvCNPs (SConvCNPs) that perform global convolution in frequency domain, inspired by Fourier Neural Operators. Directly parameterize convolution kernels in frequency domain to leverage compact Fourier representation of natural signals.
Result: Validated effectiveness on both synthetic and real-world datasets, demonstrating how operator learning ideas can advance Neural Process capabilities.
Conclusion: SConvCNPs overcome limitations of local spatial kernels in ConvCNPs by using global frequency-domain convolution, enabling better long-range dependency capture with computational efficiency.
Abstract: Neural Processes (NPs) are meta-learning models that learn to map sets of observations to approximations of the corresponding posterior predictive distributions. By accommodating variable-sized, unstructured collections of observations and enabling probabilistic predictions at arbitrary query points, NPs provide a flexible framework for modeling functions over continuous domains. Since their introduction, numerous variants have emerged; however, early formulations shared a fundamental limitation: they compressed the observed data into finite-dimensional global representations via aggregation operations such as mean pooling. This strategy induces an intrinsic mismatch with the infinite-dimensional nature of the stochastic processes that NPs intend to model. Convolutional conditional neural processes (ConvCNPs) address this limitation by constructing infinite-dimensional functional embeddings processed through convolutional neural networks (CNNs) to enforce translation equivariance. Yet CNNs with local spatial kernels struggle to capture long-range dependencies without resorting to large kernels, which impose significant computational costs. To overcome this limitation, we propose spectral ConvCNPs (SConvCNPs), which perform global convolution in the frequency domain. Inspired by Fourier neural operators (FNOs) for learning solution operators of partial differential equations (PDEs), our approach directly parameterizes convolution kernels in the frequency domain, leveraging the relatively compact yet global Fourier representation of many natural signals. We validate the effectiveness of SConvCNPs on both synthetic and real-world datasets, demonstrating how ideas from operator learning can advance the capabilities of NPs.
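The spectral layer itself is FNO-style: the convolution kernel lives directly in the frequency domain as learned complex coefficients on the lowest modes. A depthwise PyTorch sketch (a simplification; channel mixing is omitted):

```python
import torch

class SpectralConv1d(torch.nn.Module):
    """Global convolution with a kernel parameterized in Fourier space:
    only the lowest `modes` frequencies carry learned (complex)
    weights, giving a global receptive field at low cost."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.weight = torch.nn.Parameter(
            torch.randn(channels, modes, dtype=torch.cfloat) / channels)

    def forward(self, x):                  # x: (batch, channels, length)
        X = torch.fft.rfft(x, dim=-1)
        out = torch.zeros_like(X)
        out[..., :self.modes] = X[..., :self.modes] * self.weight
        return torch.fft.irfft(out, n=x.shape[-1], dim=-1)
```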
[477] Jacobian-Enhanced Neural Networks
Steven H. Berguin
Main category: cs.LG
TL;DR: JENN improves neural network accuracy with fewer training points by predicting partial derivatives, making it superior for surrogate-based optimization where gradient accuracy is critical.
Details
Motivation: In computer-aided design, there's a need to replace expensive physics-based models with fast surrogate models. Standard neural networks require many training points, and gradient-enhanced methods need accurate partial derivatives for effective surrogate-based optimization.
Method: Jacobian-Enhanced Neural Networks (JENN) are modified densely connected multi-layer perceptrons whose training process is altered to predict partial derivatives accurately, enabling better accuracy with fewer training points.
Result: JENN demonstrates superior performance over standard neural networks for surrogate-based optimization, providing accurate partial derivatives and better overall accuracy with reduced training data requirements.
Conclusion: JENN offers a significant advancement for surrogate modeling in computer-aided design, particularly for optimization applications where gradient accuracy is essential, providing both computational efficiency and improved accuracy.
Abstract: Jacobian-Enhanced Neural Networks (JENN) are densely connected multi-layer perceptrons, whose training process is modified to predict partial derivatives accurately. Their main benefit is better accuracy with fewer training points compared to standard neural networks. These attributes are particularly desirable in the field of computer-aided design, where there is often the need to replace computationally expensive, physics-based models with fast running approximations, known as surrogate models or meta-models. Since a surrogate emulates the original model accurately in near-real time, it yields a speed benefit that can be used to carry out orders of magnitude more function calls quickly. However, in the special case of gradient-enhanced methods, there is the additional value proposition that partial derivatives are accurate, which is a critical property for one important use-case: surrogate-based optimization. This work derives the complete theory and exemplifies its superiority over standard neural nets for surrogate-based optimization.
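The training modification can be read as adding a Jacobian term to the regression loss; a PyTorch sketch under that reading (the weighting alpha is ours):

```python
import torch
import torch.nn.functional as F

def jacobian_enhanced_loss(model, x, y, dy_dx, alpha=1.0):
    """Regression loss plus the error on partial derivatives.

    x: (batch, n_in), y: (batch, n_out), dy_dx: (batch, n_out, n_in)
    target Jacobians (e.g., from adjoint solvers). Matching gradients
    is what makes the surrogate usable for gradient-based optimization."""
    x = x.requires_grad_(True)
    pred = model(x)
    jac = torch.stack(
        [torch.autograd.grad(pred[:, j].sum(), x, create_graph=True)[0]
         for j in range(pred.shape[1])], dim=1)
    return F.mse_loss(pred, y) + alpha * F.mse_loss(jac, dy_dx)
```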
[478] UnPaSt: unsupervised patient stratification by biclustering of omics data
Michael Hartung, Andreas Maier, Yuliya Burankova, Fernando Delgado-Chaves, Olga I. Isaeva, Alexey Savchik, Fábio Malta de Sá Patroni, Jens J. G. Lohmann, Daniel He, Casey Shannon, Jan-Ole Schulze, Katharina Kaufmann, Zoe Chervontseva, Farzaneh Firoozbakht, Anne Hartebrodt, Niklas Probul, Olga Tsoy, Alexandra Abisheva, Evgenia Zotova, Kavya Singh, Kristel Van Steen, Malte Kuehl, Victor G. Puelles, David B. Blumenthal, Martin Ester, Tanja Laske, Jan Baumbach, Olga Zolotareva
Main category: cs.LG
TL;DR: UnPaSt is a novel biclustering algorithm for unsupervised patient stratification that outperforms existing methods in identifying disease subtypes, especially for non-mutually exclusive subtypes or those with few biomarkers.
Details
Motivation: Current unsupervised patient stratification methods are primarily benchmarked on cancers with mutually exclusive, well-differentiated subtypes, but they perform poorly for non-oncological diseases with non-mutually exclusive subtypes or subtypes discriminated by few biomarkers.
Method: Developed UnPaSt, a novel biclustering algorithm based on differentially expressed biclusters for unsupervised patient stratification. Evaluated 22 existing methods (clustering and biclustering) using simulated and real transcriptomics data.
Result: UnPaSt outperformed widely used patient stratification approaches in identifying known subtypes of breast cancer and asthma. It detected biologically insightful patterns across multiple data types (bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, multi-omics) and provided more nuanced, interpretable views of data heterogeneity.
Conclusion: UnPaSt addresses limitations of existing methods and advances precision medicine by enabling more accurate and interpretable patient stratification for diseases with complex molecular heterogeneity, particularly those with non-mutually exclusive subtypes or subtypes defined by few biomarkers.
Abstract: Unsupervised patient stratification is essential for disease subtype discovery, yet, despite growing evidence of molecular heterogeneity of non-oncological diseases, popular methods are benchmarked primarily using cancers with mutually exclusive molecular subtypes well-differentiated by numerous biomarkers. Evaluating 22 unsupervised methods, including clustering and biclustering, using simulated and real transcriptomics data revealed their inefficiency in scenarios with non-mutually exclusive subtypes or subtypes discriminated only by few biomarkers. To address these limitations and advance precision medicine, we developed UnPaSt, a novel biclustering algorithm for unsupervised patient stratification based on differentially expressed biclusters. UnPaSt outperformed widely used patient stratification approaches in the de novo identification of known subtypes of breast cancer and asthma. In addition, it detected many biologically insightful patterns across bulk transcriptomics, proteomics, single-cell, spatial transcriptomics, and multi-omics datasets, enabling a more nuanced and interpretable view of high-throughput data heterogeneity than traditionally used methods.
[479] Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching
Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman
Main category: cs.LG
TL;DR: Discrete flow matching for categorical data with dynamic optimal transport objective reduces state transitions 8x while maintaining performance, plus new perplexity bounds and Multimask Flows that outperform masked flows.
Details
Motivation: Discrete flow matching shows competitive performance but cannot use rectification strategy like continuous flows due to stochastic discrete paths, requiring alternative methods to minimize state transitions. Also lacks instantaneous change-of-variables for precise probability estimation.
Method: Proposes dynamic-optimal-transport-like minimization objective with Kantorovich formulation for discrete flows with convex interpolants, optimized via minibatch strategies. Introduces two upper bounds on perplexity for principled training/evaluation. Presents Multimask Flows that outperform masked flows.
Result: For bag-of-words sourced flows, reduces transitions up to 8 times (1024 to 128) to reach same generative perplexity without compromising diversity. Multimask Flows outperform masked flows in generative perplexity, especially with minibatch Optimal Transport, without sacrificing diversity.
Conclusion: Proposed methods enable efficient discrete flow matching with reduced transitions and principled training/evaluation via perplexity bounds, with Multimask Flows showing superior performance to existing approaches.
Abstract: Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. In the case of bag-of-words (BoW) sourced flows, we show that such methods can reduce the number of transitions up to 8 times (1024 to 128) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation and model comparison. Finally, we introduce Multimask Flows which outperform masked flows in generative perplexity, particularly when utilizing minibatch Optimal Transport, without sacrificing diversity.
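Since transport cost depends only on inter-state similarity, the minibatch coupling reduces to an assignment problem; a sketch with Hamming distance as the cost (ours; the paper's interpolants and costs may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairing(src, tgt):
    """Pair source and target token sequences (int arrays of shape
    (batch, length)) to minimize total Hamming distance within the
    minibatch; fewer token mismatches means fewer state transitions
    along the discrete flow."""
    cost = (src[:, None, :] != tgt[None, :, :]).sum(-1)  # (B, B) costs
    rows, cols = linear_sum_assignment(cost)
    return src[rows], tgt[cols]
```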
[480] OPTIMA: Optimal One-shot Pruning for LLMs via Quadratic Programming Reconstruction
Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh
Main category: cs.LG
TL;DR: OPTIMA introduces a practical one-shot post-training pruning method that uses row-wise Quadratic Programs with shared Hessian to achieve optimal weight updates, balancing accuracy and scalability for large language models.
Details
Motivation: Current post-training pruning methods face a trade-off: simple heuristics are fast but degrade accuracy, while principled optimization methods recover accuracy but are computationally infeasible at modern scale. There's a need for a method that balances accuracy and scalability.
Method: OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. It implements an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot pruning without fine-tuning.
Result: OPTIMA consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. It prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory on an NVIDIA H100.
Conclusion: OPTIMA sets a new state-of-the-art accuracy-efficiency trade-off for one-shot post-training pruning, providing a practical solution that balances accuracy recovery with computational feasibility at modern scale.
Abstract: Post-training model pruning is a promising solution for compressing large language models, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-off for one-shot post-training pruning.
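For intuition, the unconstrained special case of the per-row QP has a closed form: with the shared Hessian H = XX^T and a fixed mask, the kept weights solve a small linear system (a sketch only; OPTIMA's actual solver batches many such systems on an accelerator and may include additional structure):

```python
import numpy as np

def rowwise_optimal_update(H, w, mask):
    """Optimal reconstruction of one pruned weight row.

    Minimizing (w - w_hat)^T H (w - w_hat) over rows supported on the
    mask gives normal equations restricted to the kept indices."""
    S = np.flatnonzero(mask)              # indices the mask keeps
    w_hat = np.zeros_like(w)
    w_hat[S] = np.linalg.solve(H[np.ix_(S, S)], H[S, :] @ w)
    return w_hat
```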
[481] The Generalization Error of Supervised Machine Learning Algorithms
Samir M. Perlaza, Xinying Zou
Main category: cs.LG
TL;DR: The paper introduces the “method of gaps” for deriving closed-form expressions of generalization error in supervised learning using information measures, distinguishing between algorithm-driven and data-driven gaps.
Details
Motivation: To develop a unified framework for obtaining exact expressions of generalization error in supervised machine learning that connects with information theory and statistics.
Method: Introduces the method of gaps, which characterizes variation in expected empirical risk when either model or dataset is fixed. Uses two types of gaps: algorithm-driven (fixed dataset) and data-driven (fixed model), both expressible in terms of relative entropies and Gibbs probability measures.
Result: The method can derive all existing exact expressions for generalization error and produces numerous new expressions, establishing connections with other statistical areas.
Conclusion: The method of gaps provides a comprehensive framework for analyzing generalization error, revealing Gibbs probability measures as natural references for supervised learning analysis.
Abstract: In this paper, the method of gaps, a technique for deriving closed-form expressions in terms of information measures for the generalization error of supervised machine learning algorithms, is introduced. The method relies on the notion of gaps, which characterize the variation of the expected empirical risk (when either the model or dataset is kept fixed) with respect to changes in the probability measure on the varying parameter (either the dataset or the model, respectively). This distinction results in two classes of gaps: Algorithm-driven gaps (fixed dataset) and data-driven gaps (fixed model). In general, the method relies on two central observations: (i) the generalization error is the expectation of an algorithm-driven gap or a data-driven gap. In the first case, the expectation is with respect to a measure on the datasets; and in the second case, with respect to a measure on the models. (ii) Both algorithm-driven gaps and data-driven gaps exhibit closed-form expressions in terms of relative entropies. In particular, algorithm-driven gaps involve a Gibbs probability measure on the set of models, which represents a supervised Gibbs algorithm. Alternatively, data-driven gaps involve a worst-case data-generating (WCDG) probability measure on the set of data points, which is also a Gibbs probability measure. Interestingly, such Gibbs measures, which are exogenous to the analysis of generalization, place both the supervised Gibbs algorithm and the WCDG probability measure as natural references for the analysis of supervised learning algorithms. All existing exact expressions for the generalization error of supervised machine learning algorithms can be obtained with the proposed method. Also, this method allows obtaining numerous new exact expressions, which allows establishing connections with other areas in statistics.
[482] Private Linear Regression with Differential Privacy and PAC Privacy
Hillary Yang, Yuntao Du
Main category: cs.LG
TL;DR: Comparison of differential privacy vs PAC Privacy for linear regression across three real-world datasets, revealing key performance differences.
Details
Motivation: Linear regression is fundamental for statistical analysis, and while differential privacy has been well-established for privacy-preserving methods, the newly proposed PAC Privacy framework hasn't been explored for linear regression. The paper aims to systematically compare these two privacy approaches.
Method: The researchers systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets. They conduct empirical evaluation and analysis of both privacy frameworks in the context of linear regression.
Result: The study observes several key findings that impact the performance of privacy-preserving linear regression. The comparison reveals important differences between differential privacy and PAC privacy approaches for this fundamental statistical task.
Conclusion: The systematic comparison provides insights into how different privacy frameworks (differential privacy vs PAC Privacy) perform for linear regression, offering guidance for practitioners choosing privacy-preserving methods for statistical analysis tasks.
Abstract: Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.
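As one concrete point of reference on the DP side, a standard baseline is sufficient-statistics perturbation, which privatizes X^T X and X^T y before solving (a sketch only; the paper's chosen mechanisms are not specified here, and calibrating noise_scale to a formal (eps, delta) budget is omitted):

```python
import numpy as np

def dp_linreg_ssp(X, y, noise_scale, rng=None):
    """Private linear regression via noisy sufficient statistics:
    perturb X^T X (symmetrically) and X^T y, then solve; a small ridge
    term keeps the noisy system well posed."""
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    N = rng.normal(scale=noise_scale, size=(d, d))
    XtX = X.T @ X + (N + N.T) / 2
    Xty = X.T @ y + rng.normal(scale=noise_scale, size=d)
    return np.linalg.solve(XtX + 1e-3 * np.eye(d), Xty)
```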
[483] Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models
Zong Ke, Shicheng Zhou, Yining Zhou, Chia Hong Chang, Rong Zhang
Main category: cs.LG
TL;DR: Proposes a GAN-based model to detect AI deepfakes in online payment systems, achieving over 95% accuracy in distinguishing legitimate transactions from manipulated images.
Details
Motivation: The growing prevalence of deepfake technology that manipulates facial features in images/videos has escalated fraud potential in online transactions, and traditional security systems struggle to identify these sophisticated fraud forms.
Method: Develops a novel GAN-based model trained on a dataset of real-world online payment images and deepfake images generated using advanced GAN architectures like StyleGAN and DeepFake to identify subtle manipulations in payment images.
Result: The proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%, significantly improving payment system robustness against AI-driven fraud.
Conclusion: The research contributes to digital security by demonstrating effective application of GANs for fraud detection in financial services, enhancing online payment security against sophisticated AI-generated deepfakes.
Abstract: This study explores the use of Generative Adversarial Networks (GANs) to detect AI deepfakes and fraudulent activities in online payment systems. With the growing prevalence of deepfake technology, which can manipulate facial features in images and videos, the potential for fraud in online transactions has escalated. Traditional security systems struggle to identify these sophisticated forms of fraud. This research proposes a novel GAN-based model that enhances online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset consisting of real-world online payment images and deepfake images generated using advanced GAN architectures, such as StyleGAN and DeepFake. The results demonstrate that the proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%. This approach significantly improves the robustness of payment systems against AI-driven fraud. The paper contributes to the growing field of digital security, offering insights into the application of GANs for fraud detection in financial services. Keywords: Payment Security, Image Recognition, Generative Adversarial Networks, AI Deepfake, Fraudulent Activities
[484] Knowledge-Driven Federated Graph Learning on Model Heterogeneity
Zhengyu Wu, Guang Zeng, Huilin Lai, Daohan Su, Jishuo Jia, Yinlin Zhu, Xunkai Li, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Main category: cs.LG
TL;DR: FedGKC framework addresses model-centric heterogeneous federated graph learning by using lightweight copilot models for knowledge exchange and dual distillation-aggregation mechanisms, achieving 3.88% accuracy gain over baselines.
Details
Motivation: Existing federated graph learning approaches assume homogeneous client models, but real-world scenarios involve organizations using GNNs of different scales and architectures (MHtFGL). This architectural diversity undermines server-side aggregation and complicates knowledge transfer across clients.
Method: Proposes FedGKC framework with: 1) Lightweight Copilot Model on each client for knowledge exchange despite heterogeneous architectures; 2) Client-side Self-Mutual Knowledge Distillation using bidirectional distillation with multi-view perturbation; 3) Server-side Knowledge-Aware Model Aggregation with dynamic weight assignment based on client knowledge.
Result: Extensive experiments on eight benchmark datasets show FedGKC achieves average accuracy gain of 3.88% over baselines in MHtFGL scenarios while maintaining excellent performance in homogeneous settings.
Conclusion: FedGKC effectively addresses model-centric heterogeneous federated graph learning challenges through knowledge collaboration mechanisms, enabling practical deployment across organizations with diverse GNN architectures while preserving privacy.
Abstract: Federated graph learning (FGL) has emerged as a promising paradigm for collaborative graph representation learning, enabling multiple parties to jointly train models while preserving data privacy. However, most existing approaches assume homogeneous client models and largely overlook the challenge of model-centric heterogeneous FGL (MHtFGL), which frequently arises in practice when organizations employ graph neural networks (GNNs) of different scales and architectures. Such architectural diversity not only undermines smooth server-side aggregation, which presupposes a unified representation space shared across clients’ updates, but also further complicates the transfer and integration of structural knowledge across clients. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework. FedGKC introduces a lightweight Copilot Model on each client to facilitate knowledge exchange while local architectures are heterogeneous across clients, and employs two complementary mechanisms: Client-side Self-Mutual Knowledge Distillation, which transfers effective knowledge between local and copilot models through bidirectional distillation with multi-view perturbation; and Server-side Knowledge-Aware Model Aggregation, which dynamically assigns aggregation weights based on knowledge provided by clients. Extensive experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy gain of 3.88% over baselines in MHtFGL scenarios, while maintaining excellent performance in homogeneous settings.
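The client-side step can be sketched as symmetric distillation through the copilot's logits (our simplification; FedGKC additionally uses multi-view perturbation):

```python
import torch.nn.functional as F

def self_mutual_distillation(local_logits, copilot_logits, T=2.0):
    """Bidirectional KD between a client's local GNN and its lightweight
    copilot: symmetric KL on temperature-softened logits, so knowledge
    crosses heterogeneous architectures via the shared copilot."""
    log_p = F.log_softmax(local_logits / T, dim=-1)
    log_q = F.log_softmax(copilot_logits / T, dim=-1)
    kl_pq = F.kl_div(log_p, log_q.exp(), reduction="batchmean")
    kl_qp = F.kl_div(log_q, log_p.exp(), reduction="batchmean")
    return (T * T) * (kl_pq + kl_qp)
```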
[485] Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings
Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton
Main category: cs.LG
TL;DR: TMO is a local-cloud LLM inference system with three-M offloading (multi-modal, multi-task, multi-dialogue) that optimizes where to process requests using reinforcement learning to balance response quality, latency, and cost.
Details
Motivation: LLMs face deployment challenges: local deployment has computational/memory/energy constraints, while cloud deployment lacks real-time guarantees and incurs communication costs. Need a hybrid approach that leverages both local and cloud resources efficiently.
Method: TMO uses a lightweight local LLM for simple tasks and a large-scale cloud LLM for complex multi-modal tasks. It employs resource-constrained reinforcement learning (RCRL) to optimize inference location (local vs. cloud) and multi-modal data sources for each task/dialogue.
Result: TMO significantly outperforms exploration-decision and LLM-as-Agent baselines in latency, cost, and response quality. The paper also introduces M4A1 dataset containing reward/cost metrics across multiple modality, task, dialogue, and LLM configurations.
Conclusion: TMO’s three-M offloading approach with RCRL optimization effectively addresses LLM deployment challenges by intelligently distributing inference between local and cloud resources, achieving better performance than existing solutions.
Abstract: Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
[486] Revisiting Agnostic Boosting
Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice, Yuxin Sun
Main category: cs.LG
TL;DR: Agnostic boosting algorithm with improved sample complexity and matching lower bounds
Details
Motivation: Boosting is well-studied in realizable settings but less understood in agnostic settings where no assumptions are made about label distributions. There's a need for better agnostic boosting algorithms with improved sample complexity.
Method: Proposes a new agnostic boosting algorithm based on reduction to realizable case, followed by margin-based filtering of high-quality hypotheses.
Result: Achieves substantially improved sample complexity compared to prior works under general assumptions, with nearly-matching lower bounds that settle the sample complexity up to logarithmic factors.
Conclusion: The paper provides an efficient agnostic boosting algorithm with optimal sample complexity, closing the gap between upper and lower bounds for weak-to-strong learning in agnostic settings.
Abstract: Boosting is a key method in statistical learning, allowing for converting weak learners into strong ones. While well studied in the realizable case, the statistical properties of weak-to-strong learning remain less understood in the agnostic setting, where there are no assumptions on the distribution of the labels. In this work, we propose a new agnostic boosting algorithm with substantially improved sample complexity compared to prior works under very general assumptions. Our approach is based on a reduction to the realizable case, followed by a margin-based filtering of high-quality hypotheses. Furthermore, we show a nearly-matching lower bound, settling the sample complexity of agnostic boosting up to logarithmic factors.
[487] Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs
Pascal Passigan, Kevin Zhu, Angelina Ning
Main category: cs.LG
TL;DR: The paper presents a graph neural network framework that predicts drug-induced gene expression changes by integrating biomedical knowledge graphs with multimodal embeddings, outperforming baseline models in predicting transcriptional perturbations.
Details
Motivation: Prior deep learning frameworks have focused on link prediction and binary drug-disease associations rather than gene perturbation prediction, which is crucial for understanding drug mechanisms, predicting off-target effects, and identifying drug repurposing opportunities.Method: Constructed a merged biomedical graph integrating PrimeKG++ (augmented with semantic embeddings) and LINCS L1000 drug/cell line nodes with multimodal embeddings from foundation models (MolFormerXL, BioBERT). Trained a graph attention network (GAT) with a downstream prediction head to learn delta expression profiles of 978 landmark genes for drug-cell pairs.
Result: The framework outperforms MLP baselines for differentially expressed genes prediction under scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization demonstrate that biomedical knowledge graph edges enhance perturbation-level prediction.
Conclusion: The framework provides a path toward mechanistic drug modeling by moving beyond binary drug-disease associations to predict granular transcriptional effects of therapeutic interventions, offering insights into drug mechanisms and potential applications.
Abstract: Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of the 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) – which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression – under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.
[488] Adjusted Count Quantification Learning on Graphs
Clemens Damke, Eyke Hüllermeier
Main category: cs.LG
TL;DR: Extends Adjusted Classify & Count (ACC) to graphs, proposes structural importance sampling (SIS) for covariate shift, and introduces Neighborhood-aware ACC for non-homophilic edges.
Details
Motivation: Previous graph quantification learning only used node clustering methods, and the ACC method's prior probability shift assumption often doesn't apply to graph problems.Method: 1) Extends ACC to graphs, 2) Proposes structural importance sampling (SIS) for structural covariate shift, 3) Develops Neighborhood-aware ACC to handle non-homophilic edges.
Result: Shows effectiveness of proposed techniques on multiple graph quantification tasks, addressing limitations of previous approaches.
Conclusion: The paper introduces novel graph quantification methods that overcome limitations of existing approaches, particularly addressing structural covariate shift and non-homophilic edge challenges.
Abstract: Quantification learning is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify & Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not applicable to graph quantification problems. To address this issue, we propose structural importance sampling (SIS), the first graph quantification method that is applicable under (structural) covariate shift. Additionally, we propose Neighborhood-aware ACC, which improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.
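For readers unfamiliar with the base method being extended, the classical ACC adjustment is a one-line correction of the raw Classify & Count estimate. A minimal sketch of the standard (non-graph) binary case, included for reference rather than as the authors' code:

```python
import numpy as np

def adjusted_classify_and_count(y_pred, tpr, fpr):
    """Standard ACC prevalence estimate (binary case).

    y_pred: 0/1 predictions of a base classifier on the target set.
    tpr/fpr: true/false positive rates estimated on held-out data.
    ACC inverts the misclassification process:
        q = p * tpr + (1 - p) * fpr  =>  p = (q - fpr) / (tpr - fpr)
    """
    q = np.mean(y_pred)                  # raw Classify & Count estimate
    p = (q - fpr) / (tpr - fpr)          # adjust for classifier errors
    return float(np.clip(p, 0.0, 1.0))   # clip to a valid prevalence

# Example: 40% predicted positive with tpr=0.8, fpr=0.1
# => corrected prevalence (0.4 - 0.1) / (0.8 - 0.1) ≈ 0.43
print(adjusted_classify_and_count(np.array([1, 1, 0, 0, 0]), tpr=0.8, fpr=0.1))
```

The paper's contribution is to replace the prior-probability-shift assumption behind this correction with structural importance sampling when the shift is structural, i.e., graph-induced.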
[489] ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding
Santosh Rajagopalan, Jonathan Vronsky, Songbai Yan, S. Alireza Golestaneh, Shubhra Chandra, Min Zhou
Main category: cs.LG
TL;DR: ALF is a multi-modal transformer model for understanding advertiser behavior across text, image, video, and structured data, achieving SOTA performance on fraud detection, policy violation identification, and advertiser similarity matching with significant real-world impact.
Details
Motivation: The paper aims to create a unified model for understanding advertiser behavior and intent across multiple data modalities (text, image, video, structured data) to improve critical advertising platform tasks like fraud detection and policy enforcement.Method: ALF uses a multi-modal transformer architecture with contrastive learning and multi-task optimization. Key innovations include multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
Result: ALF achieves state-of-the-art performance, with production deployment showing significant gains: boosting recall by over 40 percentage points on one critical policy and increasing precision to 99.8% on another, delivering simultaneous improvements in both precision and recall.
Conclusion: ALF demonstrates that a unified multi-modal foundation model can effectively capture advertiser behavior patterns and deliver substantial real-world impact on critical advertising platform tasks through its novel architectural innovations.
Abstract: We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video, and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF demonstrates significant real-world impact by delivering simultaneous gains in both precision and recall, for instance boosting recall by over 40 percentage points on one critical policy and increasing precision to 99.8% on another. The architecture’s effectiveness stems from its novel combination of multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
[490] Not All Tokens Are Meant to Be Forgotten
Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Douglas Zytko, Prashant Khanduri, Dongxiao Zhu
Main category: cs.LG
TL;DR: TIF framework improves LLM unlearning by selectively targeting unwanted information while preserving general knowledge, reducing over-forgetting.
Details
Motivation: LLMs memorize private/copyrighted content causing privacy/legal issues. Existing unlearning methods cause over-forgetting, degrading model utility by suppressing all tokens in forget samples.Method: Targeted Information Forgetting (TIF) framework with: (1) targeted information identifier to differentiate unwanted words (UW) vs general words (GW) in forget samples, (2) Targeted Preference Optimization using Logit Preference Loss to unlearn UW and Preservation Loss to retain GW.
Result: Extensive experiments on TOFU and MUSE benchmarks show TIF enhances unlearning effectiveness while preserving model utility, achieving state-of-the-art results.
Conclusion: TIF framework effectively addresses over-forgetting in LLM unlearning by selectively targeting unwanted information, balancing privacy concerns with model utility preservation.
Abstract: Large Language Models (LLMs), pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.
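The token-selective objective can be made concrete with a sketch. The snippet below assumes a precomputed 0/1 tensor `uw_mask` marking unwanted-word tokens and uses a plain log-likelihood-ascent forgetting term; the paper's actual Logit Preference Loss and Preservation Loss are more elaborate:

```python
import torch
import torch.nn.functional as F

def targeted_forgetting_loss(logits, labels, uw_mask, alpha=1.0):
    """Token-selective unlearning sketch (not the paper's exact losses).

    logits:  (B, T, V) model outputs; labels: (B, T) target token ids.
    uw_mask: (B, T) float, 1 for unwanted-word (UW) tokens, 0 for
             general-word (GW) tokens.
    GW tokens keep a standard NLL (preservation); UW tokens have their
    log-likelihood pushed down (forgetting).
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    gw_count = (1 - uw_mask).sum().clamp(min=1)
    uw_count = uw_mask.sum().clamp(min=1)
    preserve = -(token_logp * (1 - uw_mask)).sum() / gw_count  # keep GW likely
    forget = (token_logp * uw_mask).sum() / uw_count           # suppress UW
    return preserve + alpha * forget
```

The point of the construction is visible in the two terms: only the masked tokens are forgotten, so the utility loss caused by indiscriminately suppressing whole forget samples is avoided.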
[491] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kelu Yao, Sixu Lin, Litao Liu, Changqing Zou
Main category: cs.LG
TL;DR: BiTrajDiff introduces bidirectional trajectory diffusion for offline RL data augmentation, generating both future and history trajectories from intermediate states to improve dataset diversity and performance.
Details
Motivation: Current offline RL methods suffer from distribution bias in static datasets, limiting generalizability. Existing data augmentation techniques only reconstruct future trajectories from given states, ignoring history transitions that could reveal valuable behavior patterns leading to critical high-reward states.Method: BiTrajDiff uses two independent yet complementary diffusion processes: forward diffusion generates future trajectories to predict dynamics, while backward diffusion generates history trajectories to trace essential transitions. This bidirectional approach leverages critical states as anchors to expand into underexplored state space regions.
Result: Extensive experiments on D4RL benchmark show BiTrajDiff achieves superior performance compared to other advanced data augmentation methods across various offline RL backbones.
Conclusion: Bidirectional trajectory diffusion effectively addresses distribution bias in offline RL by generating diverse trajectories in both temporal directions, enabling better exploration of valuable behavior patterns and improving policy learning.
Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
[492] Mathematical artificial data for operator learning
Heng Wu, Benzhuo Lu
Main category: cs.LG
TL;DR: MAD framework integrates physical laws with data-driven learning for differential equation solving, eliminating need for experimental/simulated training data by generating physics-embedded analytical solutions and synthetic data.
Details
Motivation: Current machine learning approaches for differential equations have limitations: data-driven methods require expensive labeled datasets, while model-driven techniques face efficiency-accuracy trade-offs. There's a need for a method that combines mathematical rigor with computational efficiency.Method: The Mathematical Artificial Data (MAD) framework exploits differential equations’ intrinsic mathematical structure to generate physics-embedded analytical solutions and associated synthetic data. This creates a physics-embedded-data-driven approach that eliminates dependence on experimental or simulated training data.
Result: MAD demonstrates generalizability and superior efficiency/accuracy across various differential equation scenarios, particularly in 2D parametric problems where both boundary values and source terms are functions. It enables computationally efficient operator learning across multi-parameter systems while maintaining mathematical rigor.
Conclusion: MAD represents a new paradigm that integrates physical laws with data-driven learning, with potential to become a universal framework for physics-informed machine intelligence in scientific computing, especially for handling complex parameter spaces.
Abstract: Machine learning has emerged as a transformative tool for solving differential equations (DEs), yet prevailing methodologies remain constrained by dual limitations: data-driven methods demand costly labeled datasets while model-driven techniques face efficiency-accuracy trade-offs. We present the Mathematical Artificial Data (MAD) framework, a new paradigm that integrates physical laws with data-driven learning to facilitate large-scale operator discovery. By exploiting DEs’ intrinsic mathematical structure to generate physics-embedded analytical solutions and associated synthetic data, MAD fundamentally eliminates dependence on experimental or simulated training data. This enables computationally efficient operator learning across multi-parameter systems while maintaining mathematical rigor. Through numerical demonstrations spanning 2D parametric problems where both the boundary values and source term are functions, we showcase MAD’s generalizability and superior efficiency/accuracy across various DE scenarios. This physics-embedded, data-driven framework, with its capacity to handle complex parameter spaces, has the potential to become a universal paradigm for physics-informed machine intelligence in scientific computing.
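The data-generation trick at the core of MAD resembles the method of manufactured solutions: choose an analytical solution, differentiate it to obtain the matching source term, and use the pair as exact labeled data. A sketch for the Poisson equation $-\Delta u = f$, under that reading of the abstract (the function family and sampling scheme here are illustrative assumptions):

```python
import numpy as np
import sympy as sp

x, y = sp.symbols("x y")

def make_poisson_pair(n_terms=3, rng=np.random.default_rng(0)):
    """Sample an analytical solution u(x, y) and derive its source f = -Δu.

    Any smooth u generates an exact (f, u) pair for -Δu = f, so operator
    training data can be synthesized without solving the PDE numerically.
    """
    u = sum(rng.normal() * sp.sin((k + 1) * sp.pi * x) * sp.sin((k + 1) * sp.pi * y)
            for k in range(n_terms))
    f = -sp.diff(u, x, 2) - sp.diff(u, y, 2)   # exact source term by differentiation
    return sp.lambdify((x, y), u, "numpy"), sp.lambdify((x, y), f, "numpy")

u_fn, f_fn = make_poisson_pair()
grid = np.linspace(0.0, 1.0, 32)
X, Y = np.meshgrid(grid, grid)
sample = (f_fn(X, Y), u_fn(X, Y))  # one (input, target) pair for operator learning
```

Repeating this with randomized coefficients yields arbitrarily many exact input-output pairs for the solution operator, which is what lets MAD dispense with simulation data entirely.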
[493] Sampling from Gaussian Processes: A Tutorial and Applications in Global Sensitivity Analysis and Optimization
Bach Do, Nafeezat A. Ajenifuja, Taiwo A. Adebiyi, Ruda Zhang
Main category: cs.LG
TL;DR: The paper presents efficient sampling methods (random Fourier features and pathwise conditioning) for Gaussian processes to enable global sensitivity analysis and optimization in engineering applications where high-fidelity simulations are expensive.
Details
Motivation: High-fidelity simulations and physical experiments are too expensive for global sensitivity analysis (GSA) and optimization tasks. Gaussian processes serve as proxy models but direct sampling from them is computationally inefficient due to infinite-dimensional nature and large covariance matrix operations.Method: The paper presents two efficient sampling methods: random Fourier features and pathwise conditioning for generating posterior samples from Gaussian processes at reduced computational cost. Alternative approaches are also briefly described.
Result: The paper demonstrates successful applications of these sampling methods through a series of numerical examples, showing how generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization.
Conclusion: Efficient sampling methods for Gaussian processes enable practical implementation of global sensitivity analysis and optimization in engineering applications where computational resources are limited, bridging a gap between machine learning techniques and engineering optimization practice.
Abstract: High-fidelity simulations and physical experiments are essential for engineering analysis and design, yet their high cost often makes two critical tasks–global sensitivity analysis (GSA) and optimization–prohibitively expensive. This limitation motivates the common use of Gaussian processes (GPs) as proxy regression models that provide uncertainty-aware predictions from a limited number of high-quality observations. GPs naturally enable efficient sampling strategies that support informed decision-making under uncertainty by extracting information from a subset of possible functions for the model of interest. However, direct sampling from GPs is inefficient due to their infinite-dimensional nature and the high cost associated with large covariance matrix operations. Despite their popularity in machine learning and statistics communities, sampling from GPs has received little attention in the community of engineering optimization. In this paper, we present the formulation and detailed implementation of two notable sampling methods–random Fourier features and pathwise conditioning–for generating posterior samples from GPs at reduced computational cost. Alternative approaches are briefly described. Importantly, we detail how the generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization. We show successful applications of these sampling methods through a series of numerical examples.
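The random Fourier feature construction the tutorial covers is standard enough to sketch. For an RBF kernel with variance $\sigma^2$ and lengthscale $\ell$, Bochner's theorem gives a finite-dimensional feature map whose inner products approximate the kernel, so prior samples become cheap linear combinations (illustrative code, not the authors'):

```python
import numpy as np

def rff_prior_samples(X, lengthscale=0.5, variance=1.0, n_features=256,
                      n_samples=3, rng=np.random.default_rng(0)):
    """Approximate samples from a GP prior with an RBF kernel.

    Bochner's theorem: k(x, x') ≈ φ(x)ᵀφ(x') with
      φ(x) = sqrt(2σ²/D) · cos(Wx + b),  rows of W ~ N(0, I/ℓ²),  b ~ U[0, 2π).
    A prior sample is then f(x) = φ(x)ᵀw with w ~ N(0, I), costing O(D) per
    evaluation instead of factorizing an n×n covariance matrix.
    """
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(n_features, d))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    phi = np.sqrt(2.0 * variance / n_features) * np.cos(X @ W.T + b)
    w = rng.normal(size=(n_features, n_samples))
    return phi @ w                       # (n_points, n_samples)

X = np.linspace(0.0, 1.0, 100)[:, None]
paths = rff_prior_samples(X)             # three cheap approximate prior draws
```

Pathwise conditioning, the second method in the tutorial, then turns these prior draws into posterior draws by adding a data-dependent correction term, again without decomposing a large covariance matrix at every query point.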
[494] Learning Network Dismantling Without Handcrafted Inputs
Haozhe Tian, Pietro Ferraro, Robert Shorten, Mahdi Jalili, Homayoun Hamedmoghadam
Main category: cs.LG
TL;DR: MIND is a message-passing GNN that eliminates handcrafted features using attention and message-iteration profiles, trained on synthetic networks to solve network dismantling, generalizing to million-node real networks.
Details
Motivation: Current message-passing GNNs for network science problems rely on handcrafted structural features, which increase computational cost and introduce bias into otherwise data-driven representations.Method: Introduces attention mechanism and message-iteration profiles to eliminate handcrafted features, plus an algorithmic approach to generate structurally diverse training sets of small synthetic networks.
Result: MIND model trained solely on synthetic networks generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods.
Conclusion: The proposed framework increases efficiency and generalizability, with potential applications beyond network dismantling to a range of complex network problems.
Abstract: The application of message-passing Graph Neural Networks has been a breakthrough for important network science problems. However, the competitive performance often relies on using handcrafted structural features as inputs, which increases computational cost and introduces bias into the otherwise purely data-driven network representations. Here, we eliminate the need for handcrafted features by introducing an attention mechanism and utilizing message-iteration profiles, in addition to an effective algorithmic approach to generate a structurally diverse training set of small synthetic networks. Thereby, we build an expressive message-passing framework and use it to efficiently solve the NP-hard problem of Network Dismantling, virtually equivalent to vital node identification, with significant real-world applications. Trained solely on diversified synthetic networks, our proposed model – MIND: Message Iteration Network Dismantler – generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods. Increased efficiency and generalizability of the proposed model can be leveraged beyond dismantling in a range of complex network problems.
[495] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning
Ali Taheri, Alireza Taban, Shanshan Ye, Abdolreza Mirzaei, Tongliang Liu, Bo Han
Main category: cs.LG
TL;DR: The paper proposes a token categorization approach for supervised fine-tuning (SFT) that separates tokens into positive (useful) and negative (unhelpful/misleading) categories, with explicit forgetting of negative tokens to improve model performance.
Details
Motivation: SFT effectiveness depends heavily on data quality and volume, and poor quality data can lead to limited performance gains or even degradation. The authors aim to reduce this dependency by identifying and handling unhelpful tokens during fine-tuning.Method: Categorize tokens in training corpus into positive (useful for performance improvement) and negative (lacking essential semantics or misleading) tokens. Positive tokens are trained normally, while negative tokens undergo explicit forgetting mechanisms to prevent harmful learning.
Result: Experiments across diverse benchmarks using various model architectures demonstrate that the forgetting mechanism enhances model performance compared to standard SFT approaches.
Conclusion: Token categorization with explicit forgetting of negative tokens helps models learn more precisely by establishing knowledge boundaries, reducing reliance on data quality/volume, and improving overall SFT effectiveness.
Abstract: Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume; otherwise, it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts – positive and negative tokens – based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization keeps the model from absorbing less informative content, and the forgetting process shapes a knowledge boundary that guides the model on what information to learn more precisely. We conduct experiments across diverse and well-established benchmarks using various model architectures, demonstrating that this forgetting mechanism enhances model performance.
[496] Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications
Zijian Liu
Main category: cs.LG
TL;DR: This paper analyzes Online Convex Optimization (OCO) algorithms in heavy-tailed settings where gradients only have finite p-th moments (1<p≤2), showing classical methods like Online Gradient Descent achieve optimal regret without modifications like gradient clipping.
Details
Motivation: Most OCO algorithms assume finite variance in stochastic gradients, but limited results exist for heavy-tailed distributions where gradients only have finite p-th central moments (1<p≤2).Method: The paper examines existing OCO algorithms (e.g., Online Gradient Descent) without any modifications under heavy-tailed noise. It establishes new regret bounds for these classical methods under the standard bounded domain assumption, showing they remain effective without operations like gradient clipping.
Result: The paper proves that classical OCO algorithms achieve fully optimal regret bounds in heavy-tailed settings, with bounds that are optimal in all parameters and can be achieved without knowing p. These results enable the first provable and optimal convergence for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping.
Conclusion: OCO with heavy-tailed gradients can be solved effectively using classical algorithms without modifications like gradient clipping. The results extend to broader settings including smooth OCO and optimistic algorithms, providing a unified framework for handling different heavy-tailed cases.
Abstract: In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by it, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable and optimal convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping. Furthermore, we explore broader settings (e.g., smooth OCO) and extend our ideas to optimistic algorithms to handle different cases simultaneously.
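Since the paper's point is that unmodified classical algorithms already achieve optimal heavy-tailed regret, it helps to recall how bare projected Online Gradient Descent looks; a generic sketch on an L2 ball, with no clipping anywhere, matching the setting studied:

```python
import numpy as np

def projected_ogd(grad_fn, x0, radius, eta, T):
    """Vanilla projected Online Gradient Descent on an L2 ball.

    Update: x_{t+1} = Π_X(x_t - η g_t). No gradient clipping or other
    modification is applied, which is exactly the regime in which the
    paper proves optimal heavy-tailed regret bounds.
    """
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(T):
        g = grad_fn(x, t)            # possibly heavy-tailed gradient estimate
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > radius:            # Euclidean projection onto the ball
            x *= radius / norm
        iterates.append(x.copy())
    return iterates
```

The analysis in the paper shows this unadorned loop, with a suitably chosen step size, already attains regret that is optimal in all parameters, even when the noise only has a finite p-th moment.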
[497] CrystalDiT: A Diffusion Transformer for Crystal Generation
Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao
Main category: cs.LG
TL;DR: CrystalDiT is a simple diffusion transformer for crystal structure generation that outperforms complex methods by treating lattice and atomic properties as a unified system.
Details
Motivation: The paper challenges the trend of using complex, multi-stream architectures for crystal structure generation, arguing that in data-limited scientific domains, simpler architectures with careful design can outperform sophisticated alternatives that are prone to overfitting.Method: CrystalDiT employs a unified transformer architecture that treats lattice and atomic properties as a single, interdependent system. It uses a periodic table-based atomic representation and a balanced training strategy, avoiding the intricate multi-stream designs of previous methods.
Result: Achieves 8.78% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.21%) and MatterGen (3.66%). Generates 63.28% unique and novel structures while maintaining comparable stability rates.
Conclusion: Architectural simplicity can be more effective than complexity for materials discovery, especially in data-limited scientific domains where sophisticated alternatives are prone to overfitting.
Abstract: We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 8.78% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.21%) and MatterGen (3.66%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.
[498] Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature
V. S. Usatyuk, D. A. Sapoznikov, S. I. Egorov
Main category: cs.LG
TL;DR: Physics-inspired graph-based classifier using Ising spins on QC-LDPC graphs achieves high accuracy with extreme feature compression for multi-class image classification.
Details
Motivation: Traditional CNN feature vectors are computationally expensive and obscure data geometry, while conventional graph-based classifiers degrade on natural multi-class images due to complex feature manifold topology.Method: Treat frozen MobileNetV2 embeddings as Ising spins on sparse Multi-Edge Type QC-LDPC graphs forming Random Bond Ising Model, tuned to Nishimori temperature where smallest Bethe-Hessian eigenvalue vanishes. Uses spectral-topological correspondence linking graph trapping sets to invariants via Ihara-Bass zeta function, and quadratic-Newton estimator for Nishimori temperature.
Result: Achieves 98.7% top-1 accuracy on ImageNet-10 and 84.92% on ImageNet-100 with 3-graph soft ensemble. Compresses 1280-dimensional MobileNetV2 features to 32 dimensions (ImageNet-10) and 64 dimensions (ImageNet-100). Hard ensemble increases top-1 by 0.1% over MobileNetV2 while cutting FLOPs by 2.67x; soft ensemble drops top-1 by only 1.09% vs ResNet50 while reducing FLOPs by 29x.
Conclusion: Topology-guided LDPC embedding produces highly compressed, accurate classifiers suitable for resource-constrained deployment, with innovations in linking trapping sets to topological defects, efficient Nishimori temperature estimation, and demonstrating practical compression-performance tradeoffs.
Abstract: Modern multi-class image classification relies on high-dimensional CNN feature vectors, which are computationally expensive and obscure the underlying data geometry. Conventional graph-based classifiers degrade on natural multi-class images because typical graphs fail to preserve separability on feature manifolds with complex topology. We address this with a physics-inspired pipeline: frozen MobileNetV2 embeddings are treated as Ising spins on a sparse Multi-Edge Type QC-LDPC graph forming a Random Bond Ising Model. The system is tuned to its Nishimori temperature, identified where the smallest Bethe-Hessian eigenvalue vanishes. Our method rests on two innovations. First, we prove a spectral-topological correspondence linking graph trapping sets to invariants via the Ihara-Bass zeta function; removing these structures boosts top-1 accuracy over four-fold in multi-class settings. Second, we develop a quadratic-Newton estimator for the Nishimori temperature, converging in around 9 Arnoldi iterations for a 6-times speedup and enabling spectral embedding at scales like ImageNet-100. The resulting graphs compress 1280-dimensional MobileNetV2 features to 32 dimensions for ImageNet-10 and 64 for ImageNet-100. We achieve 98.7% top-1 accuracy on ImageNet-10 and 84.92% on ImageNet-100 with a three-graph soft ensemble. Versus MobileNetV2, our hard ensemble increases top-1 by 0.1% while cutting FLOPs by 2.67 times; compared to ResNet50, the soft ensemble drops top-1 by only 1.09% yet reduces FLOPs by 29 times. Novelty lies in (a) rigorously linking trapping sets to topological defects, (b) an efficient Nishimori temperature estimator, and (c) demonstrating that topology-guided LDPC embedding produces highly compressed, accurate classifiers for resource-constrained deployment.
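The Nishimori-temperature criterion can be illustrated with the standard (unweighted) Bethe-Hessian $H(r) = (r^2 - 1)I - rA + D$, where $A$ is the adjacency matrix and $D$ the degree matrix. The sketch below locates the $r$ at which the smallest eigenvalue crosses zero by plain bisection, a deliberately simple stand-in for the paper's quadratic-Newton estimator; the weighted RBIM version used in the paper will differ:

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import eigsh

def bethe_hessian_smallest_eig(A, r):
    """Smallest eigenvalue of H(r) = (r² - 1)I - rA + D for sparse symmetric A."""
    n = A.shape[0]
    D = diags(np.asarray(A.sum(axis=1)).ravel())
    H = (r**2 - 1) * identity(n) - r * A + D
    return eigsh(H, k=1, which="SA", return_eigenvectors=False)[0]

def nishimori_r(A, lo=1.001, hi=20.0, tol=1e-4):
    """Bisection for the r where λ_min(H(r)) vanishes.

    λ_min is negative for small r when informative structure exists and
    becomes non-negative as r grows, so the zero crossing is bracketed.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bethe_hessian_smallest_eig(A, mid) < 0:
            lo = mid     # still negative: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The paper's quadratic-Newton estimator reaches the same crossing in roughly nine Arnoldi iterations, which is what makes the spectral embedding tractable at ImageNet-100 scale.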
[499] Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking
Xingchen Wang, Feijie Wu, Chenglin Miao, Tianchun Li, Haoyu Hu, Qiming Cao, Jing Gao, Lu Su
Main category: cs.LG
TL;DR: PM-SFL: A privacy-preserving Split Federated Learning framework using probabilistic mask training to protect against data reconstruction attacks while maintaining model utility, with personalization for data heterogeneity and knowledge compensation for system heterogeneity.
Details
Motivation: Split Federated Learning (SFL) reduces client computation but introduces privacy risks from exchanging intermediate activations, making data reconstruction attacks possible. Existing noise-based defenses degrade model performance, creating a need for better privacy-preserving methods that maintain utility.Method: PM-SFL uses probabilistic mask training to add structured randomness without explicit noise, personalized mask learning to handle data heterogeneity by tailoring submodels to each client, and layer-wise knowledge compensation for system heterogeneity to enable participation across varying resource levels with adaptive model splitting.
Result: Theoretical analysis confirms privacy protection, and experiments on image and wireless sensing tasks show PM-SFL improves accuracy, communication efficiency, and robustness to privacy attacks, with strong performance under data and system heterogeneity.
Conclusion: PM-SFL provides an effective solution for privacy-preserving Split Federated Learning that addresses both privacy risks and heterogeneity challenges while maintaining model utility and system efficiency.
Abstract: Split Federated Learning (SFL) has emerged as an efficient alternative to traditional Federated Learning (FL) by reducing client-side computation through model partitioning. However, the exchange of intermediate activations and model updates introduces significant privacy risks, especially from data reconstruction attacks that recover original inputs from intermediate representations. Existing defenses using noise injection often degrade model performance. To overcome these challenges, we present PM-SFL, a scalable and privacy-preserving SFL framework that incorporates Probabilistic Mask training to add structured randomness without relying on explicit noise. This mitigates data reconstruction risks while maintaining model utility. To address data heterogeneity, PM-SFL employs personalized mask learning that tailors submodel structures to each client’s local data. For system heterogeneity, we introduce a layer-wise knowledge compensation mechanism, enabling clients with varying resources to participate effectively under adaptive model splitting. Theoretical analysis confirms its privacy protection, and experiments on image and wireless sensing tasks demonstrate that PM-SFL consistently improves accuracy, communication efficiency, and robustness to privacy attacks, with particularly strong performance under data and system heterogeneity.
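The basic ingredient, a mask that is sampled rather than fixed, can be sketched with a straight-through Bernoulli layer. This illustrates the general idea of probabilistic mask training, not PM-SFL's exact construction:

```python
import torch
import torch.nn as nn

class ProbabilisticMask(nn.Module):
    """Per-weight Bernoulli mask with learned keep-probabilities.

    Sampling a fresh mask on every forward pass injects structured
    randomness into the transmitted activations (no explicit noise is
    added), and the straight-through estimator lets gradients reach the
    underlying mask scores.
    """
    def __init__(self, shape):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(shape))  # logits of keep-prob

    def forward(self, weight):
        p = torch.sigmoid(self.scores)
        mask = torch.bernoulli(p)
        mask = mask + p - p.detach()   # straight-through: grads flow into p
        return weight * mask
```

Personalization in PM-SFL then amounts to each client learning its own mask scores, so the effective submodel structure adapts to that client's local data distribution.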
[500] Personalized Enhanced Federated Multi-View Clustering via Heat-Kernel Tensor Decomposition
Kristina P. Sinaga
Main category: cs.LG
TL;DR: This paper proposes novel mathematical frameworks for multi-view clustering in federated learning, using quantum-inspired heat-kernel coefficients and tensor decomposition methods (PARAFAC2/Tucker) to address data heterogeneity and privacy concerns.
Details
Motivation: The motivation is to address challenges in multi-view clustering within federated learning environments, specifically dealing with data heterogeneity across different views/sources while ensuring privacy protection and efficient communication between distributed clients.Method: The method integrates optimization techniques using heat-kernel coefficients (quantum-inspired measures) instead of conventional distance metrics, combined with advanced tensor decomposition methods (PARAFAC2 and Tucker decomposition) to represent high-dimensional multi-view data while preserving inter-view relationships.
Result: The research developed four novel algorithms: E-FKMVC (efficient federated kernel multi-view clustering), FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person (personalized FedHK-PARAFAC2). Theoretical analyses provide convergence guarantees, privacy bounds, and complexity proofs to validate the methods.
Conclusion: The paper makes significant contributions to federated multi-view clustering through innovative integration of mathematical modeling and algorithm design, addressing critical challenges of data heterogeneity and privacy concerns, enabling enhanced data management and analytics in various applications.
Abstract: This paper introduces mathematical frameworks that address the challenges of multi-view clustering in federated learning environments. The objective is to integrate optimization techniques based on new objective functions employing heat-kernel coefficients to replace conventional distance metrics with quantum-inspired measures. The proposed frameworks utilize advanced tensor decomposition methods, specifically, PARAFAC2 and Tucker decomposition to efficiently represent high-dimensional, multi-view data while preserving inter-view relationships. The research has yielded four novel algorithms: an efficient federated kernel multi-view clustering (E-FKMVC) model, FedHK-PARAFAC2, FedHK-Tucker, and FedHK-MVC-Person with PARAFAC2 Decomposition (Personalized FedHK-PARAFAC2). The primary objective of these algorithms is to enhance the efficacy of clustering processes while ensuring confidentiality and efficient communication in federated learning environments. Theoretical analyses of convergence guarantees, privacy bounds, and complexity are provided to validate the effectiveness of the proposed methods. In essence, this paper makes a significant academic contribution to the field of federated multi-view clustering through its innovative integration of mathematical modeling and algorithm design. This approach addresses the critical challenges of data heterogeneity and privacy concerns, paving the way for enhanced data management and analytics in various contexts.
[501] Graph Learning is Suboptimal in Causal Bandits
Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash
Main category: cs.LG
TL;DR: Learning causal parent sets is suboptimal for regret minimization in causal bandits; parent identification conflicts with regret minimization, and bypassing graph recovery leads to better performance.
Details
Motivation: Previous causal bandit work focused on identifying reward parents or jointly learning parents while minimizing regret. The authors investigate whether these strategies are optimal, questioning if parent identification is necessary for effective regret minimization.Method: Prove theoretical results showing regret minimization and parent identification are conflicting objectives. Analyze both known and unknown parent set size regimes, establish novel regret lower bounds capturing combinatorial action space structure. Propose nearly optimal algorithms that bypass graph and parent recovery entirely.
Result: Counterintuitive finding: learning parent set is suboptimal for regret minimization. Existence of instances where regret minimization and parent identification are fundamentally conflicting. Novel algorithms outperform existing baselines with large performance gaps in experiments.
Conclusion: Parent identification is unnecessary for regret minimization in causal bandits. Bypassing graph recovery leads to better performance, challenging conventional approaches that focus on learning causal structure.
Abstract: We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward’s parents and then applying classic bandit methods to them, or jointly learning the parents while minimizing regret. We investigate whether such strategies are optimal. Somewhat counterintuitively, our results show that learning the parent set is suboptimal. We show this by proving that there exist instances where regret minimization and parent identification are fundamentally conflicting objectives. We further analyze both the known and unknown parent set size regimes and establish novel regret lower bounds that capture the combinatorial structure of the action space. Building on these insights, we propose nearly optimal algorithms that bypass graph and parent recovery, demonstrating that parent identification is indeed unnecessary for regret minimization. Experiments confirm that there exists a large performance gap between our method and existing baselines in various environments.
[502] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison
Yoonho Lee, Joseph Boen, Chelsea Finn
Main category: cs.LG
TL;DR: Feedback Descent optimizes text artifacts (prompts, code, molecules) using structured textual feedback instead of scalar rewards, enabling directed optimization in text space without weight updates.
Details
Motivation: Traditional preference learning compresses detailed critiques into binary preferences, creating an information bottleneck. The paper aims to preserve high-bandwidth textual feedback to enable more effective optimization of text artifacts.Method: Uses structured textual feedback paired with comparisons as supervision. In-context learning transforms feedback into gradient-like directional information for targeted edits. Optimization occurs purely at inference time without modifying model weights, making it task-agnostic.
Result: Outperforms state-of-the-art methods: GEPA (prompt optimization), GRPO and REINVENT (reinforcement learning), and specialized graph-based molecular optimizers. On DOCKSTRING benchmark, discovers novel drug-like molecules surpassing 99.9th percentile of 260,000+ compounds across six protein targets.
Conclusion: Feedback Descent demonstrates that preserving structured textual feedback rather than compressing it to binary preferences enables more effective optimization of text artifacts across diverse domains, offering a powerful alternative to traditional preference learning approaches.
Abstract: We introduce \textit{Feedback Descent}, a framework that optimizes text artifacts – prompts, code, and molecules – through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the $99.9$th percentile of a database with more than $260{,}000$ compounds across six protein targets.
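The inference-time loop is straightforward to outline. The sketch below passes in generic `generate` and `compare` callables, both hypothetical stand-ins for an LLM call and the paper's feedback-producing evaluator:

```python
from typing import Callable, Tuple

def feedback_descent(
    task: str,
    generate: Callable[[str], str],
    compare: Callable[[str, str], Tuple[str, str]],
    n_iters: int = 20,
) -> str:
    """Schematic Feedback Descent loop (helper signatures are assumptions).

    compare(a, b) returns (winner, critique): the preferred artifact plus a
    textual critique. The critique is placed back in-context so the next
    generation makes a targeted edit; no model weights are ever updated.
    """
    best = generate(f"Propose an initial artifact for: {task}")
    for _ in range(n_iters):
        candidate = generate(
            f"Task: {task}\nCurrent artifact:\n{best}\nPropose an improved variant."
        )
        best, critique = compare(best, candidate)
        best = generate(
            f"Task: {task}\nArtifact:\n{best}\nCritique from the last comparison:\n"
            f"{critique}\nApply a targeted edit that addresses the critique."
        )
    return best
```

The contrast with scalar-reward preference learning is in the `compare` signature: returning a critique alongside the winner is what the paper calls high-bandwidth supervision, and it is the only change needed to turn pairwise comparison into directed search.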
[503] Complex variational autoencoders admit Kähler structure
Andrew Gracyk
Main category: cs.LG
TL;DR: Complex VAEs reveal Kähler geometric structure, with efficient Fisher metric computation via Kähler potential derivatives, enabling smoother latent representations through decoder geometry regularization.
Details
Motivation: While latent-Euclidean VAEs have shown Riemannian structure, this paper explores complex VAEs with complex latent spaces to uncover Kähler geometric structure, aiming to develop more efficient computational methods for Fisher information metrics.Method: Adapts arguments from real VAEs to complex VAEs, derives Fisher information metric for complex Gaussian latent variables, proposes Kähler potential derivative of complex Gaussian mixtures as efficient proxy for Fisher metric, leverages law of total covariance, and regularizes latent space with decoder geometry.
Result: Demonstrates that complex VAEs reveal Kähler structure, develops efficient computation method for Fisher metric via plurisubharmonic potential function, shows regularization with decoder geometry yields smoother representations and fewer semantic outliers, though with some sample variation trade-off.
Conclusion: Complex VAEs naturally exhibit Kähler geometry, enabling efficient Fisher metric computation through potential derivatives and improved latent space regularization, leading to better representation quality despite sample variation trade-offs.
Abstract: It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments to complex VAEs with a complex latent stage. We show that complex VAEs reveal, to a certain degree, Kähler geometric structure. Our methods are tailored to decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that acts as a rough proxy to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and because the potential is valid as a plurisubharmonic (PSH) function, the large-scale computational burden of automatic differentiation is displaced to a small scale. Our methods leverage the law of total covariance to bridge behavior between our potential and the Fisher metric. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate that these strategies, at the cost of some sample variation, yield consistently smoother representations and fewer semantic outliers.
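The two identities the abstract leans on can be stated side by side: in Kähler geometry the metric is the mixed Hessian of a potential $K$, and in information geometry the Fisher metric is the Hessian of the KL divergence, which is how the metric-potential relation is "exactly achieved under relative entropy":

```latex
% Kähler metric as the mixed Hessian of a potential K(z, \bar{z}):
g_{i\bar{j}} \;=\; \frac{\partial^2 K}{\partial z^i \,\partial \bar{z}^j},
\qquad
% Fisher information as the Hessian of the KL divergence at theta' = theta:
\mathcal{I}(\theta) \;=\; \nabla^2_{\theta'}\, D_{\mathrm{KL}}\!\left(p_{\theta}\,\middle\|\,p_{\theta'}\right)\Big|_{\theta' = \theta}.
```

Reading the first identity into the second is the paper's bridge: once the KL divergence of the complex Gaussian latent plays the role of the potential, the Fisher metric inherits Kähler structure.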
[504] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers
Malyaban Bal, Abhronil Sengupta
Main category: cs.LG
TL;DR: GRASP is a lightweight PEFT method that groups token representations and learns shared scaling/shifting vectors, reducing parameters by orders of magnitude while matching/exceeding existing methods. StochGRASP adds probabilistic perturbations for hardware noise robustness.
Details
Motivation: Parameter-efficient fine-tuning needs to be even more efficient while maintaining performance, and real-world deployment requires robustness to hardware-level variability and noise in emerging AI hardware platforms.Method: GRASP partitions token representations into groups (K ≪ D) and learns shared scaling/shifting vectors per group. StochGRASP extends this with Gaussian distributions as perturbations to weights and a noise-aware loss function for hardware variability modeling.
Result: GRASP matches/exceeds established PEFT methods (LoRA, BitFit) with order-of-magnitude parameter reduction on GLUE (RoBERTa) and E2E NLG (GPT-2). StochGRASP outperforms deterministic variants under varying noise levels.
Conclusion: GRASP provides highly efficient PEFT with minimal parameters, while StochGRASP enables robust deployment on energy-efficient, noise-prone edge hardware through probabilistic parameterization.
Abstract: Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP - GRouped Activation Shared Parameterization - a lightweight PEFT framework that partitions the D-dimensional token representations of selected layers into K ≪ D groups and learns a shared scaling and shifting vector for each group. This grouped modulation reduces the number of trainable parameters significantly while preserving the ability of the model to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization along with a noise-aware loss function formulation enables modelling hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions-an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base & RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
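The grouped modulation itself is only a few lines. A minimal sketch assuming contiguous, equal-size groups (the paper's grouping scheme and placement within the network may differ):

```python
import torch
import torch.nn as nn

class GroupedScaleShift(nn.Module):
    """GRASP-style grouped modulation (sketch; grouping details assumed).

    Partitions the D-dimensional token representation into K contiguous
    groups and learns one (scale, shift) pair per group: 2K trainable
    parameters per adapted layer instead of 2D.
    """
    def __init__(self, dim, n_groups):
        super().__init__()
        assert dim % n_groups == 0
        self.group_size = dim // n_groups
        self.scale = nn.Parameter(torch.ones(n_groups))
        self.shift = nn.Parameter(torch.zeros(n_groups))

    def forward(self, h):  # h: (..., dim)
        s = self.scale.repeat_interleave(self.group_size)
        b = self.shift.repeat_interleave(self.group_size)
        return h * s + b

layer = GroupedScaleShift(dim=768, n_groups=16)  # 32 params vs. 1,536 per-dim
```

StochGRASP would replace each deterministic (scale, shift) pair with a learned Gaussian that is sampled during training, which is what lets the noise-aware loss anticipate hardware-level weight perturbations.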
[505] A Context-Aware Temporal Modeling through Unified Multi-Scale Temporal Encoding and Hierarchical Sequence Learning for Single-Channel EEG Sleep Staging
Amirali Vakili, Salar Jahanshiri, Armin Salimi-Badr
Main category: cs.LG
TL;DR: A context-aware and interpretable framework for single-channel EEG sleep staging that improves N1 stage detection using multi-scale feature extraction, temporal modeling, and imbalance handling techniques.
Details
Motivation: Automatic sleep staging is crucial for addressing global sleep disorders. Single-channel EEG is practical but existing methods suffer from class imbalance (especially for N1 stage), limited receptive-field modeling, and lack of interpretability in black-box models.Method: Combines compact multi-scale feature extraction with temporal modeling to capture local and long-range dependencies. Uses class-weighted loss functions and data augmentation to handle imbalance. Segments EEG signals into sub-epoch chunks and averages softmax probabilities across chunks for contextual representation.
Result: Achieves 89.72% overall accuracy and 85.46% macro-average F1-score. Notably achieves 61.7% F1-score for the challenging N1 stage, showing substantial improvement over previous methods on SleepEDF datasets.
Conclusion: The proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications, with particular success in detecting the difficult N1 stage.
Abstract: Automatic sleep staging is a critical task in healthcare due to the global prevalence of sleep disorders. This study focuses on single-channel electroencephalography (EEG), a practical and widely available signal for automatic sleep staging. Existing approaches face challenges such as class imbalance, limited receptive-field modeling, and insufficient interpretability. This work proposes a context-aware and interpretable framework for single-channel EEG sleep staging, with particular emphasis on improving detection of the N1 stage. Many prior models operate as black boxes with stacked layers, lacking clearly defined and interpretable feature extraction roles. The proposed model combines compact multi-scale feature extraction with temporal modeling to capture both local and long-range dependencies. To address data imbalance, especially in the N1 stage, class-weighted loss functions and data augmentation are applied. EEG signals are segmented into sub-epoch chunks, and final predictions are obtained by averaging softmax probabilities across chunks, enhancing contextual representation and robustness. The proposed framework achieves an overall accuracy of 89.72% and a macro-average F1-score of 85.46%. Notably, it attains an F1-score of 61.7% for the challenging N1 stage, demonstrating a substantial improvement over previous methods on the SleepEDF datasets. These results indicate that the proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications.
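The chunk-and-average inference step is simple to sketch; `model` below is any assumed classifier mapping a sub-epoch chunk to stage logits:

```python
import torch

def chunked_prediction(model, epoch_signal, n_chunks=5):
    """Average softmax probabilities over sub-epoch chunks (sketch).

    epoch_signal: (1, T) single-channel EEG epoch. Each chunk is scored
    independently and the class probabilities are averaged, which smooths
    per-chunk noise and gives the final prediction local context.
    """
    chunks = torch.chunk(epoch_signal, n_chunks, dim=-1)
    probs = torch.stack([torch.softmax(model(c), dim=-1) for c in chunks])
    return probs.mean(dim=0)  # (1, n_stages)
```

Averaging probabilities rather than taking a majority vote keeps the output calibrated, which matters for a stage like N1 whose per-chunk evidence is often weak.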
[506] Evaluating Parameter Efficient Methods for RLVR
Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu
Main category: cs.LG
TL;DR: Comprehensive evaluation of PEFT methods for RLVR shows structural variants outperform standard LoRA, SVD-based methods fail due to spectral collapse, and extreme parameter reduction bottlenecks reasoning.
Details
Motivation: While PEFT methods like LoRA are commonly used in RLVR (Reinforcement Learning with Verifiable Rewards), the optimal PEFT architecture for RLVR remains unidentified, creating a need for systematic evaluation.Method: Conducted first comprehensive evaluation of over 12 PEFT methodologies on DeepSeek-R1-Distill families using mathematical reasoning benchmarks, including structural variants, SVD-informed methods, and extreme parameter reduction techniques.
Result: 1) Structural variants (DoRA, AdaLoRA, MiSS) consistently outperform standard LoRA. 2) SVD-informed methods (PiSSA, MiLoRA) fail due to spectral collapse from misalignment between principal-component updates and RL optimization. 3) Extreme parameter reduction (VeRA, Rank-1) severely bottlenecks reasoning capacity.
Conclusion: The default adoption of standard LoRA for RLVR is suboptimal; structural PEFT variants perform better, while SVD-based methods fail and extreme parameter reduction harms reasoning. This work provides guidance for parameter-efficient RL methods and calls for more exploration in this area.
Abstract: We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (\textit{e.g.,} PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (\textit{e.g.,} VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a practical guide for parameter-efficient RL and advocates further exploration of such methods.
[507] ISOPO: Proximal policy gradients without pi-old
Nilin Abrahamsen
Main category: cs.LG
TL;DR: ISOPO is a single-step method to approximate natural policy gradient, more efficient than multi-step proximal policy methods like GRPO/CISPO.
Details
Motivation: Existing proximal policy optimization methods require multiple gradient steps with importance ratio clipping to approximate natural gradient steps, which is computationally expensive. ISOPO aims to provide a more efficient single-step approximation.
Method: ISOPO normalizes log-probability gradients in the Fisher metric before contracting with advantages. A variant transforms microbatch advantages using the neural tangent kernel in each layer, applied layer-wise in a single backward pass.
Result: ISOPO achieves natural policy gradient approximation with negligible computational overhead compared to vanilla REINFORCE, requiring only a single gradient step.
Conclusion: ISOPO provides an efficient single-step method for natural policy gradient approximation, outperforming existing multi-step proximal policy methods in computational efficiency.
Abstract: This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
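A crude sketch of the simplest variant, reading the per-sequence Fisher-metric normalization as a unit-norm rescaling of each sequence's log-probability gradient (this L2 proxy and the update form are our assumptions; the paper's exact normalization may differ):

```python
import torch

def isopo_like_step(model, seq_logps, advantages, lr=1e-5):
    """Single-step update: normalize each sequence's log-prob gradient,
    weight it by the advantage, and apply one REINFORCE-like step."""
    params = [p for p in model.parameters() if p.requires_grad]
    update = [torch.zeros_like(p) for p in params]
    for logp, adv in zip(seq_logps, advantages):
        grads = torch.autograd.grad(logp, params, retain_graph=True)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-8
        for u, g in zip(update, grads):
            u += adv * g / norm  # advantage-weighted unit-norm direction
    with torch.no_grad():
        for p, u in zip(params, update):
            p += lr * u
```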
[508] End-to-End Test-Time Training for Long Context
Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
Main category: cs.LG
TL;DR: TTT-E2E: A test-time training approach that treats long-context modeling as continual learning, using sliding-window attention with meta-learning to compress context into model weights at inference time.
Details
Motivation: To address long-context language modeling by reframing it as a continual learning problem rather than focusing on architectural changes, enabling efficient processing of long sequences without modifying the core Transformer architecture.
Method: Uses standard Transformer with sliding-window attention, but continues learning at test time via next-token prediction on the given context, compressing context into model weights. Improves initialization via meta-learning during training. The approach is end-to-end both at test time (via next-token prediction) and training time (via meta-learning).
Result: For 3B models trained with 164B tokens, TTT-E2E scales with context length similarly to Transformer with full attention, while other methods like Mamba 2 and Gated DeltaNet do not. Achieves constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context.
Conclusion: Test-time training with meta-learning provides an effective approach to long-context modeling that combines the scaling benefits of full attention Transformers with the constant inference latency of RNNs, offering a promising direction for efficient long-sequence processing.
Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
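The test-time side of the method reduces to an inner training loop over the context before decoding. A minimal sketch, assuming a Hugging Face-style causal LM and illustrative chunking and optimizer settings:

```python
import torch.nn.functional as F

def ttt_on_context(model, optimizer, context_ids, chunk_len=2048):
    """Compress a long context into the weights via next-token prediction,
    then decode as usual with the updated model."""
    model.train()
    for chunk in context_ids.split(chunk_len, dim=1):
        logits = model(chunk[:, :-1]).logits
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               chunk[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
```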
cs.MA
[509] Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models
Shaurya Mallampati, Rashed Shelim, Walid Saad, Naren Ramakrishnan
Main category: cs.MA
TL;DR: LLM agents with game-theoretic reasoning and real-time adaptation outperform baselines in dynamic multi-agent cooperative environments.
Details
Motivation: While LLMs show strong reasoning in static tasks, their capabilities in dynamic, real-time multi-agent scenarios (like cooperative gameplay) remain unexplored. The paper aims to bridge this gap by combining LLM-driven agents with strategic reasoning in continuously adapting environments.
Method: Proposes a framework combining LLM-driven agents with game-theoretic principles (belief consistency, Nash equilibrium) for dynamic multi-agent environments. Includes real-time strategy refinement and adaptive feedback mechanisms that allow agents to dynamically adjust policies based on immediate contextual interactions.
Result: Achieves up to 26% improvement in return over PPO baselines in high-noise environments while maintaining real-time latency under 1.05 milliseconds. Improves collaboration efficiency, task completion rates, and flexibility.
Conclusion: Game-theoretic guidance integrated with real-time feedback enhances LLM performance in dynamic multi-agent systems, fostering more resilient and flexible strategic agents capable of real-time adaptation in cooperative scenarios.
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities across mathematical, strategic, and linguistic tasks, yet little is known about how well they reason in dynamic, real-time, multi-agent scenarios, such as collaborative environments in which agents continuously adapt to each other’s behavior, as in cooperative gameplay settings. In this paper, we bridge this gap by combining LLM-driven agents with strategic reasoning and real-time adaptation in cooperative, multi-agent environments grounded in game-theoretic principles such as belief consistency and Nash equilibrium. The proposed framework applies broadly to dynamic scenarios in which agents coordinate, communicate, and make decisions in response to continuously changing conditions. We provide real-time strategy refinement and adaptive feedback mechanisms that enable agents to dynamically adjust policies based on immediate contextual interactions, in contrast to previous efforts that evaluate LLM capabilities in static or turn-based settings. Empirical results show that our method achieves up to a 26% improvement in return over PPO baselines in high-noise environments, while maintaining real-time latency under 1.05 milliseconds. Our approach improves collaboration efficiency, task completion rates, and flexibility, illustrating that game-theoretic guidance integrated with real-time feedback enhances LLM performance, ultimately fostering more resilient and flexible strategic multi-agent systems.
cs.MM
[510] GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, Hongtao Xie
Main category: cs.MM
TL;DR: GestureHYDRA: A hybrid-modality diffusion transformer system for co-speech gesture generation that focuses on semantically explicit hand gestures, using a cascaded retrieval-augmented generation strategy and novel motion-style injective transformer layers.
Details
Motivation: Most previous co-speech gesture synthesis works neglect hand gestures with explicit and essential semantics that deliver instructional information. The paper aims to address this gap by focusing on specific hand gesture activation that conveys more meaningful information than common body movements.
Method: 1) Built a high-quality 3D human body movement dataset with semantically explicit hand gestures from live streamers. 2) Developed GestureHYDRA - a hybrid-modality diffusion transformer architecture with motion-style injective transformer layers. 3) Introduced cascaded retrieval-augmented generation strategy using a semantic gesture repository per subject. 4) Implemented adaptive audio-gesture synchronization mechanism.
Result: Quantitative and qualitative experiments demonstrate superior performance over all counterparts. The approach substantially improves semantic gesture activation and production efficiency.
Conclusion: The proposed GestureHYDRA system effectively generates co-speech gestures with semantically explicit hand gestures, addressing the limitation of previous works that neglect meaningful hand gesture semantics. The hybrid-modality approach with retrieval-augmented generation enables versatile gesture operations and improved semantic activation.
Abstract: While increasing attention has been paid to co-speech gesture synthesis, most previous works neglect to investigate hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system GestureHYDRA built upon a hybrid-modality diffusion transformer architecture with novel motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all counterparts. The project page can be found at https://mumuwei.github.io/GestureHYDRA/.
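The retrieval stage of the cascaded RAG strategy can be pictured as nearest-neighbour lookup in a per-subject gesture repository. A sketch under assumed interfaces (the `encode` function and the repository schema are hypothetical):

```python
import numpy as np

def retrieve_gesture(phrase, repo, encode):
    """Return the repository motion clip whose annotation embedding is
    closest (by cosine similarity) to the spoken phrase embedding."""
    q = encode(phrase)  # (d,) text embedding
    keys = np.stack([entry["embedding"] for entry in repo])
    sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
    return repo[int(sims.argmax())]["motion"]
```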
eess.AS
[511] Regularized autoregressive modeling and its application to audio signal reconstruction
Ondřej Mokrý, Pavel Rajmic
Main category: eess.AS
TL;DR: Proposes a generic framework for regularizing/constraining autoregressive models in signal processing, with applications to audio declipping and dequantization.
Details
Motivation: Existing approaches for regularizing AR models in speech/audio processing lack a comprehensive, generic framework that can incorporate various constraints and prior information systematically.
Method: Develops an encompassing optimization framework and algorithm for constrained/regularized AR modeling, with analysis of computational demands and convergence improvements.
Result: The method demonstrates competitiveness in declipping musical signals and superiority in declipping speech compared to state-of-the-art methods, including a heuristic GLP algorithm.
Conclusion: The proposed framework provides a comprehensive solution for constrained AR modeling in signal processing, showing strong performance in practical audio restoration tasks.
Abstract: Autoregressive (AR) modeling is invaluable in signal processing, in particular in speech and audio fields. Attempts in the literature can be found that regularize or constrain either the time-domain signal values or the AR coefficients, which is done for various reasons, including the incorporation of prior information or numerical stabilization. Although these attempts are appealing, an encompassing and generic modeling framework is still missing. We propose such a framework and the related optimization problem and algorithm. We discuss the computational demands of the algorithm and explore the effects of various improvements on its convergence speed. In the experimental part, we demonstrate the usefulness of our approach on the audio declipping and dequantization problems. We compare its performance against state-of-the-art methods and demonstrate the competitiveness of the proposed method in declipping musical signals, and its superiority in declipping speech. The evaluation includes a heuristic generalized linear prediction (GLP) algorithm, a strong competitor which has previously been presented only as a patent and is new to the scientific community.
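One way to picture the kind of encompassing formulation the abstract describes is a joint optimization over the signal and the AR coefficients; the notation below is illustrative, not the paper's:

```latex
\min_{x,\,a}\ \tfrac{1}{2}\,\lVert A(a)\,x \rVert_2^2 \;+\; \lambda\, R(a) \;+\; \iota_\Gamma(x)
```

Here A(a)x collects the AR prediction residuals x_n - \sum_k a_k x_{n-k}, R(a) regularizes the coefficients (e.g., for numerical stabilization), and the indicator \iota_\Gamma enforces time-domain constraints such as clipping or quantization consistency on the observed samples.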
eess.IV
[512] Leveraging Machine Learning for Early Detection of Lung Diseases
Bahareh Rahmani, Harsha Reddy Bindela, Rama Kanth Reddy Gosula, Krishna Yedubati, Mohammad Amir Salari, Leslie Hinyard, Payam Norouzzadeh, Eli Snir, Martin Schoen
Main category: eess.IV
TL;DR: Deep learning models (CNNs, VGG16, InceptionV3, EfficientNetB0) achieve high accuracy in diagnosing respiratory diseases (COVID-19, lung cancer, pneumonia) from chest X-rays, enabling rapid, non-invasive diagnostics for resource-limited areas.
Details
Motivation: To address healthcare disparities by providing rapid, accurate, and non-invasive diagnostic solutions for respiratory diseases, particularly in areas with limited access to radiologists and healthcare resources, through predictive and preventive healthcare approaches.
Method: Combined traditional image processing with advanced neural networks; trained and validated multiple deep learning models including CNNs, VGG16, InceptionV3, and EfficientNetB0 on chest X-ray data for diagnosing respiratory diseases.
Result: Models achieved high accuracy, precision, recall, and F1 scores, demonstrating reliability and potential for real-world diagnostic applications in detecting COVID-19, lung cancer, and pneumonia from chest X-rays.
Conclusion: Deep learning-based diagnostic systems can significantly impact patient outcomes by providing accessible, rapid, and accurate respiratory disease detection, especially in resource-limited settings, marking a shift toward predictive and preventive healthcare.
Abstract: Combining traditional image processing methods with advanced neural networks concretizes a predictive and preventive healthcare paradigm. This study offers rapid, accurate, and non-invasive diagnostic solutions that can significantly impact patient outcomes, particularly in areas with limited access to radiologists and healthcare resources. In this project, deep learning methods are applied to enhance the diagnosis of respiratory diseases such as COVID-19, lung cancer, and pneumonia from chest X-rays. We trained and validated various neural network models, including CNNs, VGG16, InceptionV3, and EfficientNetB0, achieving high accuracy, precision, recall, and F1 scores that highlight the models’ reliability and potential in real-world diagnostic applications.
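A minimal sketch of the transfer-learning setup for one of the listed backbones (input size, class set, and hyperparameters are assumptions, not the paper's configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

base = EfficientNetB0(include_top=False, weights="imagenet",
                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # warm-up with a frozen backbone
model = models.Sequential([
    base,
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),  # e.g., normal / COVID-19 / lung cancer / pneumonia
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```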
[513] Targeted Semantic Segmentation of Himalayan Glacial Lakes Using Time-Series SAR: Towards Automated GLOF Early Warning
Pawan Adhikari, Satish Raj Regmi, Hari Ram Shrestha
Main category: eess.IV
TL;DR: Automated deep learning pipeline using Sentinel-1 SAR time-series for targeted monitoring of high-risk Himalayan glacial lakes, achieving 0.9130 IoU with temporal-first training strategy and operational Docker architecture.
Details
Motivation: Existing GLOF monitoring approaches either prioritize spatial coverage for general models or rely on optical imagery limited by cloud coverage, creating a need for targeted, automated monitoring of high-risk glacial lakes.
Method: End-to-end automated deep learning pipeline using Sentinel-1 SAR time-series with “temporal-first” training strategy on U-Net with EfficientNet-B3 backbone, trained on curated dataset of 4 Himalayan lakes, plus operational Docker architecture with ASF Search API and RESTful endpoint.
Result: Model achieves IoU of 0.9130, validating the success of the temporal-first strategy, and provides a scalable operational system for automated early warning.
Conclusion: The system shifts paradigm from static mapping to dynamic automated early warning, providing scalable architectural foundation for future Early Warning Systems development.
Abstract: Glacial Lake Outburst Floods (GLOFs) are one of the most devastating climate-change-induced hazards. Existing remote monitoring approaches often prioritise maximising spatial coverage to train generalistic models or rely on optical imagery hampered by persistent cloud coverage. This paper presents an end-to-end, automated deep learning pipeline for the targeted monitoring of high-risk Himalayan glacial lakes using time-series Sentinel-1 SAR. We introduce a “temporal-first” training strategy, utilising a U-Net with an EfficientNet-B3 backbone trained on a curated cohort of 4 lakes (Tsho Rolpa, Chamlang Tsho, Tilicho and Gokyo Lake). The model achieves an IoU of 0.9130, validating the success and efficacy of the “temporal-first” strategy required for transitioning to Early Warning Systems. Beyond the model, we propose an operational engineering architecture: a Dockerised pipeline that automates data ingestion via the ASF Search API and exposes inference results via a RESTful endpoint. This system shifts the paradigm from static mapping to dynamic and automated early warning, providing a scalable architectural foundation for future development in Early Warning Systems.
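The named backbone maps directly onto a standard library call. A sketch using segmentation_models_pytorch (channel count, weights, and loss are assumptions; the paper's exact configuration may differ):

```python
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="efficientnet-b3",
    encoder_weights="imagenet",
    in_channels=2,   # e.g., Sentinel-1 VV + VH backscatter
    classes=1,       # binary lake mask
)
loss = smp.losses.DiceLoss(mode="binary")
```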
[514] The OCR-PT-CT Project: Semi-Automatic Recognition of Ancient Egyptian Hieroglyphs Based on Metric Learning
David Fuentes-Jimenez, Daniel Pizarro, Álvaro Hernández, Adin Bartoli, César Guerra Méndez, Laura de Diego-Otón, Sira Palazuelos-Cagigas, Carlos Gracia Zamacona
Main category: eess.IV
TL;DR: OCR-PT-CT project develops hieroglyph recognition using Deep Metric Learning with 97.70% accuracy, outperforming Mobilenet (93.87%) for Coffin Texts and Pyramid Texts, integrated with MORTEXVAR dataset.
Details
Motivation: Digital humanities are transforming Egyptology, requiring automated recognition of hieroglyphs from ancient Egyptian texts (Coffin Texts and Pyramid Texts) to enable systematic study and integration with existing datasets.
Method: Two approaches: 1) Mobilenet neural network trained on 140 hieroglyph classes; 2) Novel Deep Metric Learning approach for better flexibility with new or data-limited signs. System identifies hieroglyphs from images and transcribes to Gardiner's codes.
Result: Mobilenet achieved 93.87% accuracy but struggled with underrepresented classes. Deep Metric Learning achieved 97.70% accuracy, recognized more hieroglyphs, and performed better under class imbalance. Final system adopts Deep Metric Learning as default classifier.
Conclusion: Deep Metric Learning is superior for hieroglyph recognition due to higher accuracy, better handling of class imbalance, and adaptability to new signs. System integrates with MORTEXVAR dataset via CSV format for digital humanities research.
Abstract: Digital humanities are significantly transforming how Egyptologists study ancient Egyptian texts. The OCR-PT-CT project proposes a recognition method for hieroglyphs based on images of Coffin Texts (CT) from Adriaan de Buck (1935-1961) and Pyramid Texts (PT) from Middle Kingdom coffins (James Allen, 2006). The system identifies hieroglyphs and transcribes them into Gardiner’s codes. A web tool organizes them by spells and witnesses, storing the data in CSV format for integration with the MORTEXVAR dataset, which collects Coffin Texts with metadata, transliterations, and translations for research. Recognition has been addressed in two ways: a Mobilenet neural network trained on 140 hieroglyph classes achieved 93.87% accuracy but struggled with underrepresented classes. A novel Deep Metric Learning approach improves flexibility for new or data-limited signs, achieving 97.70% accuracy and recognizing more hieroglyphs. Due to its superior performance under class imbalance and adaptability, the final system adopts Deep Metric Learning as the default classifier.
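A minimal sketch of the deep-metric-learning recipe: train an embedding with a triplet loss so same-sign crops cluster, then classify new crops by the nearest class prototype (backbone, margin, and prototype construction are assumptions):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def train_step(embed, anchor, positive, negative, opt):
    loss = triplet(embed(anchor), embed(positive), embed(negative))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def classify(embed, crop, prototypes):  # prototypes: (n_classes, d) mean embeddings
    z = embed(crop)                     # (1, d) query embedding
    return torch.cdist(z, prototypes).argmin(dim=1)  # index of nearest Gardiner code
```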
[515] Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission
Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, Xuelong Li
Main category: eess.IV
TL;DR: GVC achieves extreme video compression rates as low as 0.02% using generative models, trading computation for bandwidth by shifting reconstruction to receivers with generative priors.
Details
Motivation: To achieve extreme video compression rates (as low as 0.01%) for bandwidth-constrained environments like emergency rescue, remote surveillance, and mobile edge computing, while maintaining perception-centric communication.
Method: Generative Video Compression (GVC) framework that encodes video into extremely compact representations and delegates content reconstruction to receivers using powerful generative priors. Includes compression-computation trade-off strategy for practical deployment on consumer-grade GPUs.
Result: Achieves compression rates as low as 0.02% in some cases, demonstrating viability for extreme compression while maintaining video quality through generative synthesis.
Conclusion: GVC offers a new effective, efficient, scalable, and practical video communication paradigm for bandwidth-constrained environments by shifting the burden from transmission to inference.
Abstract: Can a video be compressed at an extreme rate as low as 0.01%? Toward this goal, we achieve compression rates as low as 0.02% in some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon-Weaver model. How can we trade computation for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression-computation trade-off strategy, enabling fast inference on consumer-grade GPUs. Within the AI Flow framework, GVC opens new possibilities for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward an effective, efficient, scalable, and practical video communication paradigm.
[516] Automated Classification of First-Trimester Fetal Heart Views Using Ultrasound-Specific Self-Supervised Learning
Youssef Megahed, Aylin Erman, Robin Ducharme, Mark C. Walker, Steven Hawken, Adrian D. C. Chan
Main category: eess.IV
TL;DR: Self-supervised ultrasound foundation model (USF-MAE) outperforms supervised baselines for first-trimester fetal heart view classification, achieving 90.57% accuracy without aggressive preprocessing.
Details
Motivation: Early detection of congenital heart disease via first-trimester fetal echocardiography is challenging due to small cardiac structures, low signal-to-noise ratio, and inter-operator variability. Automated analysis at this stage is difficult, creating a need for robust classification methods.
Method: USF-MAE (ultrasound foundation model) pretrained using masked autoencoding on 370,000+ unlabelled ultrasound images across 40+ anatomical regions, then fine-tuned for classification. Evaluated on 6,720 first-trimester fetal echocardiography images to classify five categories (aorta, atrioventricular flows, V sign, X sign, Other). Compared against supervised CNN baselines (ResNet-18, ResNet-50) and ImageNet-pretrained ViT-B/16.
Result: USF-MAE achieved best performance: 90.57% accuracy, 91.15% precision, 90.57% recall, 90.71% F1-score. Outperformed strongest baseline (ResNet-18) by +2.03% accuracy and +1.98% F1-score. Demonstrated robust performance without aggressive preprocessing or ROI cropping, with improved discrimination of non-diagnostic frames.
Conclusion: Self-supervised ultrasound foundation models pretrained on large-scale unlabelled ultrasound data can effectively address challenges in first-trimester fetal heart view classification, outperforming supervised models pretrained on natural images and offering potential for improved early detection of congenital heart disease.
Abstract: Congenital heart disease remains the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. Although first-trimester fetal echocardiography offers an opportunity for earlier detection, automated analysis at this stage is challenging due to small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability. In this work, we evaluate a self-supervised ultrasound foundation model, USF-MAE, for first-trimester fetal heart view classification. USF-MAE is pretrained using masked autoencoding modelling on more than 370,000 unlabelled ultrasound images spanning over 40 anatomical regions and is subsequently fine-tuned for downstream classification. As a proof of concept, the pretrained Vision Transformer encoder was fine-tuned on an open-source dataset of 6,720 first-trimester fetal echocardiography images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. Model performance was benchmarked against supervised convolutional neural network baselines (ResNet-18 and ResNet-50) and a Vision Transformer (ViT-B/16) model pretrained on natural images (ImageNet-1k). All models were trained and evaluated using identical preprocessing, data splits, and optimization protocols. On an independent test set, USF-MAE achieved the highest performance across all evaluation metrics, with 90.57% accuracy, 91.15% precision, 90.57% recall, and 90.71% F1-score. This represents an improvement of +2.03% in accuracy and +1.98% in F1-score compared with the strongest baseline, ResNet-18. The proposed approach demonstrated robust performance without reliance on aggressive image preprocessing or region-of-interest cropping and showed improved discrimination of non-diagnostic frames.
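A sketch of the fine-tuning stage: attach a five-way head to a ViT-B/16 encoder. The timm model name and the checkpoint path are placeholders; in the paper's setting the MAE-pretrained ultrasound weights would be loaded instead of ImageNet weights:

```python
import timm
import torch
import torch.nn as nn

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=5)
# Hypothetical: swap in the USF-MAE encoder weights before fine-tuning.
# model.load_state_dict(torch.load("usf_mae_encoder.pt"), strict=False)
criterion = nn.CrossEntropyLoss()  # aorta / AV flows / V sign / X sign / Other
```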
[517] An Adaptive, Disentangled Representation for Multidimensional MRI Reconstruction
Ruiyang Zhao, Fan Lam
Main category: eess.IV
TL;DR: A novel learned feature-based representation for multidimensional MRI reconstruction that disentangles geometry and contrast into separate latent spaces, enabling better feature correlation exploitation and incorporation of pre-learned priors without task-specific training.
Details
Motivation: To address the challenge of reconstructing multidimensional MRI data when only limited task-specific training data is available, by developing a representation that can better exploit feature correlations and incorporate pre-learned priors.
Method: Uses encoder-decoder network with style-based decoder design and image transfer training on large public data to disentangle features (geometry vs contrast) into distinct latent spaces. Introduces latent diffusion model for stronger feature space constraints. Develops reconstruction formulations with zero-shot self-supervised learning adaptation and subspace modeling.
Result: Achieved improved performance over state-of-the-art reconstruction methods on accelerated T1 and T2 parameter mapping, without requiring task-specific supervised training or fine-tuning.
Conclusion: The method offers a new strategy for learning-based multidimensional image reconstruction when limited data is available for problem-specific training, demonstrating the value of disentangled feature representations and pre-learned priors.
Abstract: We present a new approach for representing and reconstructing multidimensional magnetic resonance imaging (MRI) data. Our method builds on a novel, learned feature-based image representation that disentangles different types of features, such as geometry and contrast, into distinct low-dimensional latent spaces, enabling better exploitation of feature correlations in multidimensional images and incorporation of pre-learned priors specific to different feature types for reconstruction. More specifically, the disentanglement was achieved via an encoder-decoder network and image transfer training using large public data, enhanced by a style-based decoder design. A latent diffusion model was introduced to impose stronger constraints on distinct feature spaces. New reconstruction formulations and algorithms were developed to integrate the learned representation with a zero-shot self-supervised learning adaptation and subspace modeling. The proposed method has been evaluated on accelerated T1 and T2 parameter mapping, achieving improved performance over state-of-the-art reconstruction methods, without task-specific supervised training or fine-tuning. This work offers a new strategy for learning-based multidimensional image reconstruction where only limited data are available for problem-specific or task-specific training.
[518] Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge
Yudi Sang, Yanzhen Liu, Sutuke Yibulayimu, Yunning Wang, Benjamin D. Killeen, Mingxu Liu, Ping-Cheng Ku, Ole Johannsen, Karol Gotkowski, Maximilian Zenk, Klaus Maier-Hein, Fabian Isensee, Peiyan Yue, Yi Wang, Haidong Yu, Zhaohong Pan, Yutong He, Xiaokun Liang, Daiqi Liu, Fuxin Fan, Artur Jurgas, Andrzej Skalski, Yuxi Ma, Jing Yang, Szymon Płotka, Rafał Litka, Gang Zhu, Yingchun Song, Mathias Unberath, Mehran Armand, Dan Ruan, S. Kevin Zhou, Qiyong Cao, Chunpeng Zhao, Xinbao Wu, Yu Wang
Main category: eess.IV
TL;DR: The PENGWIN challenge benchmarked pelvic fracture segmentation algorithms on CT and X-ray images, with top CT results achieving 0.930 IoU but X-ray results (0.774 IoU) insufficient for clinical use, revealing challenges in fragment definition and overlap.
Details
Motivation: Pelvic fracture segmentation in CT and X-ray images is crucial for trauma diagnosis and surgical planning, but remains challenging due to complex anatomy and imaging limitations. The PENGWIN challenge aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms.
Method: Organized as a MICCAI 2024 satellite event, the challenge used a diverse dataset of 150 CT scans from multiple clinical centers and simulated X-ray images generated using DeepDRR. 16 teams submitted algorithms evaluated under rigorous multi-metric testing, revealing methodological diversity in instance representation approaches.
Result: Top CT algorithm achieved 0.930 average fragment-wise IoU (satisfactory accuracy), while best X-ray algorithm achieved 0.774 IoU (promising but insufficient for intra-operative use). The challenge exposed uncertainties in fragment definition, especially for incomplete fractures, and showed varying segmentation strategies based on instance representation approaches.
Conclusion: While CT segmentation shows satisfactory accuracy, X-ray segmentation remains challenging due to fragment overlap in projection imaging. Interactive segmentation approaches integrating human decision-making with task-relevant information may be essential for improving model reliability and clinical applicability.
Abstract: The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm achieved an IoU of 0.774, which is promising but not yet sufficient for intra-operative decision-making, reflecting the inherent challenges of fragment overlap in projection imaging. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
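The headline metric can be sketched as follows, greedily matching each ground-truth fragment to its best-overlapping prediction (the challenge's official matching rule may differ):

```python
import numpy as np

def fragment_wise_iou(pred, gt):
    """Average IoU over ground-truth fragments; label 0 is background."""
    ious = []
    for g in np.unique(gt)[1:]:
        g_mask = gt == g
        best = 0.0
        for p in np.unique(pred)[1:]:
            inter = np.logical_and(g_mask, pred == p).sum()
            union = np.logical_or(g_mask, pred == p).sum()
            best = max(best, inter / union if union else 0.0)
        ious.append(best)
    return float(np.mean(ious)) if ious else 0.0
```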
[519] Hybrid Learning: A Novel Combination of Self-Supervised and Supervised Learning for Joint MRI Reconstruction and Denoising in Low-Field MRI
Haoyang Pei, Nikola Janjuvsevic, Renqing Luo, Ding Xia, Xiang Xu, William Moore, Yao Wang, Hersh Chandarana, Li Feng
Main category: eess.IV
TL;DR: Hybrid learning combines self-supervised and supervised learning for MRI reconstruction when only low-SNR training data is available, improving image quality over existing methods.
Details
Motivation: Conventional supervised MRI reconstruction requires high-quality references that are often unavailable, especially in low-field MRI. Self-supervised learning doesn't need references but performs poorly with low baseline SNR.
Method: Two-stage framework: 1) Self-supervised learning on fully sampled low-SNR data to generate pseudo-references, 2) Supervised learning using pseudo-references as targets to reconstruct and denoise undersampled noisy data.
Result: Hybrid learning consistently outperformed both standard self-supervised learning and supervised learning with noisy references across different acceleration rates, noise levels, and field strengths, achieving higher SSIM and lower NMSE.
Conclusion: Hybrid learning provides an effective solution for training deep MRI reconstruction models without high-SNR references, enabling better image quality in low-SNR settings and supporting broader clinical adoption of deep learning-based reconstruction.
Abstract: Deep learning has demonstrated strong potential for MRI reconstruction. However, conventional supervised learning requires high-quality, high-SNR references for network training, which are often difficult or impossible to obtain in different scenarios, particularly in low-field MRI. Self-supervised learning provides an alternative by removing the need for training references, but its reconstruction performance can degrade when the baseline SNR is low. To address these limitations, we propose hybrid learning, a two-stage training framework that integrates self-supervised and supervised learning for joint MRI reconstruction and denoising when only low-SNR training references are available. Hybrid learning is implemented in two sequential stages. In the first stage, self-supervised learning is applied to fully sampled low-SNR data to generate higher-quality pseudo-references. In the second stage, these pseudo-references are used as targets for supervised learning to reconstruct and denoise undersampled noisy data. The proposed technique was evaluated in multiple experiments involving simulated and real low-field MRI in the lung and brain at different field strengths. Hybrid learning consistently improved image quality over both standard self-supervised learning and supervised learning with noisy training references at different acceleration rates, noise levels, and field strengths, achieving higher SSIM and lower NMSE. The hybrid learning approach is effective for both Cartesian and non-Cartesian acquisitions. Hybrid learning provides an effective solution for training deep MRI reconstruction models in the absence of high-SNR references. By improving image quality in low-SNR settings, particularly for low-field MRI, it holds promise for broader clinical adoption of deep learning-based reconstruction methods.
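The two-stage flow can be summarized in a few lines of placeholder pseudocode (all function names are hypothetical stand-ins for the paper's training procedures):

```python
def hybrid_learning(fully_sampled_low_snr, undersampled_noisy,
                    self_supervised_train, supervised_train):
    # Stage 1: self-supervised training on fully sampled low-SNR data
    # yields a denoiser whose outputs become pseudo-references.
    denoiser = self_supervised_train(fully_sampled_low_snr)
    pseudo_refs = [denoiser(x) for x in fully_sampled_low_snr]
    # Stage 2: supervised training maps undersampled noisy data to the
    # pseudo-references, learning joint reconstruction and denoising.
    return supervised_train(undersampled_noisy, pseudo_refs)
```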
[520] GroundGazer: Camera-based indoor localization of mobile robots with millimeter accuracy at low cost
Sven Hinderer, Jakob Hüsken, Bohan Sun, Bin Yang
Main category: eess.IV
TL;DR: GroundGazer: Low-cost mm-accurate indoor localization using monocular camera and chessboard floor for autonomous mobile robots
Details
Motivation: Existing high-accuracy indoor localization systems (LiDAR, tachymeters, motion capture) are very expensive. Need affordable alternative with mm positioning accuracy for autonomous mobile robots.
Method: Uses only monocular (fisheye) camera, chessboard floor pattern, and optional laser diode. Estimates robot position with mm accuracy and heading with sub-degree accuracy through visual pattern recognition.
Result: Achieves mm-level positioning accuracy and sub-degree heading accuracy. System is simple, low-cost, easy to set up, portable, robust, scalable to large areas and robot swarms.
Conclusion: GroundGazer provides high-accuracy indoor localization at low cost, with potential for 3D position and orientation estimation extension.
Abstract: Highly accurate indoor localization systems with mm positioning accuracy are currently very expensive. They include range finders (such as LiDAR), tachymeters, and motion capture systems relying on multiple high-end cameras. In this work, we introduce a high-accuracy, planar indoor localization system named GroundGazer (GG) for autonomous mobile robots (AMRs). GG estimates the AMR’s position with mm accuracy and its heading with sub-degree accuracy. The system requires only a monocular (fisheye) camera, a chessboard floor, and an optional laser diode. Our system is simple and low-cost, easy to set up, portable, robust, scalable to large areas and robot swarms, and potentially extendable to 3D position and orientation estimation.
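The floor-pattern step maps onto standard OpenCV calls. A sketch with assumed pattern size (the paper's detection pipeline and pose recovery may differ):

```python
import cv2

def find_floor_corners(frame, pattern=(9, 6)):
    """Detect chessboard corners on the floor and refine them to sub-pixel
    accuracy; with calibrated intrinsics, cv2.solvePnP against the known
    square pitch would then recover the camera (robot) pose."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    ok, corners = cv2.findChessboardCorners(gray, pattern)
    if not ok:
        return None
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
    return cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
```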
[521] Single-View Tomographic Reconstruction Using Learned Primal Dual
Sean Breckling, Matthew Swan, Keith D. Tan, Derek Wingard, Brandon Baldonado, Yoohwan Kim, Ju-Yeon Jo, Evan Scott, Jordan Pillow
Main category: eess.IV
TL;DR: LPD method tested for single-view tomographic reconstruction of axially-symmetric targets in two X-ray modalities, outperforming traditional numerical inversion methods.
Details
Motivation: To investigate the performance of the Learned Primal Dual (LPD) method in extreme single-view tomographic reconstruction scenarios, where traditional methods face significant challenges due to limited data acquisition.
Method: Study considers two X-ray modalities: low-divergence/parallel X-rays and cone-beam imaging. Training data generated using closed-form integral transforms or physics-based ray-tracing software, then corrupted with blur and noise. Performance compared against common numerical inversion methodologies.
Result: LPD method shows promising results for single-view tomographic reconstructions of axially-symmetric targets, demonstrating effectiveness even in extreme data-limited scenarios.
Conclusion: LPD method is effective for challenging single-view tomographic reconstruction tasks and outperforms traditional numerical inversion methods in extreme acquisition-limited scenarios.
Abstract: The Learned Primal Dual (LPD) method has shown promising results in various tomographic reconstruction modalities, particularly under challenging acquisition restrictions such as limited viewing angles or a limited number of views. We investigate the performance of LPD in a more extreme case: single-view tomographic reconstructions of axially-symmetric targets. This study considers two modalities: the first assumes low-divergence or parallel X-rays. The second models a cone-beam X-ray imaging testbed. For both modalities, training data is generated using closed-form integral transforms, or physics-based ray-tracing software, then corrupted with blur and noise. Our results are then compared against common numerical inversion methodologies.
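For context, one unrolled iteration of the generic Learned Primal-Dual scheme (network widths and the operator interface are assumptions; `fwd` plays the role of the projector A and `adj` its adjoint):

```python
import torch
import torch.nn as nn

class LPDIteration(nn.Module):
    """Dual update from (dual, A primal, y); primal update from (primal, A^T dual)."""
    def __init__(self, fwd, adj, ch=32):
        super().__init__()
        self.fwd, self.adj = fwd, adj
        self.dual_net = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.PReLU(),
                                      nn.Conv2d(ch, 1, 3, padding=1))
        self.primal_net = nn.Sequential(nn.Conv2d(2, ch, 3, padding=1), nn.PReLU(),
                                        nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, primal, dual, y):
        dual = dual + self.dual_net(torch.cat([dual, self.fwd(primal), y], dim=1))
        primal = primal + self.primal_net(torch.cat([primal, self.adj(dual)], dim=1))
        return primal, dual
```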