Daily arXiv Papers - 2025-12-29

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Teaching People LLM’s Errors and Getting it Right

Nathan Stringham, Fateme Hashemi Chaleshtori, Xinyuan Yan, Zhichao Xu, Bei Wang, Ana Marasović

Main category: cs.CL

TL;DR: Teaching LLM failure patterns to users could reduce overreliance, but current automated discovery methods are unreliable and evaluation metrics need improvement.

Motivation: People overrely on LLMs because they see impressive capabilities but don't realize LLMs can fail on basic tasks. Previous attempts to teach failure patterns haven't fully succeeded, and this paper aims to understand why.

Method: 1) Examined whether failure patterns exist by grouping instances by meta-labels and evaluating LLM performance; 2) Tested if prompting and embedding-based methods can surface known failures; 3) Proposed new metric to assess users’ ability to anticipate LLM failures; 4) Conducted user study with the new metric.
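
To make step (1) concrete, here is a minimal sketch of grouping instances by meta-label and flagging groups that are both sizable and error-prone. The thresholds, column names, and toy data are illustrative assumptions, not the paper's actual criteria.

```python
import pandas as pd

# Toy per-instance records: each has a meta-label and whether the LLM answered correctly.
df = pd.DataFrame({
    "meta_label": ["arithmetic", "arithmetic", "arithmetic", "dates", "dates", "idioms"],
    "correct":    [0, 0, 1, 1, 1, 1],
})

MIN_GROUP_SIZE = 3   # "sizable": enough instances to teach a pattern (illustrative)
MAX_ACCURACY = 0.5   # "error-prone": LLM accuracy at or below this (illustrative)

stats = df.groupby("meta_label")["correct"].agg(size="count", accuracy="mean")
failures = stats[(stats["size"] >= MIN_GROUP_SIZE) & (stats["accuracy"] <= MAX_ACCURACY)]
print(failures)  # meta-labels that qualify as teachable failure patterns
```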

Result: 1) Failure patterns do exist (sizable meta-label groups where LLMs are error-prone); 2) Mixed results for automated discovery methods - prompting and embedding approaches inconsistently surface failures; 3) User study shows positive effect of teaching with the new metric (ability to anticipate failures), unlike traditional human-AI team accuracy.

Conclusion: Teaching failure patterns could mitigate overreliance, but success requires better automated failure-discovery methods and using appropriate evaluation metrics (like ability to anticipate failures) rather than just team accuracy.

Abstract: People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won’t stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing patterns in these regions. The found failure patterns are taught to users to mitigate their overreliance. Yet, this approach has not fully succeeded. In this analysis paper, we aim to understand why. We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM’s predictions on these groups. We then define criteria to flag groups that are sizable and where the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are the LLM’s failure patterns that could be taught to users, so they do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the final metric that measures teaching effectiveness. We propose to assess a user’s ability to effectively use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching with this metric, unlike the human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and using metrics like ours.

[2] Morality is Contextual: Learning Interpretable Moral Contexts from Human Data with Probabilistic Clustering and Large Language Models

Geoffroy Morlat, Marceau Nahon, Augustin Chartouny, Raja Chatila, Ismael T. Freire, Mehdi Khamassi

Main category: cs.CL

TL;DR: COMETH framework integrates probabilistic context learning with LLM semantics and human moral evaluations to model how context shapes acceptability of ambiguous actions, achieving ~60% alignment with human judgments vs ~30% for end-to-end LLMs.

Motivation: Moral judgments depend heavily on context, not just outcomes. Current approaches lack systematic modeling of how contextual factors shape moral evaluations of ambiguous actions. The paper aims to create a framework that better captures context-sensitive moral reasoning.

Method: 1) Curated dataset of 300 scenarios across six core moral actions (violating prohibitions against killing, deceiving, breaking laws) with ternary judgments from 101 participants. 2) Preprocessing pipeline standardizes actions via LLM filter and MiniLM embeddings with K-means clustering. 3) COMETH learns action-specific moral contexts by clustering scenarios from human judgment distributions using divergence criteria. 4) Generalization module extracts binary contextual features and learns feature weights in transparent likelihood-based model.
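
The standardization step (2) is a standard embed-and-cluster recipe. The sketch below assumes the sentence-transformers checkpoint all-MiniLM-L6-v2 and an illustrative cluster count; neither detail is confirmed by the paper.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

scenarios = [
    "Lying to protect a friend from harm",
    "Deceiving a customer to close a sale",
    "Breaking the speed limit to reach a hospital",
]

# Embed scenario descriptions with a MiniLM encoder (checkpoint name is an assumption).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(scenarios)

# Cluster into core-action groups; k=2 is illustrative (the paper uses six core actions).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for text, label in zip(scenarios, kmeans.labels_):
    print(label, text)
```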

Result: COMETH achieves approximately 60% alignment with majority human judgments, roughly doubling the performance of end-to-end LLM prompting (approx. 30%). The framework also provides interpretable explanations by revealing which contextual features drive predictions.

Conclusion: COMETH provides an interpretable, empirically grounded alternative to end-to-end LLMs for context-sensitive moral prediction. It combines human judgments with model-based context learning and LLM semantics to better capture how context shapes moral evaluations while maintaining transparency.

Abstract: Moral actions are judged not only by their outcomes but by the context in which they occur. We present COMETH (Contextual Organization of Moral Evaluation from Textual Human inputs), a framework that integrates a probabilistic context learner with LLM-based semantic abstraction and human moral evaluations to model how context shapes the acceptability of ambiguous actions. We curate an empirically grounded dataset of 300 scenarios across six core actions (violating “Do not kill”, “Do not deceive”, and “Do not break the law”) and collect ternary judgments (Blame/Neutral/Support) from N=101 participants. A preprocessing pipeline standardizes actions via an LLM filter and MiniLM embeddings with K-means, producing robust, reproducible core-action clusters. COMETH then learns action-specific moral contexts by clustering scenarios online from human judgment distributions using principled divergence criteria. To generalize and explain predictions, a Generalization module extracts concise, non-evaluative binary contextual features and learns feature weights in a transparent likelihood-based model. Empirically, COMETH roughly doubles alignment with majority human judgments relative to end-to-end LLM prompting (approx. 60% vs. approx. 30% on average), while revealing which contextual features drive its predictions. The contributions are: (i) an empirically grounded moral-context dataset, (ii) a reproducible pipeline combining human judgments with model-based context learning and LLM semantics, and (iii) an interpretable alternative to end-to-end LLMs for context-sensitive moral prediction and explanation.

[3] Oogiri-Master: Benchmarking Humor Understanding via Oogiri

Soichiro Murakami, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura

Main category: cs.CL

TL;DR: Researchers introduce the Oogiri-Master benchmark and Oogiri-Corpus dataset to rigorously evaluate humor understanding in LLMs using the Japanese creative-response game Oogiri, addressing limitations of previous humor datasets.

Motivation: Existing humor datasets have limitations: few candidate responses per prompt, exposure to popularity signals during ratings, and a lack of objective metrics for funniness. Better tools are needed to understand what makes responses funny to humans.

Method: Created Oogiri-Corpus with ~100 diverse candidate responses per prompt, rated independently by ~100 human judges without seeing others’ ratings. Used this to analyze linguistic factors (text length, ambiguity, incongruity resolution) and derive objective metrics. Then benchmarked LLMs and human baselines on Oogiri-Master.

Result: State-of-the-art LLMs approach human performance in humor understanding. Insight-augmented prompting improves model performance. Quantitative analysis identified linguistic factors associated with funniness and derived objective metrics for predicting human judgments.

Conclusion: The Oogiri-Master benchmark and Oogiri-Corpus dataset provide a principled basis for evaluating and advancing humor understanding in LLMs, addressing previous methodological limitations and enabling more rigorous assessment of creative thinking capabilities.

Abstract: Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, which are a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others’ ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves the model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.

[4] Beyond Heuristics: A Decision-Theoretic Framework for Agent Memory Management

Changzhi Sun, Xiangyu Chen, Jixiang Luo, Dell Zhang, Xuelong Li

Main category: cs.CL

TL;DR: The paper proposes DAM (Decision-theoretic Agent Memory), a framework that reframes memory management in LLMs as a sequential decision-making problem under uncertainty, moving beyond hand-designed heuristics.

Motivation: Current memory management in LLM systems relies on hand-designed heuristics that don't account for long-term consequences and uncertainty. Memory decisions (what to read/write) affect future retrieval and downstream behavior in unpredictable ways, creating a need for a more principled approach.

Method: Proposes DAM framework that decomposes memory management into immediate information access and hierarchical storage maintenance. Uses value functions and uncertainty estimators to evaluate candidate operations, with an aggregate policy making decisions based on estimated long-term utility and risk.
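
A minimal sketch of the decision-theoretic arbitration DAM describes: candidate memory operations scored by estimated long-term value minus a risk penalty. The scoring rule, operation types, and stand-in estimators here are hypothetical placeholders, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MemoryOp:
    kind: str      # e.g. "read", "write", "consolidate", "evict" (illustrative set)
    payload: str   # the memory content or query involved

def select_op(
    candidates: List[MemoryOp],
    value_fn: Callable[[MemoryOp], float],        # estimated long-term utility
    uncertainty_fn: Callable[[MemoryOp], float],  # estimated risk/uncertainty
    risk_aversion: float = 1.0,
) -> MemoryOp:
    """Arbitrate candidates by utility penalized by risk (an illustrative policy)."""
    return max(candidates, key=lambda op: value_fn(op) - risk_aversion * uncertainty_fn(op))

# Toy stand-ins for learned value and uncertainty estimators.
ops = [MemoryOp("write", "user prefers metric units"), MemoryOp("evict", "stale greeting")]
best = select_op(ops, value_fn=lambda op: len(op.payload) / 10, uncertainty_fn=lambda op: 0.5)
print(best.kind, "->", best.payload)
```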

Result: The paper presents a conceptual framework rather than experimental results. DAM provides a principled reframing that clarifies limitations of heuristic approaches and establishes foundation for future research on uncertainty-aware memory systems.

Conclusion: Memory management should be viewed as a sequential decision-making problem under uncertainty. DAM offers a decision-theoretic foundation for developing more sophisticated, uncertainty-aware memory systems that can better handle delayed utility and future interactions.

Abstract: External memory is a key component of modern large language model (LLM) systems, enabling long-term interaction and personalization. Despite its importance, memory management is still largely driven by hand-designed heuristics, offering little insight into the long-term and uncertain consequences of memory decisions. In practice, choices about what to read or write shape future retrieval and downstream behavior in ways that are difficult to anticipate. We argue that memory management should be viewed as a sequential decision-making problem under uncertainty, where the utility of memory is delayed and dependent on future interactions. To this end, we propose DAM (Decision-theoretic Agent Memory), a decision-theoretic framework that decomposes memory management into immediate information access and hierarchical storage maintenance. Within this architecture, candidate operations are evaluated via value functions and uncertainty estimators, enabling an aggregate policy to arbitrate decisions based on estimated long-term utility and risk. Our contribution is not a new algorithm, but a principled reframing that clarifies the limitations of heuristic approaches and provides a foundation for future research on uncertainty-aware memory systems.

[5] A Unified Definition of Hallucination, Or: It’s the World Model, Stupid

Emmy Liu, Varun Gangal, Chelsea Zou, Xiaoqi Huang, Michael Yu, Alex Chang, Zhuofu Tao, Sachin Kumar, Steven Y. Feng

Main category: cs.CL

TL;DR: The paper provides a unified definition of hallucination as inaccurate internal world modeling that’s observable to users, and proposes synthetic benchmarks to test and improve language models’ world modeling capabilities.

Motivation: Despite years of research, hallucination remains a persistent problem in even state-of-the-art language models. The authors aim to unify disparate definitions of hallucination in the literature and provide a clear framework for understanding and addressing this fundamental issue.

Method: The authors analyze historical and current definitions of hallucination, folding them into a single unified definition centered on inaccurate internal world modeling. They propose varying reference world models and knowledge conflict policies to understand different existing definitions, and outline plans for synthetic benchmarks with fully specified world models.
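
The definition's two moving parts, the reference world model and the knowledge conflict policy, can be shown with a toy check. The dictionaries and function below are hypothetical illustrations, purely to show how changing the reference world flips what counts as a hallucination; they are not the paper's benchmark.

```python
# A hallucination here is a generated claim that contradicts the chosen reference world.
knowledge_base = {"capital_of_france": "Paris"}
in_context_doc = {"capital_of_france": "Lyon"}  # a (deliberately wrong) in-context source

def is_hallucination(claim_key, claim_value, reference_world):
    """Claim hallucinates iff the reference world asserts a conflicting value."""
    return reference_world.get(claim_key) not in (None, claim_value)

claim = ("capital_of_france", "Paris")

# Varying the reference world (the knowledge-conflict policy) flips the verdict:
print(is_hallucination(*claim, knowledge_base))  # False: matches the knowledge base
print(is_hallucination(*claim, in_context_doc))  # True: contradicts the given source
```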

Result: The paper presents a unified definition of hallucination as inaccurate world modeling observable to users, which helps clarify what should/shouldn’t be called hallucination, provides common language for comparing benchmarks and mitigation techniques, and enables systematic evaluation of language models’ world modeling capabilities.

Conclusion: A unified view of hallucination as inaccurate world modeling provides clarity for research, enables better benchmark design, and offers a pathway to systematically stress-test and improve language models’ world modeling components through synthetic benchmarks with fully specified environments.

Abstract: Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed “world” or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.

[6] Gamayun’s Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe

Main category: cs.CL

TL;DR: Gamayun is a 1.5B-parameter multilingual LLM trained on 2.5T tokens with a novel two-stage approach, outperforming models trained on far larger budgets, such as LLaMA3.2-1B (9T tokens) and Qwen2.5-1.5B (18T tokens), on English/multilingual tasks, and achieving SOTA in Russian.

Motivation: Addresses the lack of research on small non-English-centric LLMs for resource-constrained environments, focusing on efficient multilingual models with special emphasis on Russian language support.

Method: Two-stage pre-training: 1) balanced multilingual training for cross-lingual alignment, 2) high-quality English enrichment to transfer performance gains across languages. Trained from scratch on 2.5T tokens.
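
One way to picture such a two-stage schedule is as stage-dependent data-mixture weights: balanced across languages first, then English-enriched. The language subset and weights below are illustrative assumptions, not Gamayun's actual mixture.

```python
import random

LANGS = ["en", "ru", "de", "fr", "zh", "ar"]  # subset of the 12 languages, for illustration

def mixture_weights(stage: int) -> dict:
    if stage == 1:  # stage 1: balanced multilingual training for cross-lingual alignment
        return {lang: 1 / len(LANGS) for lang in LANGS}
    # stage 2: high-quality English enrichment (weights are illustrative, not the paper's)
    w = {lang: 0.5 / (len(LANGS) - 1) for lang in LANGS}
    w["en"] = 0.5
    return w

def sample_language(stage: int) -> str:
    weights = mixture_weights(stage)
    return random.choices(list(weights), weights=list(weights.values()))[0]

print([sample_language(stage=1) for _ in range(5)])  # roughly uniform across languages
print([sample_language(stage=2) for _ in range(5)])  # skewed toward English
```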

Result: Outperforms LLaMA3.2-1B (9T tokens) on all benchmarks, surpasses Qwen2.5-1.5B (18T tokens) on English/multilingual tasks, matches/exceeds Qwen3 (36T tokens) outside advanced STEM, achieves SOTA in Russian including MERA benchmark.

Conclusion: Gamayun demonstrates that efficient small multilingual models can achieve competitive performance with significantly smaller training budgets through strategic two-stage training, particularly excelling in Russian language tasks.

Abstract: We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).

[7] Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: A3PO (Adaptive and Asymmetric token-level Advantage shaping for Policy Optimization) improves reasoning in large reasoning models by better allocating advantage signals to key tokens across positive and negative samples.

Motivation: Current RLVR training uses both positive and negative self-generated rollouts, but there's limited understanding of how these sample polarities affect training dynamics and behaviors. The paper aims to systematically investigate this and improve RLVR training.

Method: Systematically investigates how positive/negative samples affect RLVR training, finds positive samples sharpen existing correct patterns while negative samples encourage exploration. Proposes A3PO (Adaptive Asymmetric token-level Advantage shaping for Policy Optimization) that precisely allocates advantage signals to key tokens across different polarities.
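
The gist of asymmetric token-level advantage shaping can be sketched as scaling advantages by sample polarity while boosting tokens judged important. The weighting rule below is a hypothetical illustration, not the published A3PO formula.

```python
import torch

def shape_advantages(advantages, key_token_mask, pos_scale=1.0, neg_scale=0.5, key_boost=2.0):
    """Asymmetric, token-level advantage shaping (an illustration, not the A3PO formula).

    advantages:     (seq_len,) per-token advantages; sign reflects sample polarity.
    key_token_mask: (seq_len,) 1.0 at tokens judged critical for the reasoning step.
    """
    polarity_scale = torch.full_like(advantages, neg_scale)
    polarity_scale[advantages >= 0] = pos_scale             # treat polarities asymmetrically
    token_scale = 1.0 + (key_boost - 1.0) * key_token_mask  # concentrate signal on key tokens
    return advantages * polarity_scale * token_scale

adv = torch.tensor([0.8, 0.8, -0.6, -0.6])
mask = torch.tensor([0.0, 1.0, 0.0, 1.0])
print(shape_advantages(adv, mask))  # tensor([ 0.8000,  1.6000, -0.3000, -0.6000])
```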

Result: Experiments across five reasoning benchmarks demonstrate the effectiveness of the A3PO approach.

Conclusion: The proposed A3PO method improves RLVR training by better understanding and leveraging the different effects of positive and negative samples through adaptive, asymmetric token-level advantage shaping.

Abstract: Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.

[8] Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

Chengxu Yang, Jingling Yuan, Siqi Cai, Jiawei Jiang, Chuang Hu

Main category: cs.CL

TL;DR: HIC-Bench is a novel evaluation framework that rethinks LLM hallucinations by categorizing them as Intelligent (creative/valuable) vs Defective (erroneous), enabling systematic study of their interplay in scientific innovation.

Motivation: Current hallucination detection focuses too narrowly on factual consistency, failing to handle heterogeneous scientific tasks and balance creativity with accuracy. There's a need to quantify the creative/epistemically valuable aspects of hallucinations that remain understudied.

Method: Proposes HIC-Bench with three core features: 1) Structured IH/DH assessment using multi-dimensional metrics combining TTCT creativity metrics with hallucination-specific dimensions, 2) Cross-domain applicability across ten scientific domains with open-ended tasks, and 3) Dynamic Prompt Optimization using DHP to guide models toward creative and reliable outputs. Uses multiple LLM judges with human verification.

Result: Reveals a nonlinear relationship between Intelligent and Defective Hallucinations, demonstrating that creativity and correctness can be jointly optimized. Shows IH can act as a catalyst for creativity and that LLM hallucinations can drive scientific innovation.

Conclusion: HIC-Bench provides a valuable platform for advancing research into the creative intelligence of LLM hallucinations, positioning intelligent hallucinations as potential catalysts for scientific innovation rather than just errors to be minimized.

Abstract: Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.

[9] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Jiayi Shen, Robin Algayres, Yu-An Chung, Mido Assran, Juan Pino, Emmanuel Dupoux

Main category: cs.CL

TL;DR: SpidR is a self-supervised speech representation model that learns phonetic-rich representations for textless spoken language modeling, outperforming existing models on language benchmarks while reducing pretraining time from a week to one day.

Motivation: To enable learning language directly from speech without textual intermediates by extracting semantic representations from speech, addressing the need for efficient textless spoken language modeling.

Method: Self-supervised model trained on raw waveforms using masked prediction objective with self-distillation and online clustering; intermediate student layers predict teacher layer assignments, stabilizing clustering for higher-quality codebooks.
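
A simplified sketch of the training signal: an intermediate student layer predicts the teacher layer's online-clustering assignments over a codebook. The shapes, the nearest-centroid assignment, and the frozen codebook are simplifications of SpidR's actual procedure (for instance, the EMA teacher and codebook updates are omitted).

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats, codebook):
    """Student predicts the teacher's online-clustering assignments (simplified sketch).

    student_feats: (batch, frames, dim) from an intermediate student layer
    teacher_feats: (batch, frames, dim) from the matching teacher layer (frozen copy)
    codebook:      (num_codes, dim) cluster centroids maintained online
    """
    with torch.no_grad():
        # Teacher assignment = nearest codeword (the online-clustering step, simplified).
        expanded = codebook.unsqueeze(0).expand(teacher_feats.size(0), -1, -1)
        targets = torch.cdist(teacher_feats, expanded).argmin(dim=-1)  # (batch, frames)
    logits = student_feats @ codebook.T                                # (batch, frames, codes)
    return F.cross_entropy(logits.transpose(1, 2), targets)

student = torch.randn(2, 50, 64, requires_grad=True)
teacher = torch.randn(2, 50, 64)
codes = torch.randn(100, 64)
print(distill_loss(student, teacher, codes))
```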

Result: Outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on language modeling benchmarks (sWUGGY, sBLIMP, tSC); reduces pretraining time from a week to one day on 16 GPUs; validates speech unit quality metrics as reliable proxies for language modeling performance.

Conclusion: SpidR provides efficient, high-quality speech representations for textless language modeling with significantly reduced training time, enabling faster experimentation and advancing speech-based language learning.

Abstract: The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher’s intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.

[10] Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

Shuchang Pan, Siddharth Banerjee, Dhruv Hebbar, Siddhant Patel, Akshaj Gupta, Kan Jen Cheng, Hanjo Kim, Zeyi Austin Li, Martin Q. Ma, Tingle Li, Gopala Anumanchipalli, Jiachen Lian

Main category: cs.CL

TL;DR: A Graph-of-Thoughts framework models conversational reasoning as causal inference, predicting communicative intents and speech acts with interpretable justifications for full-duplex dialogue systems.

Motivation: Human conversation follows implicit chains of thoughts manifested as timed speech acts. Capturing this causal pathway is essential for building natural full-duplex interactive systems that can reason about conversational behaviors.

Method: Formalizes intent-to-action pathway with hierarchical labeling scheme (high-level communicative intents + low-level speech acts), uses Graph-of-Thoughts for causal inference, trains on hybrid corpus of controllable simulations with human annotations and real speech, structures streaming predictions as evolving graph with multimodal transformer.

Result: Framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems, validated on both synthetic and real duplex dialogues.

Conclusion: The Graph-of-Thoughts framework successfully models conversational reasoning as causal inference, enabling interpretable predictions of speech acts with justifications, advancing full-duplex dialogue systems.

Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.

[11] MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

Jing Han, Binwei Yan, Tianyu Guo, Zheyuan Bai, Mengyu Zheng, Hanting Chen, Ying Nie

Main category: cs.CL

TL;DR: MoR-Agent: A parameter-efficient fine-tuning framework for LLM agents using Mixture-of-Roles with specialized LoRA groups for reasoner, executor, and summarizer roles.

Motivation: Current PEFT methods for LLM agents are underexplored despite recent advances in fine-tuning LLMs for agent tasks. There's a need for parameter-efficient approaches that can handle the complex reasoning and execution requirements of agent systems.

Method: 1) Decompose agent capabilities into three roles: reasoner (comprehends queries, determines next role), executor (identifies functions/parameters to invoke), and summarizer (distills conversation info). 2) Propose Mixture-of-Roles (MoR) framework with three specialized LoRA groups, each for a distinct role. 3) Develop multi-role data generation pipeline with role-specific content completion and reliability verification for effective fine-tuning.
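
A toy sketch of the Mixture-of-Roles idea: a frozen base projection plus one low-rank adapter per role, selected at each step. The adapter class and hardcoded routing are illustrative stand-ins, not the released MoRAgent implementation.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank residual update on top of a frozen base projection (toy stand-in)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # standard LoRA init: adapter starts as a no-op

    def forward(self, base_out, x):
        return base_out + self.up(self.down(x))

dim = 32
base = nn.Linear(dim, dim)
for p in base.parameters():
    p.requires_grad = False  # the base model stays frozen; only role adapters train

roles = {"reasoner": LoRAAdapter(dim), "executor": LoRAAdapter(dim), "summarizer": LoRAAdapter(dim)}

def forward_with_role(x, role):
    return roles[role](base(x), x)

x = torch.randn(1, dim)
# In MoRAgent the reasoner picks the next role from the trajectory; hardcoded here.
print(forward_with_role(x, "executor").shape)
```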

Result: Extensive experiments and ablation studies on various LLMs and agent benchmarks demonstrate the effectiveness of the proposed method. The framework shows promising results in parameter-efficient agent fine-tuning.

Conclusion: MoR-Agent provides an effective parameter-efficient fine-tuning approach for LLM agents by decomposing agent capabilities into specialized roles and using targeted LoRA groups, with publicly available implementation.

Abstract: Despite recent advancements of fine-tuning large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agent remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user’s query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method. This project is publicly available at https://mor-agent.github.io.

[12] Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers

Md. Rakibul Islam, Most. Sharmin Sultana Samu, Md. Zahid Hossain, Farhad Uz Zaman, Md. Kamrozzaman Bhuiyan

Main category: cs.CL

TL;DR: This paper investigates AI-generated text detection in Bengali using transformer models, finding that fine-tuned XLM-RoBERTa, mDeBERTa and MultilingualBERT achieve ~91% accuracy, while zero-shot approaches perform at chance levels.

Motivation: Detection of AI-generated text in Bengali remains largely unexplored despite growing concerns about LLM misuse for disinformation and content manipulation. Bengali's rich vocabulary and complex structure make detection particularly challenging compared to other languages.

Method: The study evaluates five transformer-based models (XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base, MultilingualBERT-Base) using both zero-shot evaluation and task-specific fine-tuning for Bengali AI-generated text detection.
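
The fine-tuning setup is a standard binary sequence-classification recipe. A minimal sketch with Hugging Face transformers follows; the learning rate, single-batch training step, and placeholder texts are illustrative, not the study's reported configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary classifier: human-written (0) vs. AI-generated (1).
name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

texts = ["<Bengali human-written sentence>", "<Bengali AI-generated sentence>"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over the two classes
loss.backward()
optimizer.step()
print(float(loss))
```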

Result: Zero-shot evaluation shows all models perform near chance levels (~50% accuracy). Fine-tuning significantly improves performance: XLM-RoBERTa, mDeBERTa and MultilingualBERT achieve ~91% accuracy and F1-score. IndicBERT shows weaker performance, suggesting limited effectiveness for this task.

Conclusion: This work advances AI-generated text detection in Bengali, establishing a foundation for robust systems to counter AI-generated content. Fine-tuning is essential for effective detection, with XLM-RoBERTa, mDeBERTa and MultilingualBERT being the most promising models.

Abstract: Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali’s rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy) and highlight the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT demonstrates comparatively weaker performance, indicating limited effectiveness in fine-tuning for this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.

[13] Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu

Main category: cs.CL

TL;DR: COCONUT (Chain-of-Continuous-Thought) uses latent tokens that function as uninterpretable placeholders rather than encoding faithful reasoning, promoting shortcut usage over genuine reasoning while appearing resistant to perturbation.

Motivation: To investigate the internal mechanisms of latent tokens in LLMs from a reliability perspective, examining whether they truly encode reasoning processes or function as shortcuts that conceal reasoning deficiencies.

Method: Two complementary approaches: 1) Steering experiments perturbing specific token subsets (COCONUT vs explicit CoT) to test sensitivity and information content; 2) Shortcut experiments evaluating models under biased and out-of-distribution settings on MMLU and HotpotQA benchmarks.
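
A minimal sketch of a steering perturbation: a forward hook adds a vector to hidden states at chosen (latent-token) positions, and downstream behavior is compared with and without the hook. The layer, positions, and vector are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)             # stand-in for one transformer block's output projection
steer_positions = [2, 3]              # indices of the latent/COCONUT tokens (illustrative)
steer_vector = 0.5 * torch.randn(16)  # perturbation direction and magnitude (illustrative)

def steering_hook(module, inputs, output):
    output = output.clone()
    output[:, steer_positions, :] += steer_vector  # perturb only the chosen token subset
    return output

handle = layer.register_forward_hook(steering_hook)
hidden = torch.randn(1, 6, 16)        # (batch, seq_len, dim)
perturbed = layer(hidden)
handle.remove()

# Comparing downstream behavior with vs. without the hook measures how sensitive the
# perturbed token subset is; COCONUT tokens reportedly barely react to such steering.
print(perturbed.shape)
```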

Result: COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information compared to explicit CoT tokens. COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning.

Conclusion: COCONUT functions as a pseudo-reasoning mechanism that generates plausible traces concealing shortcut dependence rather than faithfully representing reasoning processes, revealing fundamental weaknesses in latent token approaches.

Abstract: Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

[14] CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

Rui Ke, Jiahui Xu, Shenghao Yang, Kuang Wang, Feng Jiang, Haizhou Li

Main category: cs.CL

TL;DR: CATCH is a unified framework for theme detection in dialogues that integrates context-aware topic representation, preference-guided clustering, and hierarchical theme generation to address challenges of sparse utterances and user-level preferences.

Motivation: Theme detection in user-centric dialogue systems faces challenges: sparse/short utterances make accurate topic representation difficult, and existing methods fail to capture user-level thematic preferences across dialogues while maintaining cross-dialogue consistency.

Method: CATCH framework with three core components: (1) context-aware topic representation using surrounding topic segments, (2) preference-guided topic clustering that jointly models semantic proximity and personalized feedback, and (3) hierarchical theme generation mechanism to suppress noise and produce coherent labels.
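
One plausible reading of preference-guided clustering, sketched below: blend semantic similarity with pairwise preference feedback into a single affinity matrix before clustering. The blending weight and toy matrices are assumptions, not CATCH's actual objective.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy semantic similarity between four utterance-level topic representations.
semantic_sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
# Preference feedback: +1 if the user groups two items under one theme, 0 if unknown.
preference = np.zeros((4, 4))
preference[0, 2] = preference[2, 0] = 1.0  # user links items 0 and 2

alpha = 0.7  # weight on semantics vs. preferences (illustrative)
affinity = alpha * semantic_sim + (1 - alpha) * preference
distance = 1.0 - affinity  # the preference link shrinks d(0, 2) from 0.86 to 0.56

labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(distance)
print(labels)
```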

Result: Experiments on the DSTC-12 multi-domain customer dialogue benchmark demonstrate CATCH’s effectiveness with an 8B LLM in both theme clustering and topic generation quality.

Conclusion: CATCH addresses key challenges in theme detection by integrating contextualized clustering and hierarchical generation, providing a unified solution for accurate, user-aligned topic identification in dialogue systems.

Abstract: Theme detection is a fundamental task in user-centric dialogue systems, aiming to identify the latent topic of each utterance without relying on predefined schemas. Unlike intent induction, which operates within fixed label spaces, theme detection requires cross-dialogue consistency and alignment with personalized user preferences, posing significant challenges. Existing methods often struggle with sparse, short utterances for accurate topic representation and fail to capture user-level thematic preferences across dialogues. To address these challenges, we propose CATCH (Controllable Theme Detection with Contextualized Clustering and Hierarchical Generation), a unified framework that integrates three core components: (1) context-aware topic representation, which enriches utterance-level semantics using surrounding topic segments; (2) preference-guided topic clustering, which jointly models semantic proximity and personalized feedback to align themes across dialogue; and (3) a hierarchical theme generation mechanism designed to suppress noise and produce robust, coherent topic labels. Experiments on a multi-domain customer dialogue benchmark (DSTC-12) demonstrate the effectiveness of CATCH with 8B LLM in both theme clustering and topic generation quality.

[15] Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

Abdullah Alabdullah, Lifeng Han, Chenghua Lin

Main category: cs.CL

TL;DR: Ara-HOPE is a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic translation that addresses limitations of existing metrics through a systematic error taxonomy and annotation protocol.

Motivation: Existing automatic evaluation metrics and general-purpose human evaluation frameworks fail to capture dialect-specific MT errors in DA-MSA translation, hindering progress in translation assessment and system improvement.

Method: Developed Ara-HOPE framework with a five-category error taxonomy and decision-tree annotation protocol for systematic human evaluation, then applied it to compare three MT systems: Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200.

Result: Ara-HOPE effectively highlighted systematic performance differences between MT systems, revealing that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation.

Conclusion: Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems, addressing a critical gap in Arabic MT evaluation.

Abstract: Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. The results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems.

[16] Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

Ting-Hao K. Huang, Ryan A. Rossi, Sungchul Kim, Tong Yu, Ting-Yao E. Hsu, Ho Yin Ng, C. Lee Giles

Main category: cs.CL

TL;DR: The SciCap project (2021-2025) evolved from a Penn State seed-funded idea into a major scientific figure-captioning initiative, involving data curation, evaluations, LLM adaptation, challenges, and interactive captioning tools, with lessons learned and future challenges outlined.

Motivation: To test whether domain-specific training (successful in text models like SciBERT) could also work for figure captions, and to improve scientific figure captioning through systematic research and tool development.

Method: Curated and released large collections of figure-caption pairs from arXiv papers; conducted extensive automatic and human evaluations on generated and author-written captions; adapted to large language models (LLMs); launched annual challenges; built interactive captioning systems.

Result: The project grew into a central effort shaping the scientific figure-captioning landscape, involving multi-institution collaboration and producing valuable datasets, evaluation frameworks, and practical tools for scientists.

Conclusion: The paper summarizes key technical and methodological lessons from the first five years of SciCap and outlines five major unsolved challenges with proposed directions for future research in scientific figure captioning.

Abstract: Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

[17] On The Conceptualization and Societal Impact of Cross-Cultural Bias

Vitthal Bhandari

Main category: cs.CL

TL;DR: This paper analyzes 20 recent (2025) papers on cultural bias in NLP, identifying key observations to help researchers better conceptualize and evaluate bias, advocating for more robust assessment of societal impacts from cross-cultural bias in language technologies.

Motivation: Current research on cultural bias in LLMs tends to generalize across cultures and often fails to engage with real-world stakeholders, which undermines efforts to address the fundamental problem. There's a need for better conceptualization and evaluation of cultural bias in NLP systems.

Method: Conducted literature review of 20 papers published in 2025 about cultural bias in NLP, inspired by prior work (arXiv:2005.14050v2). Analyzed these papers to extract key observations and patterns for better bias conceptualization and evaluation.

Result: Developed a set of observations from the analyzed literature that can help NLP researchers concretely conceptualize cultural bias and effectively evaluate its harms in language technologies.

Conclusion: Advocates for more robust assessment of societal impact of language technologies exhibiting cross-cultural bias, emphasizing the importance of stakeholder engagement and concrete bias conceptualization for meaningful evaluation.

Abstract: Research has shown that while large language models (LLMs) can generate their responses based on cultural context, they are not perfect and tend to generalize across cultures. However, when evaluating the cultural bias of a language technology on any dataset, researchers may choose not to engage with stakeholders actually using that technology in real life, which evades the very fundamental problem they set out to address. Inspired by the work done by arXiv:2005.14050v2, I set out to analyse recent literature about identifying and evaluating cultural bias in Natural Language Processing (NLP). I picked out 20 papers published in 2025 about cultural bias and came up with a set of observations to allow NLP researchers in the future to conceptualize bias concretely and evaluate its harms effectively. My aim is to advocate for a robust assessment of the societal impact of language technologies exhibiting cross-cultural bias.

[18] Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments

Hong Su

Main category: cs.CL

TL;DR: DeMe is a framework that modifies LLM-generated task methods using decorations from hidden goals, learned methods, and environmental feedback to adapt to unseen situations in IoT systems.

Motivation: Current LLM-based IoT systems lack a systematic way to generate new methods for unseen situations and rely on fixed, device-specific logic that cannot adapt to changing environmental conditions.

Method: Method Decoration (DeMe) modifies LLM method-generation paths using explicit decorations derived from: 1) hidden goals, 2) accumulated learned methods, and 3) environmental feedback. Decorations are extracted from universal behavioral principles, experience, and observed environmental differences, enabling method path reshuffling through pre-decoration, post-decoration, intermediate-step modification, and step insertion.
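
The four decoration operations can be pictured as plain edits on an ordered method path. The step strings and example decorations below are hypothetical IoT illustrations, not taken from the paper.

```python
# The four decoration operations over a generated method path, as plain list edits.
method_path = ["read sensor", "actuate valve", "report status"]

def pre_decorate(path, step):            # prepend a goal/safety check
    return [step] + path

def post_decorate(path, step):           # append a verification or logging step
    return path + [step]

def modify_step(path, index, new_step):  # rewrite an intermediate step
    return path[:index] + [new_step] + path[index + 1:]

def insert_step(path, index, step):      # splice in an extra step
    return path[:index] + [step] + path[index:]

path = pre_decorate(method_path, "verify device is responsive (hidden goal: safety)")
path = insert_step(path, 2, "cross-check reading against environmental feedback")
path = post_decorate(path, "log outcome to learned-method store")
print("\n".join(path))
```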

Result: Experimental results show that method decoration allows IoT devices to derive more appropriate methods when confronting unknown or faulty operating conditions.

Conclusion: DeMe provides a general framework for producing context-aware, safety-aligned, and environment-adaptive methods in intelligent IoT systems, overcoming limitations of existing approaches that lack systematic adaptation to novel situations.

Abstract: Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task-execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device-specific logic that cannot adapt to changing environmental conditions. In this paper, we propose Method Decoration (DeMe), a general framework that modifies the method-generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path through pre-decoration, post-decoration, intermediate-step modification, and step insertion, thereby producing context-aware, safety-aligned, and environment-adaptive methods. Experimental results show that method decoration allows IoT devices to derive more appropriate methods when confronting unknown or faulty operating conditions.

[19] Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco

Siyu Li, Chenwei Song, Wan Zhou, Xinyi Liu

Main category: cs.CL

TL;DR: A GraphRAG-based LLM approach that integrates knowledge graphs for enhanced reasoning in tobacco pest and disease control, combining Transformer architecture with GNNs and fine-tuned ChatGLM.

Motivation: To improve knowledge reasoning in the specialized domain of tobacco pest and disease control by integrating structured graph information with LLMs, addressing limitations of traditional LLMs in handling complex domain-specific relationships and multi-hop reasoning.

Method: 1) Use LLMs to construct a domain-specific knowledge graph of tobacco pests/diseases with entities and relationships; 2) Integrate GraphRAG framework for knowledge retrieval; 3) Combine Transformer architecture with GNN for node representation learning; 4) Fine-tune ChatGLM backbone using LoRA for parameter-efficient adaptation.
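
Step (4) is the standard LoRA recipe from the peft library. In the sketch below, the exact ChatGLM checkpoint, rank, and target modules are assumptions rather than the paper's reported settings.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Parameter-efficient adaptation of the backbone with LoRA (hyperparameters illustrative;
# the specific ChatGLM checkpoint and target modules are assumptions).
model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # low-rank dimension
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # ChatGLM's fused attention projection
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices remain trainable
```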

Result: The approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both accuracy and depth of reasoning, especially in complex multi-hop and comparative reasoning scenarios.

Conclusion: Integrating graph-structured information with LLMs through the GraphRAG framework effectively enhances knowledge reasoning in specialized domains like tobacco pest control, demonstrating superior performance in complex reasoning tasks.

Abstract: This paper proposes a large language model (LLM) approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built upon the GraphRAG framework, the proposed method enhances knowledge retrieval and reasoning by explicitly incorporating structured information from a domain-specific knowledge graph. Specifically, LLMs are first leveraged to assist in the construction of a tobacco pest and disease knowledge graph, which organizes key entities such as diseases, symptoms, control methods, and their relationships. Based on this graph, relevant knowledge is retrieved and integrated into the reasoning process to support accurate answer generation. The Transformer architecture is adopted as the core inference model, while a graph neural network (GNN) is employed to learn expressive node representations that capture both local and global relational information within the knowledge graph. A ChatGLM-based model serves as the backbone LLM and is fine-tuned using LoRA to achieve parameter-efficient adaptation. Extensive experimental results demonstrate that the proposed approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.

[20] AlignAR

Baorong Huang, Ali Asiri

Main category: cs.CL

TL;DR: AlignAR introduces a generative sentence alignment method and a new Arabic-English dataset with complex legal/literary texts, showing LLM-based approaches outperform traditional methods on challenging alignment tasks.

Motivation: Arabic-English parallel corpora are scarce and existing datasets mainly consist of simple one-to-one mappings, lacking the complexity needed to properly evaluate alignment methods for real-world applications like legal and literary translation.

Method: The paper presents AlignAR, a generative sentence alignment method, and creates a new Arabic-English dataset comprising complex legal and literary texts. They evaluate alignment methods on both “Easy” and “Hard” subsets, with the Hard subset specifically designed to reduce one-to-one mappings.
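
Alignment F1 of the kind reported here can be computed over predicted versus gold alignment links, each link being a pair of source/target sentence-index tuples. A minimal sketch follows, with toy data showing a many-to-one link of the kind the “Hard” subset emphasizes; the data representation is an assumption.

```python
def alignment_f1(predicted, gold):
    """F1 over alignment links, each a (source_ids, target_ids) pair of index tuples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold contains a 2-to-1 mapping that strict 1-to-1 aligners tend to miss.
gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
pred = [((0,), (0,)), ((1,), (1,)), ((3,), (2,))]
print(round(alignment_f1(pred, gold), 3))  # 0.667
```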

Result: LLM-based approaches demonstrated superior robustness, achieving an overall F1-score of 85.5%, which represents a 9% improvement over previous methods. Traditional alignment methods showed limitations on the challenging “Hard” subset.

Conclusion: “Easy” datasets lack discriminatory power to fully assess alignment methods. Complex datasets like the one presented are necessary for proper evaluation, and LLM-based approaches show significant promise for challenging sentence alignment tasks in Arabic-English translation.

Abstract: High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising complex legal and literary texts. Our evaluation demonstrates that “Easy” datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our “Hard” subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated superior robustness, achieving an overall F1-score of 85.5%, a 9% improvement over previous methods. Our datasets and codes are open-sourced at https://github.com/XXX.

[21] HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

Jiaxin Liu, Peiyi Tu, Wenyu Chen, Yihong Zhuang, Xinxia Ling, Anji Zhou, Chenxi Wang, Zhuo Han, Zhengkai Yang, Junbo Zhao, Zenan Huang, Yuanyuan Wang

Main category: cs.CL

TL;DR: HeartBench is a framework for evaluating Chinese LLMs’ anthropomorphic intelligence (emotional, cultural, ethical dimensions) using authentic psychological counseling scenarios, revealing significant performance gaps even in top models.

DetailsMotivation: LLMs show persistent deficits in anthropomorphic intelligence - the ability to handle social, emotional, and ethical nuances, especially in Chinese context where specialized evaluation frameworks and high-quality socio-emotional data are lacking.

Method: Developed HeartBench framework grounded in authentic psychological counseling scenarios with clinical experts, featuring theory-driven taxonomy (5 primary dimensions, 15 secondary capabilities). Uses case-specific, rubric-based methodology with “reasoning-before-scoring” protocol to translate abstract traits into measurable criteria.

Result: Evaluation of 13 state-of-the-art LLMs shows a substantial performance ceiling - even leading models achieve only 60% of the expert-defined ideal score. Performance decays significantly in “Hard Set” scenarios involving subtle emotional subtexts and complex ethical trade-offs.

Conclusion: HeartBench establishes standardized metrics for anthropomorphic AI evaluation and provides methodological blueprint for constructing high-quality, human-aligned training data, highlighting critical gaps in current LLM capabilities for nuanced social-emotional understanding.

Abstract: While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence - the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a “reasoning-before-scoring” evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified “Hard Set” reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.
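
The rubric mechanics are easy to picture in code. Below is a minimal sketch of what a case-specific, “reasoning-before-scoring” judge could look like: the judge is forced to write its analysis before emitting scores, which are then parsed from a fixed marker line. The rubric criteria, prompt wording, and `parse_scores` helper are all invented for illustration, not HeartBench's actual implementation.

```python
# Hypothetical sketch of a "reasoning-before-scoring" rubric judge.
# The rubric text and output format are invented; a real judge model
# call would replace the hard-coded demo output below.
import json

RUBRIC = [
    "Acknowledges the speaker's stated emotion before giving advice",
    "Avoids moralizing or assigning blame",
    "Offers a concrete, culturally appropriate next step",
]

def build_judge_prompt(dialogue: str) -> str:
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    return (
        f"Dialogue:\n{dialogue}\n\n"
        f"Rubric:\n{criteria}\n\n"
        "First write your REASONING for each criterion, then output a line\n"
        'SCORES: {"1": 0-2, "2": 0-2, "3": 0-2} as JSON.'
    )

def parse_scores(judge_output: str) -> dict:
    # Take the last SCORES: line so the reasoning text cannot be mistaken for it.
    line = [l for l in judge_output.splitlines() if l.startswith("SCORES:")][-1]
    return json.loads(line[len("SCORES:"):])

demo_output = 'REASONING: The reply names the anxiety explicitly...\nSCORES: {"1": 2, "2": 1, "3": 0}'
print(parse_scores(demo_output))  # {'1': 2, '2': 1, '3': 0}
```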

[22] From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

Dubai Li, Nan Jiang, Kangping Huang, Ruiqi Tu, Shuyu Ouyang, Huayu Yu, Lin Qiao, Chen Yu, Tianshu Zhou, Danyang Tong, Qian Wang, Mengtao Li, Xiaofeng Zeng, Yu Tian, Xinping Tian, Jingsong Li

Main category: cs.CL

TL;DR: Quicker is an LLM-powered clinical decision support system that automates evidence synthesis and generates clinical recommendations, reducing development time to 20-40 minutes while matching or exceeding human expert performance.

DetailsMotivation: Integrating clinical evidence into real-time practice is challenging due to workload, complex processes, and time constraints, highlighting the need for automated evidence synthesis tools to support efficient and accurate clinical decision-making.

Method: Quicker implements a fully automated chain covering all phases from questions to clinical recommendations, using LLMs to model standard clinical guideline development processes, with integrated tools and interactive user interfaces for customized decision-making.

Result: Quicker showed strong performance on Q2CRBench-3 benchmark: fine-grained question decomposition, retrieval sensitivity comparable to human experts, literature screening approaching comprehensive inclusion, and recommendations more comprehensive/logically coherent than clinicians. System-level testing showed collaboration reduced recommendation development time to 20-40 minutes.

Conclusion: Quicker demonstrates potential to help physicians make quicker and more reliable evidence-based clinical decisions by automating evidence synthesis and supporting clinical guideline development processes.

Abstract: Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker’s capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker’s strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker’s recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

[23] TimeBill: Time-Budgeted Inference for Large Language Models

Qi Fan, An Zou, Yehan Ma

Main category: cs.CL

TL;DR: TimeBill is a time-budgeted inference framework for LLMs that adaptively adjusts KV cache eviction ratios to meet varying time constraints while maintaining response quality.

DetailsMotivation: LLMs are increasingly used in time-critical systems (robotics, autonomous driving, etc.) where generating accurate responses within specific time budgets is crucial for decision-making and safety. Current methods struggle with modeling execution time and adapting KV cache eviction ratios to varying tasks and time constraints.

Method: Proposes TimeBill framework with: 1) Fine-grained response length predictor (RLP), 2) Execution time estimator (ETE) for accurate end-to-end time prediction, and 3) Time-budgeted efficient inference approach that adaptively adjusts KV cache eviction ratio based on time predictions and given budget.

Result: Extensive experiments demonstrate TimeBill’s advantages in improving task completion rate and maintaining response performance under various overrun strategies.

Conclusion: TimeBill effectively balances inference efficiency and response performance for LLMs in time-critical applications by providing adaptive KV cache management based on accurate time predictions.

Abstract: Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
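
To make the adaptive-eviction idea concrete, here is a minimal sketch of how an eviction ratio might be selected from a predicted generation length and a time budget: search for the smallest ratio whose estimated end-to-end time fits the budget. The latency model, constants, and function names are assumptions for illustration, not the paper's actual RLP/ETE components.

```python
# Illustrative only: choosing a time-budgeted KV-cache eviction ratio.
# The linear time model and its constants are assumed, not TimeBill's.

def estimate_time(prompt_len: int, gen_len: int, evict_ratio: float,
                  t_prefill: float = 0.2e-3, t_decode: float = 0.5e-3) -> float:
    """Crude end-to-end time model: linear prefill cost plus a per-token
    decode cost that shrinks as more of the KV cache is evicted."""
    kv_kept = 1.0 - evict_ratio
    return t_prefill * prompt_len + t_decode * gen_len * (0.3 + 0.7 * kv_kept)

def pick_eviction_ratio(prompt_len: int, predicted_gen_len: int,
                        budget_s: float, max_ratio: float = 0.9) -> float:
    """Smallest eviction ratio whose estimated time fits the budget,
    so response quality is degraded no more than necessary."""
    for pct in range(0, int(max_ratio * 100) + 1):
        ratio = pct / 100
        if estimate_time(prompt_len, predicted_gen_len, ratio) <= budget_s:
            return ratio
    return max_ratio  # budget unreachable even at maximum eviction

print(pick_eviction_ratio(prompt_len=4096, predicted_gen_len=512, budget_s=1.0))
```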

Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, Shouling Ji

Main category: cs.CL

TL;DR: LVLMs struggle with copyright compliance when processing copyrighted visual content, even with copyright notices. A new benchmark and tool-augmented defense framework are proposed to address this issue.

DetailsMotivation: The widespread accessibility of Large Vision-Language Models (LVLMs) raises critical concerns about potential copyright infringement when they process copyrighted content like book excerpts, news articles, music lyrics, and code documentation. Failure to comply with copyright regulations could lead to serious legal and ethical consequences.

Method: 1) Created a large-scale benchmark dataset of 50,000 multimodal query-content pairs to evaluate copyright compliance; 2) Included scenarios with and without copyright notices, covering four types of copyright notices; 3) Evaluated various LVLMs on their ability to handle copyrighted content; 4) Proposed a novel tool-augmented defense framework for copyright compliance.

Result: Even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with copyright notices. The proposed defense framework reduces infringement risks in all scenarios.

Conclusion: There is an urgent need to develop copyright-aware LVLMs to ensure responsible and lawful use of copyrighted content. The proposed benchmark and defense framework provide important steps toward addressing copyright compliance challenges in multimodal AI systems.

Abstract: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (i.e., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content – such as book excerpts, news articles, music lyrics, and code documentation – when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.

[25] CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa

Main category: cs.CL

TL;DR: CricBench: A specialized benchmark for evaluating LLMs on cricket analytics SQL tasks in English and Hindi, revealing domain-specific performance gaps and challenging assumptions about English as the optimal prompt language.

DetailsMotivation: Cricket analytics requires complex statistical insights not available through standard searches. LLMs have advanced in Text-to-SQL but their capability to handle domain-specific nuances, complex schema variations, and multilingual requirements in sports analytics remains under-explored.

Method: Created CricBench benchmark suite with domain experts manually authoring complex queries for logical correctness. Built in both English and Hindi with framework for extension to other languages. Evaluated six state-of-the-art models (GPT-4o, Claude 3.7 Sonnet, open-source models) using strict evaluation protocol.

Result: High performance on general benchmarks doesn’t guarantee success in specialized domains. DeepSeek R1 achieved SOTA (50.6%), surpassing Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), but showed significant accuracy drop from general benchmarks (BIRD) to CricBench. Code-mixed Hindi queries frequently yielded parity or higher accuracy than English.

Conclusion: Specialized domains like cricket analytics require targeted evaluation. English may not be the optimal prompt language for specialized SQL tasks, challenging conventional assumptions. Domain expertise and multilingual capabilities are crucial for LLM performance in sports analytics.

Abstract: Cricket is the second most popular sport in the world, commanding a massive following of over 2.5 billion fans globally. Enthusiasts and analysts frequently seek advanced statistical insights, such as long-term historical performance trends or complex player comparisons, that are often unavailable through standard web searches. While Large Language Models (LLMs) have advanced significantly in Text-to-SQL tasks, their capability to handle the domain-specific nuances, complex schema variations, and multilingual requirements inherent to sports analytics remains under-explored. To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. To curate a “Gold Standard” dataset, we collaborate with domain experts in cricket and SQL to manually author complex queries, ensuring logical correctness. Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol. Our results reveal that high performance on general benchmarks does not guarantee success in specialized domains. While the open-weights reasoning model DeepSeek R1 achieves state-of-the-art performance (50.6%), surpassing proprietary giants like Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), it still exhibits a significant accuracy drop when moving from general benchmarks (BIRD) to CricBench. Furthermore, we observe that code-mixed Hindi queries frequently yield parity or higher accuracy compared to English, challenging the assumption that English is the optimal prompt language for specialized SQL tasks.
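
The strict protocol for Text-to-SQL benchmarks such as BIRD is typically execution matching: a prediction counts only if it runs and returns the same rows as the gold query. The sketch below shows that standard check against an in-memory SQLite database; the cricket schema and queries are invented, and CricBench's exact protocol may differ.

```python
# Minimal sketch of execution-match scoring for Text-to-SQL.
# The batting table and both queries are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, year INT, runs INT)")
conn.executemany("INSERT INTO batting VALUES (?, ?, ?)",
                 [("Kohli", 2016, 973), ("Kohli", 2017, 308),
                  ("Rohit", 2016, 489)])

def execution_match(pred_sql: str, gold_sql: str) -> bool:
    """Correct iff the prediction executes and returns the same rows
    (order-insensitive) as the gold query."""
    try:
        pred = set(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # non-executable SQL counts as wrong
    gold = set(conn.execute(gold_sql).fetchall())
    return pred == gold

gold = "SELECT player FROM batting WHERE year = 2016 ORDER BY runs DESC LIMIT 1"
pred = ("SELECT player FROM batting WHERE year = 2016 AND "
        "runs = (SELECT MAX(runs) FROM batting WHERE year = 2016)")
print(execution_match(pred, gold))  # True: same result via different SQL
```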

[26] Explainable Statute Prediction via Attention-based Model and LLM Prompting

Sachin Pawar, Girish Keshav Palshikar, Anindita Sinha Banerjee, Nitin Ramrakhiyani, Basit Ali

Main category: cs.CL

TL;DR: This paper proposes two techniques for automatic statute prediction with explanations: AoS (Attention-over-Sentences) using smaller supervised models and LLMPrompt using larger zero-shot LLMs, both evaluated on legal datasets.

DetailsMotivation: To develop AI systems for legal applications (AI-assistant for lawyers, legal QA) that can predict relevant statutes from case descriptions and provide human-understandable explanations for better user acceptance.

Method: Two approaches: (1) AoS uses sentence transformers with attention mechanisms trained in supervised manner; (2) LLMPrompt uses larger language models with zero-shot prompting (standard and Chain-of-Thought) to predict statutes and generate explanations.

Result: Both techniques produce statute predictions with explanations. Performance is compared across two datasets against baselines. Explanation quality is evaluated through automated counter-factual analysis and human evaluation.

Conclusion: The paper presents two complementary approaches for statute prediction with explanations - one supervised (AoS) and one zero-shot (LLMPrompt), both capable of generating human-understandable justifications for legal AI applications.

Abstract: In this paper, we explore the problem of automatic statute prediction where, for a given case description, a subset of relevant statutes is to be predicted. Here, the term “statute” refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications such as an AI-assistant for lawyers and legal question answering systems. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations – (i) AoS (Attention-over-Sentences), which uses attention over sentences in a case description to predict statutes relevant to it, and (ii) LLMPrompt, which prompts an LLM to predict as well as explain the relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers, and is trained in a supervised manner, whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard as well as Chain-of-Thought (CoT) prompting techniques. Both these models produce explanations for their predictions in human understandable forms. We compare the statute prediction performance of both the proposed techniques with each other as well as with a set of competitive baselines, across two popular datasets. Also, we evaluate the quality of the generated explanations through automated counterfactual analysis as well as through human evaluation.
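
A rough sketch of the attention-over-sentences idea follows: score each sentence of a case description against a per-statute query vector, pool sentence embeddings with the softmaxed scores, and classify from the attended representation, with the highest-attention sentence doubling as the explanation. All embeddings and weights below are random stand-ins, not the trained AoS model.

```python
# Toy attention-over-sentences for statute prediction; everything is random
# placeholder data standing in for learned parameters and real embeddings.
import numpy as np

rng = np.random.default_rng(0)
d, n_sents, n_statutes = 384, 6, 3

sent_emb = rng.normal(size=(n_sents, d))             # sentence embeddings
statute_queries = rng.normal(size=(n_statutes, d))   # learned, one per statute
clf_w = rng.normal(size=(n_statutes, d))             # per-statute classifier

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for s in range(n_statutes):
    attn = softmax(sent_emb @ statute_queries[s] / np.sqrt(d))
    doc_repr = attn @ sent_emb                       # attention-weighted pooling
    prob = 1 / (1 + np.exp(-clf_w[s] @ doc_repr))    # relevance probability
    top = int(attn.argmax())
    # The highest-attention sentence serves as the human-readable explanation.
    print(f"statute {s}: p={prob:.2f}, explanatory sentence #{top}")
```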

[27] Accelerate Speculative Decoding with Sparse Computation in Verification

Jikai Wang, Jianchao Tan, Yuxuan Hu, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Juntao Li, Min Zhang

Main category: cs.CL

TL;DR: A sparse verification framework for speculative decoding that reduces computation in verification stage by jointly sparsifying attention, FFN, and MoE components with retrieval reuse strategies.

DetailsMotivation: The verification stage in speculative decoding becomes the dominant computational bottleneck, especially for long-context inputs and MoE models, while existing sparsification methods are designed for standard token-by-token decoding.

Method: Systematically adopts different sparse methods on verification stage, identifies structured redundancy across multiple dimensions, proposes sparse verification framework that jointly sparsifies attention, FFN, and MoE components, and incorporates inter-draft token and inter-layer retrieval reuse strategy.

Result: Extensive experiments across summarization, QA, and mathematical reasoning datasets show favorable efficiency-accuracy trade-offs while maintaining stable acceptance length.

Conclusion: The proposed sparse verification framework effectively reduces dominant computation cost in speculative decoding verification stage without additional training, achieving better efficiency-accuracy balance.

Abstract: Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding to remove substantial computational redundancy in LLMs. This work systematically applies different sparsification methods to the verification stage of speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework also incorporates an inter-draft-token and inter-layer retrieval reuse strategy to further reduce redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs, while maintaining stable acceptance length.
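
As a toy illustration of the attention component of sparse verification, the sketch below lets each draft-token query attend only to its top-k cached keys instead of the full prefix. Shapes and k are arbitrary, and the paper's joint FFN/MoE sparsification and retrieval-reuse strategy are omitted entirely.

```python
# Toy sparse attention for the verification pass of speculative decoding.
import numpy as np

rng = np.random.default_rng(1)
d, prefix_len, n_draft, k = 64, 1024, 4, 128

K = rng.normal(size=(prefix_len, d))   # cached keys for the prefix
V = rng.normal(size=(prefix_len, d))   # cached values
Q = rng.normal(size=(n_draft, d))      # queries for the parallel draft tokens

def sparse_attend(q, K, V, k):
    scores = K @ q / np.sqrt(d)                  # a real system would retrieve
    top = np.argpartition(scores, -k)[-k:]       # approximate top-k instead of
    w = np.exp(scores[top] - scores[top].max())  # scoring every key, as done here
    w /= w.sum()                                 # softmax over the kept keys only
    return w @ V[top]

out = np.stack([sparse_attend(q, K, V, k) for q in Q])
print(out.shape)  # (4, 64): one attended output per draft token
```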

[28] SWE-RM: Execution-free Feedback For Software Engineering Agents

KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He

Main category: cs.CL

TL;DR: The paper introduces SWE-RM, a robust reward model for software engineering agents that improves both test-time scaling and reinforcement learning performance by addressing limitations of execution-based feedback.

DetailsMotivation: Execution-based feedback (like unit testing) has limitations: it requires scalable test case collection, provides sparse feedback, and cannot distinguish between equally successful/unsuccessful trajectories. Execution-free feedback from reward models offers more fine-grained signals but remains underexplored for realistic SWE agents.

Method: The authors identify that TTS performance doesn’t generalize to RL, so they focus on classification accuracy and calibration as crucial RL metrics. They conduct controlled experiments on training data scale, policy mixtures, and data source composition. They then develop SWE-RM, a mixture-of-experts architecture with 30B total parameters (3B activated during inference).

Result: SWE-RM substantially improves SWE agents on both TTS and RL performance. It increases Qwen3-Coder-Flash accuracy from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new SOTA among open-source models.

Conclusion: Execution-free reward models like SWE-RM can provide more effective feedback for SWE agents than execution-based methods, with proper attention to classification accuracy and calibration for RL generalization. The mixture-of-experts architecture enables robust performance across both TTS and RL paradigms.

Abstract: Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. While aiming to develop versatile reward models that are effective across both TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model’s ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
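
The two verifier properties singled out for RL are both easy to compute. A minimal sketch with made-up scores: classification accuracy at a 0.5 threshold, plus a standard expected calibration error; SWE-RM's exact formulations may differ.

```python
# Sketch of verifier metrics: accuracy and a standard expected calibration
# error (ECE). Scores and labels are invented illustration data.
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected calibration error: |accuracy - confidence| averaged over
    equal-width confidence bins, weighted by bin size."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return total

probs = [0.9, 0.8, 0.75, 0.3, 0.2, 0.6]  # verifier P(trajectory succeeds)
labels = [1, 1, 0, 0, 0, 1]              # ground-truth unit-test outcomes
acc = np.mean((np.array(probs) > 0.5) == np.array(labels))
print(f"classification accuracy={acc:.2f}, ECE={ece(probs, labels):.3f}")
```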

[29] Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Sachin Pawar, Manoj Apte, Kshitij Jadhav, Girish Keshav Palshikar, Nitin Ramrakhiyani

Main category: cs.CL

TL;DR: The paper hypothesizes that breaking natural words into multiple tokens negatively impacts LLM performance, proposes penalty functions to quantify tokenization quality, and validates this statistically across multiple LLMs and NLP tasks.

DetailsMotivation: LLM tokenization differs from traditional NLP tokenization by splitting natural words into multiple tokens due to limited vocabulary size, potentially harming model performance on NLP tasks.

Method: Proposes a set of penalty functions that compute tokenization penalty for given text, indicating how “bad” the tokenization is for a specific LLM.

Result: Establishes statistical significance of the hypothesis that breaking natural words negatively impacts LLM performance on multiple NLP tasks across different LLMs.

Conclusion: Tokenization quality matters for LLM performance; proposed penalty functions can quantify this effect, showing statistically significant negative impact of word fragmentation.

Abstract: Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model’s fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of “natural” words. In LLMs, a natural word may also be broken into multiple tokens due to the limited vocabulary size of the LLM (e.g., Mistral’s tokenizer splits “martial” into “mart” and “ial”). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how “bad” the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
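
As a guess at the simplest member of such a penalty family, the sketch below measures the fraction of natural words a tokenizer fragments into two or more subword tokens, using Hugging Face's AutoTokenizer. The paper proposes a whole set of penalty functions; this shows only one plausible instance, not the authors' definitions.

```python
# Hypothetical word-fragmentation penalty; the paper's actual penalty
# functions may weight or aggregate fragmentation differently.
from transformers import AutoTokenizer

def fragmentation_penalty(text: str, tokenizer) -> float:
    """Fraction of whitespace-delimited words split into 2+ subword tokens."""
    words = text.split()
    broken = sum(
        len(tokenizer.tokenize(" " + w)) > 1  # leading space: mid-sentence form
        for w in words
    )
    return broken / max(len(words), 1)

# Any Hugging Face tokenizer works; Mistral's is used to match the example
# in the abstract ("martial" -> "mart" + "ial").
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(fragmentation_penalty("martial arts require discipline", tok))
```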

[30] Self-attention vector output similarities reveal how machines pay attention

Tal Halevi, Yarden Tzach, Ronit D. Gross, Shalom Rosner, Ido Kanter

Main category: cs.CL

TL;DR: This paper introduces a method to quantify information processing in self-attention mechanisms, analyzing BERT-12 to reveal how attention heads specialize in different linguistic features and how similarity patterns evolve across layers.

DetailsMotivation: While self-attention has advanced NLP significantly, the precise mechanisms underlying its learning process and quantitative characterization remain poorly understood. The study aims to develop new approaches for quantifying information processing within self-attention mechanisms.

Method: The study introduces a new approach for quantifying information processing in self-attention by analyzing BERT-12 architecture. It examines attention maps, derives context similarity matrices measuring scalar products between token vectors, and analyzes vector space emerging from self-attention heads across different layers.

Result: Analysis reveals that final layers focus on sentence separator tokens (suggesting text segmentation approach), different attention heads specialize in different linguistic characteristics (token repetitions, common tokens with context), and similarity patterns evolve from long-range to short-range across layers, culminating in strong within-sentence similarities. Each head focuses on unique tokens to build similarity pairs.

Conclusion: The study provides quantitative insights into self-attention mechanisms, revealing systematic specialization of attention heads and evolution of similarity patterns across layers, offering practical implications for text segmentation and understanding how transformers process linguistic information.

Abstract: The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of self-attention underlying its advanced learning and the quantitative characterization of this learning process remain open research questions. This study introduces a new approach for quantifying information processing within the self-attention mechanism. The analysis conducted on the BERT-12 architecture reveals that, in the final layers, the attention map focuses on sentence separator tokens, suggesting a practical approach to text segmentation based on semantic features. Based on the vector space emerging from the self-attention heads, a context similarity matrix, measuring the scalar product between two token vectors, was derived, revealing distinct similarities between different token vector pairs within each head and layer. The findings demonstrated that different attention heads within an attention block focused on different linguistic characteristics, such as identifying token repetitions in a given text or recognizing a token of common appearance in the text and its surrounding context. This specialization is also reflected in the distribution of distances between token vectors with high similarity as the architecture progresses. The initial attention layers exhibit predominantly long-range similarities; however, as the layers progress, shorter-range similarities develop, culminating in a preference for attention heads to create strong similarities within the same sentence. Finally, the behavior of individual heads was analyzed by examining the uniqueness of their most common tokens in their high-similarity elements. Each head tends to focus on a unique token from the text and builds similarity pairs centered around it.
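
A simplified version of this context-similarity analysis can be reproduced with a forward hook: capture one layer's pre-projection attention output, slice out a single head, and take scalar products between its token vectors. The layer/head choice and head-slicing details below are simplifications for illustration, not the paper's exact procedure.

```python
# Sketch: per-head context similarity matrix from BERT, via a forward hook
# on the self-attention module (whose output concatenates heads before the
# output projection). Layer/head indices are arbitrary choices.
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

captured = {}
def grab(module, inputs, output):
    captured["ctx"] = output[0]  # (batch, seq, num_heads * head_dim)

layer, head, head_dim = 11, 0, 64
model.encoder.layer[layer].attention.self.register_forward_hook(grab)

enc = tok("The cat sat on the mat. The cat slept.", return_tensors="pt")
with torch.no_grad():
    model(**enc)

vecs = captured["ctx"][0, :, head * head_dim:(head + 1) * head_dim]
sim = vecs @ vecs.T  # scalar products between token vectors for this head
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
i, j = divmod(int(sim.fill_diagonal_(-1e9).argmax()), sim.size(0))
print(f"most similar pair (layer {layer}, head {head}): {tokens[i]!r} / {tokens[j]!r}")
```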

[31] Context as a Tool: Context Management for Long-Horizon SWE-Agents

Shukai Liu, Jian Yang, Bo Jiang, Yizhi Li, Jinyang Guo, Xianglong Liu, Bryan Dai

Main category: cs.CL

TL;DR: CAT introduces a proactive context management paradigm for LLM-based software engineering agents that actively compresses historical interactions into structured summaries to maintain bounded context while preserving reasoning quality.

DetailsMotivation: Existing LLM-based SWE agents suffer from context explosion, semantic drift, and degraded reasoning in long-running interactions due to append-only context maintenance or passive compression heuristics.

Method: Proposes CAT: a callable context management tool with structured workspace (stable task semantics, condensed long-term memory, high-fidelity short-term interactions). Introduces CAT-GENERATOR framework for trajectory-level supervision and trains SWE-Compressor model.

Result: SWE-Compressor achieves 57.6% solved rate on SWE-Bench-Verified, significantly outperforming ReAct-based agents and static compression baselines while maintaining stable reasoning under bounded context budget.

Conclusion: Proactive context management integrated into agent decision-making enables scalable long-horizon reasoning for repository-scale software engineering tasks by preventing context explosion and semantic drift.

Abstract: Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.
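
A minimal sketch of the context-as-a-tool workspace follows: stable task semantics, condensed long-term memory, and verbatim short-term turns, with a callable compress() the agent can invoke at milestones. The class layout and the lambda summarizer are stand-ins; CAT trains a model (SWE-Compressor) to produce the actionable summaries.

```python
# Hypothetical structured context workspace with a callable compression tool.
from dataclasses import dataclass, field

@dataclass
class ContextWorkspace:
    task: str                                        # stable task semantics
    long_term: str = ""                              # condensed memory
    short_term: list = field(default_factory=list)   # recent turns, verbatim
    keep_recent: int = 4

    def add_turn(self, turn: str):
        self.short_term.append(turn)

    def compress(self, summarize):
        """Callable tool: fold all but the most recent turns into memory."""
        old = self.short_term[:-self.keep_recent]
        self.short_term = self.short_term[-self.keep_recent:]
        if old:
            self.long_term = summarize(self.long_term, old)

    def render(self) -> str:
        return "\n".join([f"TASK: {self.task}",
                          f"MEMORY: {self.long_term}", *self.short_term])

ws = ContextWorkspace(task="fix failing test in repo X")
for i in range(10):
    ws.add_turn(f"step {i}: ran tool, observed output {i}")
# A trained summarizer would replace this stub.
ws.compress(lambda mem, old: mem + f" [summarized {len(old)} earlier steps]")
print(ws.render())  # bounded context: task + memory + last 4 turns
```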

[32] Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

Duygu Altinok

Main category: cs.CL

TL;DR: Introduces TrGLUE, the first comprehensive Turkish NLU benchmark, plus SentiTurca for sentiment analysis, with semi-automated dataset creation and evaluation tools.

DetailsMotivation: While GLUE-style benchmarks exist for English (GLUE), Chinese (CLUE), French (FLUE), and Japanese (JGLUE), there is no comparable comprehensive benchmark for Turkish language NLU evaluation, creating a significant gap in Turkish NLP research.

Method: Created Turkish-native corpora mirroring GLUE-style domains and tasks using a semi-automated pipeline: strong LLM-based annotation, cross-model agreement checks, and human validation. Provides fine-tuning and evaluation code for transformer models.

Result: TrGLUE benchmark for Turkish NLU and SentiTurca for sentiment analysis. The method prioritizes linguistic naturalness, minimizes translation artifacts, and establishes a scalable, reproducible workflow for dataset creation.

Conclusion: TrGLUE establishes a robust evaluation framework for Turkish NLU, provides valuable resources for researchers, and offers insights into generating high-quality semi-automated datasets for other languages.

Abstract: Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
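
The cross-model agreement step of such a semi-automated pipeline might look like the toy triage below: auto-accept a label when independent LLM annotators agree, and route disagreements to human validation. The annotator outputs are hard-coded stubs rather than real model calls, and the threshold is an arbitrary choice.

```python
# Toy cross-model agreement filter for semi-automated labeling.
from collections import Counter

def triage(annotations: list[str], min_agree: int = 2):
    """Auto-accept the majority label if enough annotators agree;
    otherwise flag the example for human validation."""
    label, votes = Counter(annotations).most_common(1)[0]
    if votes >= min_agree:
        return ("auto-accept", label)
    return ("human-review", None)

batch = [
    ("Bu film harikaydi.", ["positive", "positive", "positive"]),
    ("Fena degil ama beklentimin altinda.", ["neutral", "negative", "neutral"]),
    ("Kargo cok gec geldi!", ["negative", "neutral", "positive"]),
]
for text, anns in batch:
    print(triage(anns), "<-", text)  # third example goes to human review
```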

[33] MAD: Multi-Alignment MEG-to-Text Decoding

Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, Hui Xiong

Main category: cs.CL

TL;DR: Novel end-to-end multi-alignment framework for translating MEG brain signals into text, achieving state-of-the-art performance on unseen text generation.

DetailsMotivation: Address three key gaps in current brain-computer interface research: 1) limited exploration of MEG despite its superior signal quality over EEG, 2) poor generalization to unseen text, and 3) insufficient multimodal integration for comprehensive brain activity understanding.

Method: Introduces an end-to-end multi-alignment framework for speech-decoding that translates MEG signals into text, specifically designed for totally unseen text generation directly from brain activity.

Result: Achieves impressive BLEU-1 score of 6.86 on the GWilliams dataset, significantly outperforming the baseline of 5.49, demonstrating substantial improvement in text generation from MEG signals.

Conclusion: The proposed multi-alignment framework advances MEG-based brain-computer interface research toward real-world applications and shows strong potential for improving text generation from brain activity.

Abstract: Deciphering language from brain activity is a crucial task in brain-computer interface (BCI) research. Non-invasive cerebral signaling techniques including electroencephalography (EEG) and magnetoencephalography (MEG) are becoming increasingly popular due to their safety and practicality, avoiding invasive electrode implantation. However, current work leaves three points under-investigated: 1) a predominant focus on EEG with limited exploration of MEG, which provides superior signal quality; 2) poor performance on unseen text, indicating the need for models that can better generalize to diverse linguistic contexts; 3) insufficient integration of information from other modalities, which could potentially constrain our capacity to comprehensively understand the intricate dynamics of brain activity. This study presents a novel approach for translating MEG signals into text using a speech-decoding framework with multiple alignments. Our method is the first to introduce an end-to-end multi-alignment framework for totally unseen text generation directly from MEG signals. We achieve an impressive BLEU-1 score on the GWilliams dataset, significantly improving over the baseline from 5.49 to 6.86. This improvement demonstrates the advancement of our model towards real-world applications and underscores its potential in advancing BCI research. Code is available at https://github.com/NeuSpeech/MAD-MEG2text.
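
For reference, BLEU-1 is unigram precision with a brevity penalty; the reported 5.49 and 6.86 are presumably on a 0-100 scale. The snippet below computes it with NLTK on invented sentences (not GWilliams data) to show what the metric measures.

```python
# BLEU-1 on made-up token sequences, using NLTK's standard implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "quick", "brown", "fox", "jumped"]
hypothesis = ["the", "brown", "fox", "slept"]

score = sentence_bleu([reference], hypothesis,
                      weights=(1.0, 0, 0, 0),  # unigram-only = BLEU-1
                      smoothing_function=SmoothingFunction().method1)
# 3/4 unigrams match, times the brevity penalty exp(1 - 5/4) ~= 0.78
print(f"BLEU-1 = {score:.3f}")
```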

[34] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, Jing Li

Main category: cs.CL

TL;DR: Proposes group-based multi-agent debate method that reduces token costs by up to 51.7% while potentially improving accuracy by 25% through intra-group debates and inter-group result sharing.

DetailsMotivation: Multi-agent debates improve logical reasoning but become expensive with more agents/rounds due to high token costs, limiting scalability. Need to reduce token consumption while maintaining or enhancing debate benefits.

Method: Divide agents into multiple debate groups, conduct debates within groups, and share interim debate results between groups to reduce redundant token usage while preserving collective reasoning benefits.

Result: Reduces total tokens by up to 51.7% during debates while potentially enhancing accuracy by as much as 25% across multiple datasets.

Conclusion: Group-based multi-agent debate significantly improves performance and efficiency of interactions, making multi-agent debates more scalable and cost-effective for logical reasoning tasks.

Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance logical reasoning abilities through techniques such as Chain-of-Thought, Chain-of-Thought with Self-Consistency, Tree-of-Thoughts, and multi-agent debates. In the context of multi-agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the token cost of debates, thereby limiting the scalability of the multi-agent debate technique. To better harness the advantages of multi-agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi-agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce total token usage by up to 51.7% during debates while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in multi-agent debates.
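
The protocol is straightforward to schematize: all-to-all debate happens only inside each small group, and only compact interim summaries cross group boundaries, so the quadratic share of the token cost is confined within groups. In the sketch below, `ask_llm` is a placeholder stub, and the group sizes and round counts are arbitrary.

```python
# Schematic group-debate orchestration; ask_llm stubs out real model calls.
def ask_llm(prompt: str) -> str:
    return f"<answer to: {prompt[:40]}...>"  # placeholder

def group_debate(question, n_agents=6, group_size=3, rounds=2):
    groups = [list(range(i, i + group_size))
              for i in range(0, n_agents, group_size)]
    summaries = ["" for _ in groups]
    for _ in range(rounds):
        new_summaries = []
        for gi, members in enumerate(groups):
            # Each group sees only the other groups' summaries, not transcripts.
            shared = " | ".join(s for i, s in enumerate(summaries)
                                if i != gi and s)
            answers = [ask_llm(f"{question}\nOther groups: {shared}\nAgent {a}:")
                       for a in members]
            new_summaries.append(ask_llm("Summarize: " + " ".join(answers)))
        summaries = new_summaries
    return ask_llm("Final answer from: " + " ".join(summaries))

print(group_debate("Is 1729 expressible as a sum of two cubes in two ways?"))
```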

[35] Don’t Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

Main category: cs.CL

TL;DR: PLANT introduces a plug-and-play attention initialization strategy using pretrained Learning-to-Rank models guided by mutual information gain to improve Extreme Multi-Label Text Classification, especially in few-shot settings and for rare labels.

DetailsMotivation: Current Extreme Multi-Label Text Classification models struggle with learning good attention weights, which are crucial for focusing on key tokens in input text. The challenge is particularly pronounced in few-shot settings and for rare labels.

Method: PLANT uses a pretrained Learning-to-Rank model guided by mutual information gain to plant label-specific attention weights. This architecture-agnostic approach can be integrated with various large language model backbones like Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3.

Result: PLANT outperforms state-of-the-art methods across multiple tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially significant in few-shot settings with substantial improvements on rare labels.

Conclusion: Attention initialization is a key driver of performance gains in Extreme Multi-Label Text Classification. PLANT’s plug-and-play approach effectively addresses the challenge of learning good attention weights, particularly benefiting few-shot learning and rare label classification.

Abstract: State-of-the-art Extreme Multi-Label Text Classification models rely on multi-label attention to focus on key tokens in input text, but learning good attention weights is challenging. We introduce PLANT - Pretrained and Leveraged Attention - a plug-and-play strategy for initializing attention. PLANT works by planting label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain. This architecture-agnostic approach integrates seamlessly with large language model backbones such as Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3. PLANT outperforms state-of-the-art methods across tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially pronounced in few-shot settings, with substantial improvements on rare labels. Ablation studies confirm that attention initialization is a key driver of these gains. For code and trained models, see https://github.com/debjyotiSRoy/xcube/tree/plant
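
The sketch below illustrates the "planting" idea with an information-theoretic stand-in: mutual information between tokens and a label, computed on a toy bag-of-words task, becomes a normalized label-specific attention initialization. It substitutes sklearn's mutual_info_classif for the pretrained Learning-to-Rank model that PLANT actually transfers, so it captures the shape of the idea rather than the method itself.

```python
# Toy "planted" attention from token-label mutual information (a stand-in
# for PLANT's pretrained Learning-to-Rank scores).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["chest pain and shortness of breath",
        "fracture of the left wrist",
        "chest pain radiating to arm",
        "wrist sprain after a fall"]
labels = [1, 0, 1, 0]  # toy ICD-like label: cardiac vs. not

vec = CountVectorizer().fit(docs)
X = vec.transform(docs).toarray()
mi = mutual_info_classif(X, labels, discrete_features=True)

attn_init = mi / (mi.sum() + 1e-9)  # normalized: one label's attention row
for word, a in sorted(zip(vec.get_feature_names_out(), attn_init),
                      key=lambda t: -t[1])[:5]:
    print(f"{word:10s} {a:.3f}")    # label-indicative tokens get high weight
```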

[36] An Exploration of Higher Education Course Evaluation by Large Language Models

Bo Yuan, Jiazi Hu

Main category: cs.CL

TL;DR: LLMs can perform automated course evaluation at micro (classroom discussion) and macro (holistic course) levels, with fine-tuned Llama showing superior reliability and correlation with human evaluators.

DetailsMotivation: Traditional course evaluation methods (surveys, observations, expert reviews) suffer from subjectivity, high labor costs, and limited scalability. Recent LLM advancements offer opportunities for consistent, fine-grained, and scalable evaluations.

Method: Used three representative LLMs to evaluate courses at micro level (classroom discussion analysis) and macro level (holistic course review). Analyzed classroom interaction transcripts and 100 courses from a major Chinese institution. Compared fine-tuned Llama with other models.

Result: LLMs can extract key pedagogical features and generate structured evaluation results aligned with expert judgement. Fine-tuned Llama showed superior reliability, producing score distributions with greater differentiation and stronger correlation with human evaluators than other models.

Conclusion: LLM-based evaluation is a promising practical tool for quality assurance and educational decision-making in large-scale higher education settings, providing systematic, interpretable evaluations and actionable teaching improvement insights.

Abstract: Course evaluation plays a critical role in ensuring instructional quality and guiding curriculum development in higher education. However, traditional evaluation methods, such as student surveys, classroom observations, and expert reviews, are often constrained by subjectivity, high labor costs, and limited scalability. With recent advancements in large language models (LLMs), new opportunities have emerged for generating consistent, fine-grained, and scalable course evaluations. This study investigates the use of three representative LLMs for automated course evaluation at both the micro level (classroom discussion analysis) and the macro level (holistic course review). Using classroom interaction transcripts and a dataset of 100 courses from a major institution in China, we demonstrate that LLMs can extract key pedagogical features and generate structured evaluation results aligned with expert judgement. A fine-tuned version of Llama shows superior reliability, producing score distributions with greater differentiation and stronger correlation with human evaluators than its counterparts. The results highlight three major findings: (1) LLMs can reliably perform systematic and interpretable course evaluations at both the micro and macro levels; (2) fine-tuning and prompt engineering significantly enhance evaluation accuracy and consistency; and (3) LLM-generated feedback provides actionable insights for teaching improvement. These findings illustrate the promise of LLM-based evaluation as a practical tool for supporting quality assurance and educational decision-making in large-scale higher education settings.

[37] A Comparison of DeepSeek and Other LLMs

Tianchen Gao, Jiashun Jin, Zheng Tracy Ke, Gabriel Moryoussef

Main category: cs.CL

TL;DR: DeepSeek outperforms Gemini, GPT, and Llama in classification tasks but underperforms Claude; it’s slower but cheaper, with outputs most similar to Gemini and Claude.

DetailsMotivation: To compare DeepSeek with other popular LLMs (Claude, Gemini, GPT, Llama) on classification tasks and understand its performance characteristics relative to competitors.

Method: Used two classification settings: authorship classification (human vs AI) and citation classification (4 types). Compared DeepSeek against 4 LLMs on these tasks, measured accuracy, speed, cost, and output similarity. Also created fully-labeled dataset and proposed recipe using LLMs and MADStat dataset to generate new datasets.

Result: DeepSeek outperformed Gemini, GPT, and Llama in most cases but underperformed Claude. DeepSeek is comparatively slower but cheaper to use, while Claude is much more expensive. DeepSeek’s outputs are most similar to Gemini and Claude.

Conclusion: DeepSeek offers competitive performance at lower cost, making it a practical choice despite being slower. The created datasets serve as benchmarks for future LLM research.

Abstract: Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper, we use the task of “predicting an outcome using a short text” for comparison. We consider two settings, an authorship classification setting and a citation classification setting. In the first one, the goal is to determine whether a short text is written by human or AI. In the second one, the goal is to classify a citation to one of four types using the textual content. For each experiment, we compare DeepSeek with 4 popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparatively slower than the others but has a low usage cost, while Claude is much more expensive than all the others. Finally, we find that in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all 5 LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully labeled dataset that we collected ourselves, and propose a recipe that uses the LLMs and a recent dataset, MADStat, to generate new datasets. The datasets in our paper can be used as benchmarks for future study of LLMs.

[38] Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback

Moghis Fereidouni, Md Sajid Ahmed, Adib Mosharrof, A. B. Siddique

Main category: cs.CL

TL;DR: RealTOD improves LLM-based task-oriented dialog systems through prompt chaining and fine-grained feedback, achieving significant accuracy gains on benchmark datasets.

DetailsMotivation: Instruction-tuned LLMs struggle with reliable multi-turn task completion in TOD settings, especially when generating API calls to interact with external systems.

Method: Introduces RealTOD framework with two key components: (1) prompt chaining for zero-shot generalization by synthesizing schema-aligned in-context examples, and (2) fine-grained feedback that verifies API calls against domain schemas, identifies errors, and provides targeted correction prompts.

Result: RealTOD improves Full API accuracy by 37.10% on SGD benchmark compared to AutoTOD, and by 10.32% on BiTOD compared to supervised baseline SimpleTOD. Human evaluations confirm superior task completion, fluency, and informativeness.

Conclusion: RealTOD effectively addresses LLM limitations in multi-turn task completion through innovative prompt engineering and feedback mechanisms, significantly improving reliability in task-oriented dialog systems.

Abstract: Task-oriented dialog (TOD) systems facilitate users in accomplishing complex, multi-turn tasks through natural language. While instruction-tuned large language models (LLMs) have demonstrated strong performance on a range of single-turn NLP tasks, they often struggle with reliable multi-turn task completion in TOD settings, particularly when generating API calls required to interact with external systems. To address this, we introduce RealTOD, a novel framework that improves LLM-based TOD systems through (1) prompt chaining and (2) fine-grained feedback. Prompt chaining enables zero-shot generalization to new domains by automatically synthesizing a schema-aligned in-context example for the target task. Fine-grained feedback verifies each generated API call against the domain schema, identifies specific errors, and provides targeted correction prompts. To evaluate task completion reliability, we introduce full API Call Accuracy as a robust metric, along with detailed sub-metrics to capture common failure modes. We conduct extensive experiments on the SGD and BiTOD benchmarks using four LLMs. RealTOD improves Full API accuracy, surpassing state-of-the-art AutoTOD by 37.10% on SGD and supervised learning-based baseline SimpleTOD by 10.32% on BiTOD. Human evaluations further confirm that LLMs integrated with RealTOD achieve superior task completion, fluency, and informativeness compared to existing methods.
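
Fine-grained feedback of this kind reduces to schema checking plus targeted correction messages. The sketch below validates a generated API call against a hand-written schema and names the exact error; the BookRestaurant schema and slot names are invented, whereas RealTOD's schemas come from the SGD and BiTOD benchmarks.

```python
# Hypothetical schema verification producing targeted correction prompts.
SCHEMA = {
    "BookRestaurant": {
        "required": {"name", "date", "party_size"},
        "optional": {"time"},
    }
}

def verify_call(api: str, args: dict) -> list[str]:
    """Return one correction prompt per schema violation (empty if valid)."""
    if api not in SCHEMA:
        return [f"Unknown API '{api}'. Valid APIs: {sorted(SCHEMA)}"]
    spec, given, errors = SCHEMA[api], set(args), []
    for slot in spec["required"] - given:
        errors.append(f"Missing required slot '{slot}'. "
                      "Re-ask the user or re-emit the call.")
    for slot in given - spec["required"] - spec["optional"]:
        errors.append(f"Slot '{slot}' is not in the {api} schema. "
                      "Remove or rename it.")
    return errors

call = ("BookRestaurant", {"name": "Luigi's", "date": "2025-01-03", "seats": 4})
for feedback in verify_call(*call):
    print("CORRECTION PROMPT:", feedback)  # flags missing party_size, bad 'seats'
```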

[39] A Causal Lens for Evaluating Faithfulness Metrics

Kerem Zaman, Shashank Srivastava

Main category: cs.CL

TL;DR: Causal Diagnosticity is a testbed framework for evaluating faithfulness metrics of LLM explanations, using model-editing to create faithful/unfaithful explanation pairs across four tasks, finding current metrics vary in performance and need improvement.

DetailsMotivation: LLM explanations may seem plausible but not reflect true model reasoning. Existing faithfulness metrics are evaluated in isolation without principled comparisons, creating a need for standardized evaluation framework.

Method: Proposed Causal Diagnosticity framework using diagnosticity concept and model-editing methods to generate faithful-unfaithful explanation pairs. Benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. Evaluates prominent faithfulness metrics including post-hoc explanation and chain-of-thought methods.

Result: Diagnostic performance varies across tasks and models, with Filler Tokens performing best overall. Continuous metrics are generally more diagnostic than binary ones but can be sensitive to noise and model choice. No single metric consistently outperforms others across all tasks.

Conclusion: Current faithfulness metrics need improvement for robust evaluation of LLM explanations. The proposed Causal Diagnosticity framework enables principled comparison and highlights the need for more reliable faithfulness assessment methods.

Abstract: Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model’s true reasoning faithfully. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We use the concept of diagnosticity, and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with Filler Tokens performing best overall. Additionally, continuous metrics are generally more diagnostic than binary ones but can be sensitive to noise and model choice. Our results highlight the need for more robust faithfulness metrics.
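
Diagnosticity itself has a one-line core, as the setup implies: over (faithful, unfaithful) explanation pairs for the same prediction, the fraction a metric ranks correctly, so a random metric sits near 0.5. The scores below are made-up illustrations of that computation, not the paper's results.

```python
# Diagnosticity of a faithfulness metric over model-edited explanation pairs.
def diagnosticity(metric_scores):
    """metric_scores: list of (score_faithful, score_unfaithful) pairs.
    Returns the fraction of pairs where the metric ranks the faithful
    explanation strictly higher."""
    wins = sum(f > u for f, u in metric_scores)
    return wins / len(metric_scores)

pairs = [(0.8, 0.3), (0.6, 0.7), (0.9, 0.2), (0.5, 0.5), (0.7, 0.4)]
print(f"diagnosticity = {diagnosticity(pairs):.2f}")  # 3/5 ranked correctly
```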

[40] AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang

Main category: cs.CL

TL;DR: Researchers introduce AnesSuite, the first comprehensive dataset suite for anesthesiology reasoning in LLMs, featuring AnesBench evaluation benchmark and training datasets, plus Morpheus baseline models showing substantial performance improvements despite limited training.

DetailsMotivation: LLMs have gained attention in medical applications, but their reasoning capabilities in specialized domains like anesthesiology remain underexplored, a gap this work sets out to close.

Method: Created AnesSuite with AnesBench evaluation benchmark (three reasoning levels: factual retrieval, hybrid reasoning, complex decision-making) and three training datasets for CPT, SFT, and RLVR. Developed Morpheus baseline models using SFT and group relative policy optimization (GRPO).

Result: Morpheus demonstrates substantial performance improvements rivaling larger-scale models despite limited training. Comprehensive evaluations analyze key factors influencing anesthesiology reasoning performance including model characteristics, training strategies, and training data.

Conclusion: AnesSuite provides the first comprehensive infrastructure for anesthesiology reasoning in LLMs, with Morpheus establishing strong baselines. Both resources will be open-sourced to advance research in this specialized medical domain.

Abstract: The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus demonstrates substantial performance improvements, rivaling the performance of larger-scale models. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.
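
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched in a few lines; the group size and rewards below are toy values, not the paper's training setup.

```python
# Hedged sketch of the group-relative advantage used by GRPO: for each prompt,
# a group of sampled answers is scored, and each answer's advantage is its
# reward standardized against the group statistics.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one group of sampled completions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of 4 completions for one anesthesiology question (1 = correct).
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # [ 1. -1. -1.  1.]
```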

[41] Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

Cong Qi, Hanzhang Fang, Tianxing Hu, Siqi Jiang, Wei Zhi

Main category: cs.CL

TL;DR: GeneMamba is a scalable foundation model for single-cell RNA-seq using state space modeling with linear-time complexity, outperforming transformer-based methods on tasks like batch integration and cell type annotation.

DetailsMotivation: scRNA-seq data has high dimensionality, sparsity, and batch effects that pose computational challenges. Transformer models have limitations including quadratic complexity and suboptimal handling of long-range dependencies.

Method: GeneMamba uses Bi-Mamba architecture based on state space modeling to capture bidirectional gene context with linear-time complexity. Pretrained on nearly 30 million cells with biologically informed objectives including pathway-aware contrastive loss and rank-based gene encoding.

Result: GeneMamba demonstrates strong performance, interpretability, and robustness across diverse tasks including multi-batch integration, cell type annotation, and gene-gene correlation, outperforming transformer baselines.

Conclusion: GeneMamba represents a practical and powerful alternative to transformer-based methods, advancing biologically grounded, scalable tools for large-scale single-cell data analysis.

Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
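
A minimal sketch of rank-based gene encoding, one of the pretraining ingredients named above: each cell's genes are ordered by expression so the model consumes a rank sequence rather than raw counts. The gene names and top-k cutoff are hypothetical.

```python
# Hedged sketch: turn a cell's expression vector into a rank-ordered token
# sequence (most expressed genes first), the general idea behind rank-based
# gene encoding; the paper's exact tokenization may differ.
import numpy as np

def rank_encode(expression: np.ndarray, gene_ids: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Return the ids of the top_k most expressed genes, highest first."""
    order = np.argsort(-expression)          # descending by expression
    return gene_ids[order[:top_k]]

genes = np.array(["CD3D", "MS4A1", "LYZ", "NKG7", "GNLY", "PPBP"])
counts = np.array([5.0, 0.0, 12.0, 3.0, 8.0, 1.0])
print(rank_encode(counts, genes))            # ['LYZ' 'GNLY' 'CD3D' 'NKG7']
```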

[42] ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

Haoyuan Wu, Xueyi Chen, Rui Ming, Jilong Gao, Shoubo Hu, Zhuolun He, Bei Yu

Main category: cs.CL

TL;DR: ToTRL is a reinforcement learning framework that transforms LLMs’ sequential chain-of-thought reasoning into parallel tree-of-thought reasoning using puzzle games as training tasks, improving performance and efficiency.

DetailsMotivation: Current LLMs use long chain-of-thought reasoning that produces verbose outputs and follows trial-and-error approaches rather than systematic deduction. Tree-of-thoughts offers better parallel exploration but needs to be developed from existing CoT capabilities.

Method: ToTRL (tree-of-thoughts RL) - an on-policy RL framework with rule-based rewards that guides LLMs to develop parallel ToT strategies from sequential CoT strategies. Uses LLMs as players in puzzle games for training, as puzzles require exploring interdependent choices and managing constraints that naturally demand thought tree construction.

Result: ToTQwen3-8B model trained with ToTRL shows significant improvement in performance and reasoning efficiency on complex reasoning tasks compared to baseline approaches.

Conclusion: The ToTRL framework successfully transforms LLMs’ reasoning from sequential chain-of-thought to parallel tree-of-thought, leading to more efficient and effective reasoning on complex tasks through puzzle game-based training.

Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.
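
To illustrate what a rule-based puzzle reward can look like, here is a toy sketch in the spirit of ToTRL's RL setup: the environment checks interdependent constraints directly, so no learned reward model is needed. The puzzle and the partial-credit scheme are illustrative choices, not the paper's.

```python
# Hedged sketch of a rule-based reward for a toy constraint-satisfaction
# puzzle: full reward for a solution, partial credit per constraint met.
def puzzle_reward(choice: list) -> float:
    constraints = [
        sum(choice) == 10,                  # interdependent constraint 1
        len(set(choice)) == len(choice),    # interdependent constraint 2: all distinct
    ]
    satisfied = sum(constraints)
    return 1.0 if satisfied == len(constraints) else 0.25 * satisfied

print(puzzle_reward([1, 2, 3, 4]))  # 1.0: both constraints hold
print(puzzle_reward([5, 5]))        # 0.25: sums to 10 but not distinct
```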

[43] Advancing Expert Specialization for Better MoE

Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang

Main category: cs.CL

TL;DR: The paper proposes two complementary loss functions (orthogonality loss and variance loss) to improve expert specialization in Mixture-of-Experts models by reducing expert overlap and encouraging more discriminative routing, achieving up to 23.79% improvement over baselines.

DetailsMotivation: Current MoE models using auxiliary load balancing loss often lead to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training.

Method: Introduces two complementary objectives: (1) orthogonality loss to encourage experts to process distinct types of tokens, and (2) variance loss to encourage more discriminative routing decisions. These are compatible with existing auxiliary loss and optimize training without architectural changes.

Result: Experimental results across various model architectures and benchmarks show significant enhancement of expert specialization, improving classic MoE baselines with auxiliary loss by up to 23.79% while maintaining load balancing in downstream tasks.

Conclusion: The proposed simple yet effective solution addresses expert overlap and uniform routing issues in MoE models, significantly improving performance through better expert specialization without requiring architectural modifications or additional components.

Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
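
Under one reading of the two objectives, a minimal PyTorch sketch might look as follows: an orthogonality penalty on expert outputs discourages overlap, and a negated variance term on routing probabilities encourages decisive routing. The exact formulations in the paper may differ.

```python
# Hedged sketch of the two auxiliary objectives; shapes and weightings are
# illustrative, not the paper's implementation.
import torch

def orthogonality_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """expert_outputs: (num_experts, d). Penalize pairwise cosine overlap."""
    z = torch.nn.functional.normalize(expert_outputs, dim=-1)
    gram = z @ z.T                               # (E, E) cosine similarities
    off_diag = gram - torch.eye(z.shape[0])      # zero out the diagonal
    return (off_diag ** 2).mean()

def variance_loss(router_probs: torch.Tensor) -> torch.Tensor:
    """router_probs: (tokens, num_experts). Reward per-token spread (negated)."""
    return -router_probs.var(dim=-1).mean()

probs = torch.softmax(torch.randn(8, 4), dim=-1)
outs = torch.randn(4, 16)
loss = orthogonality_loss(outs) + variance_loss(probs)
print(loss.item())
```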

[44] Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study

Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong

Main category: cs.CL

TL;DR: The paper introduces a framework that evaluates LLMs’ general learning ability along three dimensions: Learning from Instructor, Learning from Concept, and Learning from Experience. Empirically, interaction improves learning, conceptual understanding is scale-emergent, and LLMs are effective few-shot but not many-shot learners.

DetailsMotivation: LLMs have shown impressive capabilities in various tasks, but their learning ability - crucial for adapting to dynamic environments and acquiring new knowledge - remains underexplored. The paper aims to address this gap by developing a comprehensive framework to evaluate LLMs' general learning abilities.

Method: The authors introduce a framework inspired by cognitive psychology and education that decomposes general learning ability into three complementary dimensions: 1) Learning from Instructor (acquiring knowledge via explicit guidance), 2) Learning from Concept (internalizing abstract structures and generalizing to new contexts), and 3) Learning from Experience (adapting through accumulated exploration and feedback). They conduct comprehensive empirical studies across these dimensions.

Result: Several key findings: (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; (iii) LLMs are effective few-shot learners but not many-shot learners. Based on these findings, the authors introduce a benchmark for unified and realistic evaluation of LLMs’ general learning abilities.

Conclusion: The proposed framework and benchmark provide diagnostic insights into LLMs’ learning capabilities and support the evaluation and development of more adaptive and human-like models. The work addresses the underexplored area of LLMs’ learning ability and offers a structured approach to understanding how LLMs learn across different cognitive dimensions.

Abstract: Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs’ general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.

[45] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu

Main category: cs.CL

TL;DR: ThinkingF improves autoformalization by enhancing both formal language mastery and natural language reasoning through data synthesis and training, achieving state-of-the-art results on benchmark datasets.

DetailsMotivation: Existing autoformalization methods using LLMs still suffer from low accuracy due to insufficient formal-language domain knowledge and weak reasoning capability for natural language understanding and informal-formal alignment.

Method: ThinkingF uses a data synthesis and training pipeline: (1) constructs two datasets - one with formal knowledge examples, another with informal-to-formal reasoning trajectories using expert templates; (2) applies SFT and RLVR to fuse and refine both abilities.

Result: The resulting 7B and 32B models achieve SOTA BEq@1 scores: 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

Conclusion: ThinkingF successfully addresses the dual challenges of formal knowledge mastery and reasoning capability for autoformalization, demonstrating that comprehensive formal knowledge combined with strong informal-to-formal reasoning leads to significant performance improvements.

Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

[46] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan

Main category: cs.CL

TL;DR: DySK-Attn is a framework that enables LLMs to efficiently integrate real-time knowledge from dynamic external sources using sparse knowledge attention over a knowledge graph.

DetailsMotivation: LLMs have static knowledge that quickly becomes outdated, and retraining them is computationally prohibitive. Existing knowledge editing techniques are slow and may cause side effects.

Method: Synergizes an LLM with a dynamic Knowledge Graph that can be updated instantaneously. Uses a sparse knowledge attention mechanism for coarse-to-fine grained search to efficiently identify relevant facts from the KG.

Result: Significantly outperforms strong baselines (RAG and model editing techniques) in both factual accuracy for updated knowledge and computational efficiency on time-sensitive QA tasks.

Conclusion: DySK-Attn offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world by efficiently integrating real-time knowledge.

Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.
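
A minimal sketch of the coarse-to-fine idea: a cheap similarity search first shortlists candidate KG facts, and dense softmax attention is applied only over that small subset. The shapes and scoring below are hypothetical.

```python
# Hedged sketch of sparse knowledge attention: coarse top-k shortlist, then
# fine softmax attention restricted to the shortlist.
import numpy as np

def coarse_to_fine_attend(query: np.ndarray, fact_keys: np.ndarray,
                          fact_values: np.ndarray, k: int = 4) -> np.ndarray:
    """query: (d,); fact_keys/values: (N, d). Returns an attended value vector."""
    scores = fact_keys @ query                     # coarse stage: cheap dot products
    top = np.argpartition(-scores, k)[:k]          # shortlist the k best facts
    w = np.exp(scores[top] - scores[top].max())    # fine stage: softmax over shortlist
    w /= w.sum()
    return w @ fact_values[top]

rng = np.random.default_rng(0)
out = coarse_to_fine_attend(rng.normal(size=8),
                            rng.normal(size=(1000, 8)),
                            rng.normal(size=(1000, 8)))
print(out.shape)  # (8,)
```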

[47] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh

Main category: cs.CL

TL;DR: A computational economics framework treats LLMs as economies of resource-constrained agents (attention heads/neurons) that allocate scarce computation to maximize task utility, enabling 40% FLOPS reduction while maintaining accuracy.

DetailsMotivation: Large language models have substantial computational costs that limit their practical deployment. There's a need for principled approaches to make LLMs more efficient under strict resource constraints while maintaining performance.

Method: Proposes a computational economics framework where LLMs are treated as internal economies of resource-constrained agents (attention heads and neuron blocks). Uses incentive-driven training that augments task loss with a differentiable computation cost term to encourage sparse and efficient activations.

Result: On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method produces models that trace a Pareto frontier and consistently outperform post-hoc pruning. Achieves roughly 40% reduction in FLOPS and lower latency for similar accuracy, with more interpretable attention patterns.

Conclusion: Economic principles provide a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints, offering a framework for computational resource allocation in neural networks.

Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a “computational economics” framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
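
The incentive-driven objective described above amounts to augmenting the task loss with a differentiable computation-cost term; a minimal PyTorch sketch follows, with the L1 activation penalty and lambda value as illustrative stand-ins for the paper's cost term.

```python
# Hedged sketch of the incentive-driven objective: task loss plus a
# differentiable proxy for computation spent, encouraging sparse activations.
import torch

def economic_loss(task_loss: torch.Tensor, activations: torch.Tensor,
                  lam: float = 1e-3) -> torch.Tensor:
    compute_cost = activations.abs().mean()   # differentiable proxy for FLOPs used
    return task_loss + lam * compute_cost

acts = torch.randn(32, 128, requires_grad=True)
loss = economic_loss(torch.tensor(0.7), acts)
loss.backward()                               # gradients now discourage dense activations
print(float(loss))
```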

[48] The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases

Emanuel Z. Fenech-Borg, Tilen P. Meznaric-Kos, Milica D. Lekovic-Bojovic, Arni J. Hentze-Djurhuus

Main category: cs.CL

TL;DR: LLMs exhibit cultural biases reflecting their training data - Western models show individualistic/low-power-distance tendencies while Eastern models show collectivistic/high-power-distance tendencies.

DetailsMotivation: To investigate cultural biases in LLMs and quantify how they inherit value orientations from their training corpora, addressing concerns about algorithmic cultural hegemony.

Method: Created Cultural Probe Dataset (CPD) with 200 prompts targeting Individualism-Collectivism and Power Distance dimensions. Compared GPT-4 (Western) and ERNIE Bot (Eastern) using standardized zero-shot prompts with human annotation. Computed Cultural Alignment Index against Hofstede’s national scores.

Result: Significant cultural divergence: GPT-4 shows individualistic/low-power-distance tendencies (IDV ~1.21, PDI ~-1.05), ERNIE Bot shows collectivistic/high-power-distance tendencies (IDV ~-0.89, PDI ~0.76). GPT-4 aligns more with USA cultural scores, ERNIE Bot with China.

Conclusion: LLMs function as statistical mirrors of their cultural training corpora, highlighting the need for culturally aware evaluation and deployment to prevent algorithmic cultural hegemony.

Abstract: Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We propose the notion of a “cultural gene” – a systematic value orientation that LLMs inherit from their training corpora – and introduce a Cultural Probe Dataset (CPD) of 200 prompts targeting two classic cross-cultural dimensions: Individualism-Collectivism (IDV) and Power Distance (PDI). Using standardized zero-shot prompts, we compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions. GPT-4 exhibits individualistic and low-power-distance tendencies (IDV score approx 1.21; PDI score approx -1.05), while ERNIE Bot shows collectivistic and higher-power-distance tendencies (IDV approx -0.89; PDI approx 0.76); differences are statistically significant (p < 0.001). We further compute a Cultural Alignment Index (CAI) against Hofstede’s national scores and find GPT-4 aligns more closely with the USA (e.g., IDV CAI approx 0.91; PDI CAI approx 0.88) whereas ERNIE Bot aligns more closely with China (IDV CAI approx 0.85; PDI CAI approx 0.81). Qualitative analyses of dilemma resolution and authority-related judgments illustrate how these orientations surface in reasoning. Our results support the view that LLMs function as statistical mirrors of their cultural corpora and motivate culturally aware evaluation and deployment to avoid algorithmic cultural hegemony.
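
The summary does not spell out the CAI formula, so the following is only a guessed illustration of an alignment index: rescale the model's probe score and a country's Hofstede score to a common range, then report one minus the absolute gap. All constants below are hypothetical.

```python
# Hedged, guessed sketch of a cultural alignment index; NOT the paper's
# confirmed formula.
def cultural_alignment_index(model_score: float, hofstede: float,
                             model_range=(-2.0, 2.0), hof_range=(0.0, 100.0)) -> float:
    norm = lambda x, lo, hi: (x - lo) / (hi - lo)   # rescale both to [0, 1]
    return 1.0 - abs(norm(model_score, *model_range) - norm(hofstede, *hof_range))

# Toy: an individualism probe score of 1.21 vs. the USA's Hofstede IDV of 91.
print(round(cultural_alignment_index(1.21, 91.0), 2))
```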

[49] Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

Markus Oehri, Giulia Conti, Kaviraj Pather, Alexandre Rossi, Laia Serra, Adrian Parody, Rogvi Johannesen, Aviaja Petersen, Arben Krasniqi

Main category: cs.CL

TL;DR: UniCR is a unified framework that converts various uncertainty evidence into calibrated correctness probabilities and enforces user-specified error budgets through principled refusal, improving trustworthiness without fine-tuning base models.

DetailsMotivation: Language models need to know not only what to answer but also when not to answer. Current approaches lack unified methods to combine heterogeneous uncertainty evidence into calibrated probabilities with formal error guarantees.

Method: UniCR learns a lightweight calibration head with temperature scaling and proper scoring, converts evidence (sequence likelihoods, self-consistency dispersion, retrieval compatibility, tool/verifier feedback) into calibrated correctness probabilities, and uses conformal risk control for distribution-free guarantees. For long-form generation, it aligns confidence with semantic fidelity using atomic factuality scores from retrieved evidence.

Result: Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under risk-coverage curve, and higher coverage at fixed risk compared to baselines. Evidence contradiction, semantic dispersion, and tool inconsistency are key abstention drivers.

Conclusion: UniCR provides a portable recipe from evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning base models, remains valid under distribution shift, and yields informative refusal messages.

Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
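
A minimal sketch of risk-controlled refusal in the spirit of conformal risk control: on a calibration set, pick the lowest confidence threshold whose finite-sample-adjusted empirical error stays within the user's budget, then refuse below it at test time. This simplification is illustrative, not UniCR's exact procedure.

```python
# Hedged sketch: choose a refusal threshold on calibrated confidences so the
# error rate among answered queries respects a user-specified budget.
import numpy as np

def refusal_threshold(conf: np.ndarray, correct: np.ndarray, budget: float) -> float:
    """conf: calibrated P(correct); correct: 0/1 labels; budget: max error rate."""
    for t in np.sort(np.unique(conf)):            # lowest qualifying threshold wins
        kept = conf >= t
        n = kept.sum()
        if n == 0:
            break
        err = ((1 - correct[kept]).sum() + 1) / (n + 1)  # conservative adjustment
        if err <= budget:
            return float(t)
    return 1.0  # refuse everything if the budget is unattainable

rng = np.random.default_rng(1)
c = rng.uniform(size=500)
y = (rng.uniform(size=500) < c).astype(int)       # well-calibrated toy confidences
print(refusal_threshold(c, y, budget=0.1))
```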

[50] CC-GSEO-Bench: A Content-Centric Benchmark for Measuring Source Influence in Generative Search Engines

Qiyuan Chen, Jiahe Chen, Hongsen Huang, Qian Shao, Jintai Chen, Renjie Hua, Hongxia Xu, Ruijia Wu, Ren Chuan, Jian Wu

Main category: cs.CL

TL;DR: CC-GSEO-Bench is a content-centric benchmark for evaluating source article influence on Generative Search Engines’ synthesized answers across diverse intents and follow-up questions.

DetailsMotivation: Generative Search Engines synthesize answers from multiple sources, breaking the traditional link between search ranking and visibility, creating a need to quantify source influence on GSE outputs.

Method: Created CC-GSEO-Bench with 1,000+ source articles and 5,000+ query-article pairs using realistic retrieval: seed queries from public QA datasets with limited augmentation, retaining only queries where source reappears in follow-up retrieval. Operationalized influence across Exposure, Faithful Credit, Causal Impact, Readability/Structure, and Trustworthiness/Safety dimensions.

Result: Developed a benchmark with article-level evaluation structure (one-to-many query clusters) that aggregates query-level signals to summarize influence strength, coverage, and stability, enabling empirical characterization of influence dynamics across content patterns.

Conclusion: CC-GSEO-Bench provides a comprehensive framework for content creators to measure and understand their articles’ influence on GSE outputs, addressing the visibility challenge in the era of generative search.

Abstract: Generative Search Engines (GSEs) synthesize conversational answers from multiple sources, weakening the long-standing link between search ranking and digital visibility. This shift raises a central question for content creators: How can we reliably quantify a source article’s influence on a GSE’s synthesized answer across diverse intents and follow-up questions? We introduce CC-GSEO-Bench, a content-centric benchmark that couples a large-scale dataset with a creator-centered evaluation framework. The dataset contains over 1,000 source articles and over 5,000 query-article pairs, organized in a one-to-many structure for article-level evaluation. We ground construction in realistic retrieval by combining seed queries from public QA datasets with limited synthesized augmentation and retaining only queries whose paired source reappears in a follow-up retrieval step. On top of this dataset, we operationalize influence along three core dimensions: Exposure, Faithful Credit, and Causal Impact, and two content-quality dimensions: Readability and Structure, and Trustworthiness and Safety. We aggregate query-level signals over each article’s query cluster to summarize influence strength, coverage, and stability, and empirically characterize influence dynamics across representative content patterns.
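
A minimal sketch of the article-level aggregation, under one illustrative reading of the three summaries (the benchmark's exact definitions may differ): influence strength as the mean query-level score, coverage as the share of queries with any influence, and stability as one minus the score dispersion.

```python
# Hedged sketch: aggregate query-level influence scores for one article's
# query cluster into strength, coverage, and stability summaries.
import statistics

def summarize_influence(scores: list) -> dict:
    influenced = [s for s in scores if s > 0]
    return {
        "strength": statistics.mean(scores),
        "coverage": len(influenced) / len(scores),
        "stability": 1.0 - (statistics.pstdev(scores) if len(scores) > 1 else 0.0),
    }

print(summarize_influence([0.8, 0.0, 0.6, 0.7, 0.0]))
```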

[51] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish

Main category: cs.CL

TL;DR: A cloud-edge collaborative architecture with multi-agent LLMs (GuideLLM, SolverLLM, JudgeLLM) is proposed for coding tasks, evaluated using the new RefactorCoderQA benchmark. The fine-tuned RefactorCoder-MoE model achieves 76.84% accuracy, outperforming existing models.

DetailsMotivation: To address limitations in existing benchmarks and improve LLMs' reasoning and problem-solving capabilities for multi-domain coding tasks by creating a more comprehensive evaluation framework and efficient cloud-edge architecture.

Method: 1) Cloud-edge collaborative architecture with three specialized LLM agents: GuideLLM (edge), SolverLLM (cloud), JudgeLLM (evaluator). 2) RefactorCoderQA benchmark covering Software Engineering, Data Science, ML, NLP using Stack Overflow challenges. 3) RefactorCoder-MoE model fine-tuned from DeepSeek-Coder-7B-Instruct using QLoRA for domain-specific coding QA.

Result: RefactorCoder-MoE achieves 76.84% overall accuracy, significantly outperforming all evaluated open-source and commercial baselines on the RefactorCoderQA benchmark.

Conclusion: The proposed cloud-edge collaborative architecture with multi-agent prompting and the RefactorCoderQA benchmark effectively enhances LLM performance for multi-domain coding tasks, with RefactorCoder-MoE demonstrating state-of-the-art results.

Abstract: To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud and responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of LLMs across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers multiple technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges sourced from Stack Overflow. We propose RefactorCoder-MoE, a fine-tuned mixture-of-experts (MoE) code language model based on DeepSeek-Coder-7B-Instruct, adapted to the RefactorCoderQA benchmark using QLoRA for domain-specific coding question answering. Extensive experiments demonstrate that RefactorCoder-MoE achieves strong and competitive performance, significantly outperforming all evaluated open-source and commercial baselines, with an overall accuracy of 76.84%.
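
A minimal sketch of the three-agent flow; the callables are hypothetical stand-ins for calls to the deployed GuideLLM (edge), SolverLLM (cloud), and JudgeLLM, and the prompt format is illustrative.

```python
# Hedged sketch of the edge-cloud orchestration: cheap guidance at the edge,
# full code generation in the cloud, automated judging at the end.
from typing import Callable, Tuple

def solve_coding_question(question: str,
                          guide: Callable[[str], str],
                          solve: Callable[[str], str],
                          judge: Callable[[str, str], float]) -> Tuple[str, float]:
    plan = guide(question)                            # edge: methodological hints
    code = solve(f"{question}\n\nApproach:\n{plan}")  # cloud: full solution
    return code, judge(question, code)                # automated quality score

code, score = solve_coding_question(
    "Reverse a linked list in Python.",
    guide=lambda q: "Iterate with prev/curr pointers.",
    solve=lambda p: "def reverse(head): ...",
    judge=lambda q, c: 0.9,
)
print(score)
```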

[52] BARD: Budget-Aware Reasoning Distillation

Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo zheng

Main category: cs.CL

TL;DR: BARD is a budget-aware reasoning distillation framework that enables fine-grained control over reasoning length while transferring reasoning capabilities to smaller models.

DetailsMotivation: Current CoT distillation methods produce redundant reasoning processes with uncontrollable computational budgets, leading to inefficient resource usage in smaller language models.

Method: Two-phase training: 1) SFT on teacher-generated long CoT data compressed to various budget levels, 2) RL with reward balancing reasoning performance and budget fidelity.

Result: An 8B student model achieves strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) with precise adaptive control over reasoning length across budgets.

Conclusion: BARD successfully enables simultaneous distillation of reasoning capability and fine-grained control over reasoning length, addressing computational efficiency concerns in reasoning models.

Abstract: While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and its computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this, BARD introduces a two-phase training regimen. The first phase performs Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model’s understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) from a reward signal that accounts for reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
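
A minimal sketch of a budget-aware reward in the spirit of BARD's RL phase: reward correctness and penalize the relative deviation of the reasoning length from the user-specified budget. The weighting and penalty shape are assumptions, not the paper's exact reward.

```python
# Hedged sketch: combine a correctness reward with a budget-fidelity term that
# decays as the generated length drifts from the thinking budget.
def budget_aware_reward(correct: bool, n_tokens: int, budget: int,
                        alpha: float = 0.5) -> float:
    budget_fidelity = 1.0 - min(1.0, abs(n_tokens - budget) / budget)
    return float(correct) + alpha * budget_fidelity

print(budget_aware_reward(True, 900, 1000))   # near budget -> high reward
print(budget_aware_reward(True, 3000, 1000))  # overshoot -> fidelity term floors at 0
```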

[53] TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine

Zihao Cheng, Yuheng Lu, Huaiqian Ye, Zeming Liu, Minqi Wang, Jingjing Liu, Zihan Li, Wei Fan, Yuanfang Guo, Ruiji Fu, Shifeng She, Gang Wang, Yunhong Wang

Main category: cs.CL

TL;DR: TCM-Eval: First dynamic benchmark for Traditional Chinese Medicine with SI-CoTE method to enrich training data, resulting in ZhiMingTang LLM that exceeds human practitioner passing thresholds.

DetailsMotivation: LLMs have shown promise in modern medicine but face limitations in Traditional Chinese Medicine due to lack of standardized benchmarks and high-quality training data.

Method: Created TCM-Eval benchmark from national medical licensing exams, developed Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich QA pairs with validated reasoning chains through rejection sampling, establishing data-model co-evolution cycle.

Result: Developed ZhiMingTang (ZMT), a state-of-the-art LLM for TCM that significantly exceeds passing thresholds for human practitioners. Released public leaderboard for community engagement.

Conclusion: The work addresses critical gaps in TCM AI by providing standardized evaluation, scalable data generation methods, and a high-performing specialized LLM, enabling future research and development in Traditional Chinese Medicine.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains through rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
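
At a high level, SI-CoTE's rejection sampling can be sketched as follows: sample several chains of thought per question and keep only those whose final answer validates against the gold label. Here generate_cot and extract_answer are hypothetical stand-ins for the actual generator and answer parser.

```python
# Hedged sketch: enrich a QA pair with reasoning chains via rejection
# sampling, keeping only chains that end in the gold answer.
from typing import Callable, List

def rejection_sample_cots(question: str, gold: str, n: int,
                          generate_cot: Callable[[str], str],
                          extract_answer: Callable[[str], str]) -> List[str]:
    kept = []
    for _ in range(n):
        chain = generate_cot(question)
        if extract_answer(chain) == gold:    # validate before keeping
            kept.append(chain)
    return kept

# Toy stand-ins so the sketch runs end to end.
fake_gen = lambda q: f"...reasoning... Answer: {'B' if len(q) % 2 else 'A'}"
fake_extract = lambda cot: cot.rsplit("Answer: ", 1)[-1]
print(rejection_sample_cots("Which herb?", "B", 4, fake_gen, fake_extract))
```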

[54] Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng

Main category: cs.CL

TL;DR: Bi-RAR is a bidirectional retrieval-augmented reasoning framework that evaluates intermediate reasoning steps in both forward and backward directions using information distance metrics, optimized through multi-objective reinforcement learning to improve complex multi-step reasoning.

DetailsMotivation: Current RAG approaches struggle with complex multi-step reasoning scenarios. Most methods use outcome-based supervision without explicit guidance for intermediate steps, leading to reward hacking and degraded response quality. There's a need for better evaluation and optimization of intermediate reasoning steps.

Method: Proposes Bi-RAR framework with bidirectional information distance based on Kolmogorov complexity (approximated via LM generation probabilities) to assess information completeness of each step. Uses multi-objective reinforcement learning with cascading reward structure emphasizing early trajectory alignment.

Result: Empirical results on seven question answering benchmarks show Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with search engines during both training and inference.

Conclusion: Bi-RAR effectively addresses limitations of current RAG approaches by providing bidirectional evaluation of intermediate reasoning steps, leading to improved performance in complex multi-step reasoning tasks through better optimization of reasoning trajectories.

Abstract: Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
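
A minimal sketch of the bidirectional signal: Kolmogorov complexity is approximated with language-model negative log-likelihoods, so each intermediate reasoning state can be scored both toward the answer (forward) and back toward the question (backward). The nll callable is a hypothetical stand-in for a length-normalized -log p(target | context).

```python
# Hedged sketch: score an intermediate reasoning state in both directions
# using an LM-based stand-in for information distance.
from typing import Callable, Tuple

def bidirectional_distance(question: str, state: str, answer: str,
                           nll: Callable[[str, str], float]) -> Tuple[float, float]:
    forward = nll(state, answer)      # how far is the current reasoning from the answer?
    backward = nll(state, question)   # how well does it still address the question?
    return forward, backward

# Toy stand-in: pretend word overlap makes a target "cheaper" to generate.
toy_nll = lambda ctx, tgt: 1.0 / (1 + len(set(ctx.split()) & set(tgt.split())))
print(bidirectional_distance("who wrote Hamlet",
                             "Hamlet was written by Shakespeare",
                             "Shakespeare", toy_nll))
```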

[55] Adaptive Focus Memory for Language Models

Christopher Cruz

Main category: cs.CL

TL;DR: AFM is a lightweight context management system that dynamically assigns fidelity levels to past messages (Full, Compressed, Placeholder) based on relevance, temporal decay, and importance, enabling reliable constraint preservation in multi-turn dialogues under fixed token budgets.

DetailsMotivation: Current LLM dialogue systems use naive history management strategies - either replaying full conversations (costly) or using recency-based truncation/static summarization (causes early important constraints to drift out of context). Models retain text without reliably applying critical constraints when needed.

Method: Adaptive Focus Memory (AFM) dynamically assigns each past message one of three fidelity levels: Full (complete text), Compressed (summarized), or Placeholder (minimal representation). Decisions based on semantic relevance, temporal decay, and importance classification. Messages packed chronologically under fixed token budget.

Result: AFM succeeds in 83.3% of allergy scenario runs where all baselines fail, and preserves correct refusal behavior on tax compliance benchmark. Demonstrates reliable constraint preservation under bounded context growth without model weight modifications or external retrieval infrastructure.

Conclusion: Effective dialogue memory requires more than retaining prior text - selectively allocating fidelity across past messages enables reliable constraint preservation. AFM provides lightweight solution compatible with existing chat APIs, supporting reproducible research and practical deployment.

Abstract: Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, yet their behavior remains bottlenecked by naive history management strategies. Replaying the full conversation at every turn is simple but costly, while recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context. As a result, models may retain text without reliably applying it when it matters. We present Adaptive Focus Memory (AFM), a lightweight context management system that dynamically assigns each past message one of three fidelity levels: Full, Compressed, or Placeholder, based on semantic relevance, temporal decay, and importance classification. AFM packs messages chronologically under a fixed token budget, preserving critical constraints at high fidelity while allowing low-importance context to degrade gracefully. We evaluate AFM on two multi-turn dialogue benchmarks designed to stress long-horizon constraint preservation: a safety-critical travel scenario involving a user with a severe peanut allergy, and a policy-critical tax compliance scenario involving an illegal evasion request. Under strict grading that requires both explicit constraint recall and appropriately conditioned generation, AFM succeeds in 83.3 percent of allergy runs where all baseline strategies fail, and preserves correct refusal behavior on the tax benchmark. These results demonstrate that effective dialogue memory requires more than retaining prior text. Selectively allocating fidelity across past messages enables reliable constraint preservation under bounded context growth, without modifying model weights or introducing external retrieval infrastructure. We release an open-source implementation of AFM compatible with OpenAI-style chat APIs to support reproducible research and practical deployment.
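
A minimal sketch of AFM's core loop as described above: score each past message by relevance, recency, and importance, map the score to a fidelity level, then pack messages chronologically under a token budget. The scoring weights, thresholds, and token costs are illustrative assumptions.

```python
# Hedged sketch: assign Full/Compressed/Placeholder fidelity per message and
# pack chronologically under a fixed budget.
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    relevance: float      # semantic similarity to the current query, in [0, 1]
    age: int              # turns since the message was sent
    important: bool       # e.g., flagged by an importance classifier

def fidelity(m: Message, decay: float = 0.9) -> str:
    score = m.relevance * (decay ** m.age) + (0.5 if m.important else 0.0)
    if score >= 0.6:
        return "FULL"
    return "COMPRESSED" if score >= 0.3 else "PLACEHOLDER"

def pack(history, budget: int):
    cost = {"FULL": 40, "COMPRESSED": 12, "PLACEHOLDER": 2}   # toy token costs
    packed, used = [], 0
    for m in history:                       # chronological order
        level = fidelity(m)
        if used + cost[level] <= budget:
            packed.append((level, m.text))
            used += cost[level]
    return packed

history = [Message("I have a severe peanut allergy", 0.9, 6, True),
           Message("ok thanks", 0.1, 5, False),
           Message("book dinner in Bangkok", 0.7, 1, False)]
print(pack(history, budget=100))  # the old allergy constraint stays at full fidelity
```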

[56] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun

Main category: cs.CL

TL;DR: fMRI-LM is a foundational model that bridges fMRI brain imaging with language through neural tokenization, joint modeling with LLMs, and multi-task instruction tuning, enabling semantic understanding of brain activity.

DetailsMotivation: While multimodal LLMs handle images, audio, and video, extending this capability to brain imaging remains unexplored. Bridging fMRI with language is essential to link neural activity with semantic cognition and develop cross-modal brain representations.

Method: Three-stage framework: 1) Neural tokenizer maps fMRI to discrete tokens in language-consistent space; 2) Pretrained LLM adapted to jointly model fMRI tokens and text; 3) Multi-task, multi-paradigm instruction tuning for high-level semantic understanding. Overcomes lack of fMRI-text pairs by constructing descriptive corpus translating imaging features to structured text.

Result: Achieves strong zero-shot and few-shot performance across benchmarks, adapts efficiently with parameter-efficient tuning (LoRA), establishing scalable pathway toward language-aligned universal model for fMRI understanding.

Conclusion: fMRI-LM successfully bridges fMRI and language, enabling semantic understanding of brain activity and providing a foundation for cross-modal brain representation learning and diverse downstream applications.

Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
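
A minimal sketch of the Stage 1 idea under one reading: a neural tokenizer that maps continuous fMRI features to discrete tokens by nearest-codebook lookup, VQ-style. The random codebook below stands in for one trained to be language-consistent.

```python
# Hedged sketch: discretize fMRI frames by nearest-neighbor lookup in a
# codebook; shapes and the codebook itself are illustrative.
import numpy as np

def tokenize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """features: (T, d) fMRI frames; codebook: (K, d). Returns (T,) token ids."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
print(tokenize(rng.normal(size=(5, 16)), rng.normal(size=(256, 16))))
```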

[57] FVA-RAG: Falsification-Verification Alignment for Mitigating Sycophantic Hallucinations

Mayank Ravishankara

Main category: cs.CL

TL;DR: FVA-RAG is a new RAG pipeline that inverts standard workflow by treating initial responses as hypotheses and retrieving counter-evidence to test them, significantly reducing retrieval sycophancy and improving accuracy on TruthfulQA.

DetailsMotivation: Standard RAG systems suffer from retrieval sycophancy - they preferentially retrieve evidence that supports user premises even when those premises are false, leading to hallucinations and confirmation bias.

Method: FVA-RAG inverts standard RAG workflow: 1) Generate initial draft hypothesis, 2) Explicitly retrieve anti-context (counter-evidence) to stress-test the hypothesis, 3) Use falsification-verification alignment to mitigate premise-confirming hallucinations.

Result: Achieves 79.80-80.05% accuracy on TruthfulQA-Generation benchmark (N=817), significantly outperforming Self-RAG (71.11-72.22%) and CRAG (71.36-73.93%) with p < 10^-6. Triggers falsification on 24.5-29.3% of queries.

Conclusion: Targeted counter-evidence retrieval is decisive for mitigating premise-confirming hallucinations in RAG systems, and the falsification-verification alignment approach effectively addresses retrieval sycophancy.

Abstract: Retrieval-Augmented Generation (RAG) reduces hallucinations by grounding answers in retrieved evidence, yet standard retrievers often exhibit retrieval sycophancy: they preferentially surface evidence that supports a user’s premise, even when the premise is false. We propose FVA-RAG (Falsification-Verification Alignment RAG), a pipeline that inverts the standard RAG workflow by treating the initial response as a draft hypothesis and explicitly retrieving anti-context to stress-test it. We evaluate on the full TruthfulQA-Generation benchmark (N=817) under a fully frozen protocol with 0 live web calls and identical retrieval budgets across methods. Using gpt-4o for generation and deterministic judging, FVA-RAG achieves 79.80-80.05% accuracy across two independently built frozen corpora, significantly outperforming prompted variants of Self-RAG (71.11-72.22%) and CRAG (71.36-73.93%) with p < 10^-6 according to McNemar’s test. FVA-RAG triggers falsification on 24.5-29.3% of queries, demonstrating that targeted counter-evidence retrieval is decisive for mitigating premise-confirming hallucinations.
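
A minimal sketch of the inverted workflow: draft an answer as a hypothesis, retrieve counter-evidence against it, keep the draft only if it survives, and otherwise revise. All callables are hypothetical stand-ins for the generator, retriever, and verifier.

```python
# Hedged sketch of falsification-first RAG control flow.
from typing import Callable, List

def fva_rag(query: str,
            draft: Callable[[str], str],
            retrieve_counter: Callable[[str], List[str]],
            survives: Callable[[str, List[str]], bool],
            revise: Callable[[str, List[str]], str]) -> str:
    hypothesis = draft(query)                         # step 1: draft as a hypothesis
    anti_context = retrieve_counter(hypothesis)       # step 2: fetch counter-evidence
    if survives(hypothesis, anti_context):            # step 3: stress-test the draft
        return hypothesis
    return revise(query, anti_context)                # falsified: rewrite against evidence

# Toy stand-ins so the control flow runs.
answer = fva_rag("Do we use 10% of our brains?",
                 draft=lambda q: "Yes, only 10%.",
                 retrieve_counter=lambda h: ["Imaging shows near-whole-brain activity."],
                 survives=lambda h, ev: not ev,
                 revise=lambda q, ev: f"No; {ev[0]}")
print(answer)
```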

[58] FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

Binbin Xu

Main category: cs.CL

TL;DR: FineFreq is a large-scale multilingual character frequency dataset covering 1900+ languages from 2013-2025, with 96 trillion characters processed from 57TB of text, providing per-character statistics with temporal analysis capabilities.

DetailsMotivation: To create a comprehensive character frequency dataset that preserves natural multilingual features (cross-script borrowings, emoji, acronyms) without artificial filtering, enabling fine-grained temporal analysis across languages.

Method: Derived from FineWeb and FineWeb2 corpora, processing 57TB of compressed text to extract frequency counts for 96 trillion characters. Includes Unicode metadata (category, script, block) for each character entry.

Result: Created a dataset covering over 1900 languages spanning 2013-2025, with per-character statistics including aggregate and year-level frequencies. Released in CSV and Parquet formats with associated metadata on GitHub and HuggingFace.

Conclusion: FineFreq provides a valuable resource for multilingual character frequency analysis, enabling domain-specific filtering and temporal studies while preserving natural linguistic features across a wide range of languages.

Abstract: We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
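
A minimal sketch of the downstream filtering the Unicode metadata enables. The column names ("char", "count", "script", "year") are assumptions about the released schema rather than confirmed field names; the repository documents the actual layout.

```python
# Hedged sketch: filter a FineFreq-style table by script and year, then
# normalize counts into relative frequencies.
import pandas as pd

df = pd.DataFrame({   # stand-in for pd.read_parquet("finefreq_<lang>.parquet")
    "char": ["e", "é", "😀", "A"],
    "count": [9000, 400, 25, 3000],
    "script": ["Latin", "Latin", "Common", "Latin"],
    "year": [2024, 2024, 2024, 2024],
})

latin_2024 = df[(df["script"] == "Latin") & (df["year"] == 2024)]
freqs = latin_2024.set_index("char")["count"] / latin_2024["count"].sum()
print(freqs.sort_values(ascending=False))
```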

[59] Non-Resolution Reasoning (NRR): A Computational Framework for Contextual Identity and Ambiguity Preservation

Kei Saito

Main category: cs.CL

TL;DR: NRR (Non-Resolution Reasoning) proposes retaining ambiguity as valid reasoning, challenging AI’s tendency to prematurely collapse multiple interpretations into single outputs.

DetailsMotivation: Current AI systems prematurely resolve ambiguity by collapsing multiple valid interpretations into single outputs, limiting their interpretive flexibility and reasoning capabilities.

Method: Introduces three principles: Non-Identity (A≠A), Approximate Identity (A≈A), and Non-Resolution; formalized through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking (CIT).

Result: NRR-lite maintains high entropy (H=0.63) at ambiguous turns while standard architectures collapse early (H=0.10), demonstrating preserved interpretive flexibility until context arrives.

Conclusion: The key question is not whether AI should resolve ambiguity, but when, how, and under whose control, advocating for ambiguity retention as a valid reasoning mode.

Abstract: Current AI systems exhibit a fundamental limitation: they resolve ambiguity prematurely. This premature semantic collapse (collapsing multiple valid interpretations into single outputs) stems from classical identity assumptions in neural architectures. We propose Non-Resolution Reasoning (NRR), treating ambiguity retention as a valid reasoning mode. NRR introduces three principles: (1) Non-Identity ($A \neq A$): the same symbol refers to different entities across contexts; (2) Approximate Identity ($A \approx A$): entities share partial overlap without being identical; (3) Non-Resolution: conflicting interpretations coexist without forced convergence. We formalize these through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking (CIT). Functional verification via Turn 1 Entropy measurement shows NRR-lite maintains high entropy ($H = 0.63$) at ambiguous turns while standard architectures collapse early ($H = 0.10$), demonstrating that NRR preserves interpretive flexibility until context arrives. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
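
A minimal sketch of the Turn 1 Entropy measurement: treat the model's distribution over candidate interpretations of an ambiguous utterance as a probability vector and report its normalized Shannon entropy, so values near 1 indicate retained ambiguity and values near 0 indicate collapse. The probabilities below are illustrative, not the paper's measurements.

```python
# Hedged sketch: normalized Shannon entropy over interpretation probabilities.
import math

def normalized_entropy(probs) -> float:
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

print(round(normalized_entropy([0.4, 0.35, 0.25]), 2))   # ambiguity retained
print(round(normalized_entropy([0.97, 0.02, 0.01]), 2))  # early collapse
```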

[60] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

Mykola Kuz, Ihor Lazarovych, Mykola Kozlenko, Mykola Pikuliak, Andrii Kvasniuk

Main category: cs.CL

TL;DR: Hybrid LSTM-CNN-Attention model with GloVe embeddings achieves 0.98 accuracy for web content classification, outperforming individual CNN, LSTM, and BERT baselines.

DetailsMotivation: To enhance web content classification by addressing limitations of individual neural architectures, combining their strengths to capture both fine-grained text structures and broader semantic context for improved generalization.

Method: Proposes a hybrid deep learning architecture integrating LSTM (for long-range dependencies), CNN (for local n-gram patterns), and Attention mechanism (for selective focus on informative parts), using pretrained GloVe embeddings for word representation. Evaluated with 5-fold cross-validation.

Result: Achieved outstanding performance: accuracy 0.98, precision 0.94, recall 0.92, F1-score 0.93, surpassing baseline models (CNN-only, LSTM-only, and BERT).

Conclusion: The hybrid architecture effectively combines syntactic feature extraction and semantic interpretation, demonstrating high effectiveness for web content classification and supporting broader use of hybrid deep learning approaches in NLP applications with complex textual data.

Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
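
A minimal Keras sketch of the described stack (embedding, CNN for local n-grams, LSTM for long-range order, additive attention pooling). Layer sizes, the exact ordering, and the attention formulation are assumptions; `glove_matrix` in the comment is a placeholder for the pretrained GloVe weight matrix.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, DIM, MAXLEN, CLASSES = 20000, 100, 300, 4  # placeholder sizes

inp = layers.Input(shape=(MAXLEN,), dtype="int32")
# Pretrained GloVe vectors would be loaded via weights=[glove_matrix].
x = layers.Embedding(VOCAB, DIM)(inp)
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)  # local n-grams
x = layers.LSTM(128, return_sequences=True)(x)                   # long-range order
# Simple additive attention pooling over time steps.
scores = layers.Dense(1)(x)                      # (batch, time, 1)
weights = layers.Softmax(axis=1)(scores)
context = tf.reduce_sum(weights * x, axis=1)     # weighted sum over time
out = layers.Dense(CLASSES, activation="softmax")(context)

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```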

[61] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

Bingyang Kelvin Liu, Ziyu Patrick Chen, David P. Woodruff

Main category: cs.CL

TL;DR: JEPA-Reasoner: a JEPA-based architecture with generative ability for latent reasoning, paired with a separate Talker model that reconstructs text, avoiding the compounding errors of autoregressive token-by-token generation.

DetailsMotivation: JEPA lacks generative abilities, while current latent reasoning models suffer from token-by-token generation issues like compounding errors and heavy context dependency.

Method: Proposed JEPA-Reasoner architecture enhanced with generative ability for latent reasoning, augmented with separate Talker model to reconstruct human-readable text from latent representations.

Result: Decoupling latent-space reasoning from token production lets the model produce mixed latent vectors, supports multi-threaded reasoning, and achieves superior robustness against compounding errors.

Conclusion: JEPA-Reasoner addresses limitations of both JEPA (lack of generation) and current latent reasoning models (autoregressive errors), providing a foundation for more robust reasoning systems.

Abstract: While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, current latent reasoning models remain limited by the token-by-token generation paradigm, which suffers from compounding errors and heavy context dependency. To address these limitations, we proposed JEPA-Reasoner, a novel JEPA-based architecture enhanced with generative ability for latent reasoning. We augment this architecture with a separate action-talker model, Talker, to reconstruct human-readable text from latent representations produced by the JEPA-Reasoner. Our work demonstrated that decoupling latent-space reasoning from token production enables JEPA-Reasoner to produce mixed latent vectors, laying a foundation for multi-threaded reasoning and achieving superior robustness against compounding errors in autoregressive generation.
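
A conceptual PyTorch sketch of the decoupling: one module predicts the next latent state without ever emitting tokens, and a separate Talker decodes latents to text. All module names, sizes, and the transformer backbone are invented for illustration; the paper's actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    """Predicts the next latent state directly in embedding space,
    so reasoning never round-trips through discrete tokens."""
    def __init__(self, d=512):
        super().__init__()
        self.step = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, z):                 # z: (batch, steps, d)
        return self.step(z)[:, -1:, :]    # next latent "thought"

class Talker(nn.Module):
    """Separate decoder that verbalizes latents into token logits."""
    def __init__(self, d=512, vocab=32000):
        super().__init__()
        self.proj = nn.Linear(d, vocab)

    def forward(self, z):
        return self.proj(z)

reasoner, talker = LatentReasoner(), Talker()
z = torch.randn(2, 5, 512)               # a short latent trajectory
for _ in range(3):                        # reason in latent space only
    z = torch.cat([z, reasoner(z)], dim=1)
logits = talker(z[:, -1])                 # decode the final thought to text
```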

[62] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang

Main category: cs.CL

TL;DR: MobileWorld is a new, more challenging mobile-use benchmark with 201 tasks across 20 apps, featuring long-horizon workflows, multi-app tasks, and novel evaluation categories like agent-user interaction and MCP-augmented tasks.

DetailsMotivation: AndroidWorld has become saturated with agents achieving over 90% success rates, lacks key application categories (e-commerce, enterprise communication), and doesn't reflect realistic mobile-use scenarios with vague instructions and hybrid tool usage.

Method: Created MobileWorld with 201 tasks across 20 applications using open-source alternatives to industry standards (e.g., Mattermost for Slack) for reproducible evaluation. Features long-horizon, cross-application workflows with nearly double the completion steps and significantly more multi-app tasks than AndroidWorld. Includes novel task categories for agent-user interaction and MCP-augmented tasks.

Result: MobileWorld reveals a sharp performance drop: best agentic framework achieves 51.7% success rate, best end-to-end model achieves only 20.9% success rate, compared to AndroidWorld’s over 90% success rates, highlighting substantial room for improvement.

Conclusion: MobileWorld provides a substantially more challenging and realistic benchmark for mobile-use agents, exposing current limitations and creating ample headroom for future research in agent capabilities for complex, real-world mobile scenarios.

Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows, requiring nearly twice as many completion steps on average (27.8 vs. 14.3) and featuring a significantly higher proportion of multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. To overcome the limitations of existing environments, MobileWorld achieves a balance between production-grade utility and reproducible evaluation by utilizing open-source alternatives to industry standards (e.g., Mattermost for Slack). This approach enables a fully observable and controlled environment through source code modification and direct backend database access for precise verification. MobileWorld also introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. To facilitate evaluation, we develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting ample headroom for future research.
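
To illustrate what "extended action spaces" might look like, here is a hypothetical sketch of an executor whose actions include agent-user interaction and MCP calls alongside ordinary UI actions. Every name here (action types, `env`/`user`/`mcp` interfaces) is invented; the benchmark's real API is not shown in the summary.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Tap:      x: int; y: int
@dataclass
class TypeText: text: str
@dataclass
class AskUser:  question: str          # agent-user interaction task
@dataclass
class CallMCP:  tool: str; args: dict  # MCP-augmented task

Action = Union[Tap, TypeText, AskUser, CallMCP]

def execute(action: Action, env, user, mcp):
    """Executor half of a planner-executor loop (interfaces are stand-ins)."""
    if isinstance(action, Tap):
        env.tap(action.x, action.y)
    elif isinstance(action, TypeText):
        env.type(action.text)
    elif isinstance(action, AskUser):
        return user.reply(action.question)
    elif isinstance(action, CallMCP):
        return mcp.invoke(action.tool, action.args)

class _Stub:  # minimal stand-ins so the sketch runs
    def tap(self, x, y): print(f"tap({x},{y})")
    def type(self, t): print(f"type({t!r})")
    def reply(self, q): return "user answer"
    def invoke(self, tool, args): return {"tool": tool, **args}

execute(Tap(10, 20), _Stub(), _Stub(), _Stub())
print(execute(CallMCP("calendar.create", {"title": "demo"}), _Stub(), _Stub(), _Stub()))
```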

[63] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data

Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener, Avi Raju, Jeremy Kornbluth, Yanshan Wang, Zhaohui Su, Hrituraj Singh

Main category: cs.CL

TL;DR: LLM-based agentic framework achieves high accuracy (F1=0.93) for extracting structured oncology data from 400K+ unstructured EHR notes, significantly reducing manual annotation costs.

DetailsMotivation: Unstructured EHR notes contain vital oncology information but are challenging to extract due to variability, specialized terminology, and inconsistent formats. Manual abstraction is costly and unscalable, while existing automated approaches are too narrow and don't handle patient-level synthesis across contradictory documents.

Method: Agentic framework using LLMs as reasoning agents with context-sensitive retrieval and iterative synthesis capabilities. Systematically decomposes complex oncology data extraction into modular, adaptive tasks to extract structured clinical variables from real-world oncology notes.

Result: Achieved an average F1-score of 0.93 on 400,000+ unstructured notes and PDF reports from 2,250 cancer patients. 100 of 103 oncology-specific clinical variables exceeded 0.85 F1, with critical variables (biomarkers, medications) surpassing 0.95. Integration into the data curation workflow yielded a 0.94 manual approval rate, significantly reducing annotation costs.

Conclusion: First exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale, demonstrating high accuracy and practical utility for real-world clinical data curation.

Abstract: Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in a 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.
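
The decomposition described above, per-document extraction followed by patient-level synthesis across contradictory documents, can be sketched in a few lines. `llm` is a placeholder callable and the prompts and variable schema are invented; this is only the shape of the pipeline, not the paper's implementation.

```python
def extract_variable(llm, variable: str, notes: list[str]) -> str:
    """Extract a candidate value per document, then synthesize a single
    patient-level answer across possibly contradictory documents."""
    per_doc = []
    for note in notes:
        ans = llm(f"From this note, extract '{variable}' with evidence "
                  f"and a date if present:\n{note}")
        per_doc.append(ans)
    # Iterative synthesis: reconcile conflicts using evidence and recency.
    return llm("Reconcile these candidate values into one patient-level "
               f"value for '{variable}', preferring well-evidenced and "
               f"more recent findings:\n{per_doc}")

# Demo with a trivial stand-in "LLM" that just echoes its prompt length.
print(extract_variable(lambda p: f"[{len(p)} chars]", "ER status",
                       ["note A", "note B"]))
```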

[64] Step-DeepResearch Technical Report

Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Main category: cs.CL

TL;DR: Step-DeepResearch is a cost-effective 32B parameter agent that achieves expert-level deep research capabilities through atomic capability synthesis, progressive training, and checklist-style verification, scoring 61.4% on research rubrics and rivaling closed-source SOTA models.

DetailsMotivation: Existing academic benchmarks like BrowseComp fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. There's also an evaluation gap in the Chinese domain for realistic deep research scenarios.

Method: 1) Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing; 2) Progressive training path from agentic mid-training to SFT and RL; 3) Checklist-style Judger for improved robustness; 4) Establishment of ADR-Bench for Chinese domain evaluation.

Result: Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch.

Conclusion: Refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency, proving that well-designed training approaches can make smaller models competitive with much larger closed-source alternatives.

Abstract: As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
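
The "Checklist-style Judger" idea, grading each rubric item with an independent yes/no question rather than one holistic score, can be sketched as follows. The prompt wording and rubric items are invented; `llm` is a placeholder callable.

```python
def checklist_judge(llm, report: str, checklist: list[str]) -> float:
    """Score a research report as the fraction of rubric items satisfied.
    Judging items independently is what makes checklist-style
    verification more robust than a single holistic grade."""
    passed = 0
    for item in checklist:
        verdict = llm(f"Answer YES or NO. Does the report satisfy this "
                      f"requirement?\nRequirement: {item}\nReport: {report}")
        passed += verdict.strip().upper().startswith("YES")
    return passed / len(checklist)

# Demo with a stand-in judge that approves everything.
print(checklist_judge(lambda p: "YES", "report text",
                      ["cites sources", "answers the question"]))  # 1.0
```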

cs.CV

[65] Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation

Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, Dhruv Kumar

Main category: cs.CV

TL;DR: Proposes a data-driven framework using Vision-Language Models to extract audiovisual features, cluster them into interpretable factors, and predict engagement on short-form edutainment videos, outperforming traditional metrics.

DetailsMotivation: Existing video evaluation frameworks like VideoScore-2 assess visual/semantic fidelity but fail to capture how specific audiovisual attributes drive real audience engagement. Need human-aligned, multimodal reasoning for short-form video content.

Method: Uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos using a curated YouTube Shorts dataset.

Result: Strong correlations between predicted and actual engagement, demonstrating that the lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (SSIM, FID).

Conclusion: By grounding evaluation in both multimodal feature importance and human-centered engagement signals, the approach advances toward robust and explainable video understanding for short-form content.

Abstract: Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
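
The final stage of the pipeline, a regression from VLM-derived features to engagement, is conceptually simple. A sketch with scikit-learn, using random stand-ins for the feature matrix and an engagement signal (both placeholders, not the paper's data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

# X: VLM-derived audiovisual features per video (random stand-ins here);
# y: an engagement signal such as log views.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
y = X @ rng.normal(size=32) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
evaluator = Ridge(alpha=1.0).fit(X_tr, y_tr)
rho, _ = spearmanr(evaluator.predict(X_te), y_te)
print(f"rank correlation with engagement: {rho:.2f}")
```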

[66] A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding

Christina Liu, Alan Q. Wang, Joy Hsu, Jiajun Wu, Ehsan Adeli

Main category: cs.CV

TL;DR: Tool Bottleneck Framework (TBF) improves medical image understanding by using a learned neural network to compose specialized tool outputs instead of text-based composition, achieving better performance especially in data-limited scenarios.

DetailsMotivation: Current tool-use frameworks for vision-language models rely on text-based composition of tools, which performs poorly on medical images where crucial information is spatially-localized and difficult to describe in text alone.

Method: TBF uses an off-the-shelf medical VLM to select clinically-relevant tools, then composes their outputs via a learned Tool Bottleneck Model (TBM) - a neural network that fuses tool features before making final predictions.

Result: TBF performs on par with or better than deep learning classifiers, VLMs, and state-of-the-art tool-use frameworks on histopathology and dermatology tasks, with particular advantages in data-limited regimes.

Conclusion: The framework improves medical image understanding by enabling better tool composition through learned neural fusion rather than text, yielding more interpretable and clinically-grounded predictors.

Abstract: Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
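
A minimal sketch of the fusion idea: project each tool's feature output to a shared width and combine them with a small learned network instead of composing them as text. Dimensions are illustrative, and mean-pooling is one simple way to handle an arbitrary tool selection; the paper's actual strategy is not reproduced here.

```python
import torch
import torch.nn as nn

class ToolBottleneckModel(nn.Module):
    """Fuses the feature outputs of VLM-selected tools with a small
    network, replacing text-based composition (sizes are illustrative)."""
    def __init__(self, tool_dims, n_classes):
        super().__init__()
        # Project each tool's output to a shared width, then fuse.
        self.proj = nn.ModuleList([nn.Linear(d, 64) for d in tool_dims])
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, tool_feats):            # list of (batch, d_i) tensors
        shared = [p(f) for p, f in zip(self.proj, tool_feats)]
        fused = torch.stack(shared).mean(0)   # mean lets tool count vary
        return self.head(fused)

tbm = ToolBottleneckModel(tool_dims=[128, 10, 256], n_classes=2)
feats = [torch.randn(4, d) for d in (128, 10, 256)]
logits = tbm(feats)
```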

[67] Data relativistic uncertainty framework for low-illumination anime scenery image enhancement

Yiquan Gao, John See

Main category: cs.CV

TL;DR: Proposes DRU framework for low-light enhancement of anime scenery images using data relativistic uncertainty to address illumination uncertainty and domain gap.

DetailsMotivation: Addresses the domain gap in low-light enhancement for anime scenery images, which is underexplored compared to natural images/videos, and tackles data scarcity in this niche domain.

Method: Constructs unpaired anime scenery dataset, proposes Data Relativistic Uncertainty (DRU) framework inspired by Relativistic GAN, quantifies illumination uncertainty using wave-particle duality analogy, and dynamically adjusts objective functions to recalibrate learning under data uncertainty.

Result: The DRU framework yields perceptual and aesthetic quality beyond that of state-of-the-art methods when applied to EnlightenGAN variants, demonstrating the effectiveness of learning from a data-uncertainty perspective.

Conclusion: The framework exposes a novel data-centric learning paradigm applicable to visual and language domains, with code made available for further research.

Abstract: By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent with the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.
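
The DRU framework's own uncertainty quantification is not detailed in the summary, but the Relativistic GAN idea it is motivated by is standard: the discriminator judges each sample as more or less realistic than the average of the opposite batch. A sketch of that baseline loss (DRU's data-uncertainty weighting would modulate terms like these):

```python
import torch
import torch.nn.functional as F

def ragan_d_loss(critic_real, critic_fake):
    """Relativistic average GAN discriminator loss on raw critic logits."""
    real_rel = critic_real - critic_fake.mean()
    fake_rel = critic_fake - critic_real.mean()
    loss_real = F.binary_cross_entropy_with_logits(
        real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_rel, torch.zeros_like(fake_rel))
    return loss_real + loss_fake

d_loss = ragan_d_loss(torch.randn(8), torch.randn(8))
```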

[68] Scalable Deep Subspace Clustering Network

Nairouz Mrabah, Mohamed Bouguessa, Sihem Sami

Main category: cs.CV

TL;DR: SDSNet is a scalable deep subspace clustering method with O(n) complexity using landmark approximation to avoid O(n³) computational bottlenecks of traditional methods.

DetailsMotivation: Traditional subspace clustering methods suffer from O(n³) computational complexity due to constructing full affinity matrices and spectral decomposition, making them impractical for large datasets. Deep learning approaches improve feature extraction but maintain the same computational bottleneck through exhaustive pairwise similarity computations.

Method: SDSNet uses: (1) landmark-based approximation to avoid full affinity matrices, (2) joint optimization of auto-encoder reconstruction with self-expression objectives, and (3) direct spectral clustering on factorized representations. It combines convolutional auto-encoders with subspace-preserving constraints.

Result: Experimental results show SDSNet achieves comparable clustering quality to state-of-the-art methods while providing significantly improved computational efficiency with O(n) complexity.

Conclusion: SDSNet successfully addresses the scalability limitations of subspace clustering methods by reducing computational complexity from O(n³) to O(n) while maintaining competitive clustering performance through landmark approximation and deep learning integration.

Abstract: Subspace clustering methods face inherent scalability limits due to the $O(n^3)$ cost (with $n$ denoting the number of data samples) of constructing full $n\times n$ affinities and performing spectral decomposition. While deep learning-based approaches improve feature extraction, they maintain this computational bottleneck through exhaustive pairwise similarity computations. We propose SDSNet (Scalable Deep Subspace Network), a deep subspace clustering framework that achieves $\mathcal{O}(n)$ complexity through (1) landmark-based approximation, avoiding full affinity matrices, (2) joint optimization of auto-encoder reconstruction with self-expression objectives, and (3) direct spectral clustering on factorized representations. The framework combines convolutional auto-encoders with subspace-preserving constraints. Experimental results demonstrate that SDSNet achieves comparable clustering quality to state-of-the-art methods with significantly improved computational efficiency.
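
The core trick, avoiding the $n \times n$ affinity by embedding points via their similarity to $m \ll n$ landmarks and clustering a factorized representation, can be shown in plain numpy. This is a classical landmark spectral-clustering sketch, not SDSNet's deep variant; the kernel bandwidth and landmark sampling are simple defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

def landmark_spectral_clustering(X, m=50, k=5, seed=0):
    """O(n*m) alternative to an n x n affinity: embed points by their
    similarity to m landmarks, then cluster the left singular vectors."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / d2.mean())                  # n x m similarity factor
    Z /= Z.sum(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return KMeans(n_clusters=k, n_init=10,
                  random_state=seed).fit_predict(U[:, :k])

labels = landmark_spectral_clustering(np.random.randn(1000, 16))
```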

[69] Intelligent recognition of GPR road hidden defect images based on feature fusion and attention mechanism

Haotian Lv, Yuhui Zhang, Jiangbo Dai, Hanli Wu, Jiaji Wang, Dawei Wang

Main category: cs.CV

TL;DR: A novel deep learning framework (MCGA-Net) for automated GPR image analysis achieves high accuracy in detecting subsurface road defects through data augmentation, multi-modal feature fusion, and attention mechanisms.

DetailsMotivation: Conventional GPR image interpretation relies heavily on subjective expertise, leading to inefficiencies and inaccuracies in detecting subsurface road defects. There's a need for automated, objective analysis methods.

Method: Three-part framework: 1) DCGAN-based data augmentation to synthesize GPR images and address data scarcity; 2) MCGA-Net with Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale representation and Global Attention Mechanism (GAM) for context-aware enhancement; 3) MS COCO transfer learning for backbone network fine-tuning.

Result: MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). The model demonstrates robustness against Gaussian noise, weak signals, and small targets, outperforming other models in complex subsurface environments.

Conclusion: The framework establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments, reducing reliance on subjective expertise.

Abstract: Ground Penetrating Radar (GPR) has emerged as a pivotal tool for non-destructive evaluation of subsurface road defects. However, conventional GPR image interpretation remains heavily reliant on subjective expertise, introducing inefficiencies and inaccuracies. This study introduces a comprehensive framework to address these limitations: (1) A DCGAN-based data augmentation strategy synthesizes high-fidelity GPR images to mitigate data scarcity while preserving defect morphology under complex backgrounds; (2) A novel Multi-modal Chain and Global Attention Network (MCGA-Net) is proposed, integrating Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and Global Attention Mechanism (GAM) for context-aware feature enhancement; (3) MS COCO transfer learning fine-tunes the backbone network, accelerating convergence and improving generalization. Ablation and comparison experiments validate the framework’s efficacy. MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). Under Gaussian noise, weak signals, and small targets, MCGA-Net maintains robustness and outperforms other models. This work establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments.

[70] BertsWin: Resolving Topological Sparsity in 3D Masked Autoencoders via Component-Balanced Structural Optimization

Evgeny Alves Limarenko, Anastasiia Studenikina

Main category: cs.CV

TL;DR: BertsWin: A hybrid SSL architecture combining BERT-style token masking with Swin Transformer windows for 3D medical imaging, achieving 5.8x faster convergence and 15x reduction in training epochs while maintaining computational efficiency.

DetailsMotivation: Standard Masked Autoencoders (MAE) struggle with 3D volumetric medical images due to difficulty capturing three-dimensional spatial relationships, especially when discarding 75% of tokens. There's a need for better SSL methods that preserve 3D spatial topology while maintaining computational efficiency.

Method: Proposes BertsWin - hybrid architecture combining full BERT-style token masking with Swin Transformer windows. Uses complete 3D grid of tokens (masked and visible) to preserve spatial topology, single-level local Swin windows to handle quadratic complexity, and introduces structural priority loss function.

Result: Achieves 5.8x faster semantic convergence compared to standard ViT-MAE baselines. With GradientConductor optimizer, achieves 15-fold reduction in training epochs (44 vs 660) to reach state-of-the-art reconstruction fidelity. Maintains theoretical FLOP parity with sparse ViT baselines at canonical resolutions.

Conclusion: BertsWin effectively addresses 3D SSL challenges by preserving complete spatial topology while maintaining computational efficiency, enabling practical application to volumetric medical imaging like TMJ segmentation in CT scans.

Abstract: Self-supervised learning (SSL) and Vision Transformer (ViT) approaches demonstrate promising results in 2D medical imaging, but applying these methods to 3D volumetric images is fraught with difficulties. Standard Masked Autoencoders (MAE), the state-of-the-art solution in 2D, struggle to capture three-dimensional spatial relationships, especially when 75% of tokens are discarded during pre-training. We propose BertsWin, a hybrid architecture combining full BERT-style token masking with Swin Transformer windows, to enhance spatial context learning in 3D during SSL pre-training. Unlike the classic MAE, which processes only visible areas, BertsWin introduces a complete 3D grid of tokens (masked and visible), preserving the spatial topology. To temper the quadratic complexity of ViT, single-level local Swin windows are used. We introduce a structural priority loss function and evaluate the approach on cone-beam computed tomography of the temporomandibular joints; the subsequent assessment includes TMJ segmentation on 3D CT scans. We demonstrate that the BertsWin architecture, by maintaining a complete three-dimensional spatial topology, inherently accelerates semantic convergence by a factor of 5.8x compared to standard ViT-MAE baselines. Furthermore, when coupled with our proposed GradientConductor optimizer, the full BertsWin framework achieves a 15-fold reduction in training epochs (44 vs 660) required to reach state-of-the-art reconstruction fidelity. Analysis reveals that BertsWin achieves this acceleration without the computational penalty typically associated with dense volumetric processing. At canonical input resolutions, the architecture maintains theoretical FLOP parity with sparse ViT baselines, resulting in a significant net reduction in total computational resources due to faster convergence.

[71] CCAD: Compressed Global Feature Conditioned Anomaly Detection

Xiao Jin, Liang Diao, Qixin Xiao, Yifan Hu, Ziqi Zhang, Yuchen Liu, Haisong Gu

Main category: cs.CV

TL;DR: CCAD combines reconstruction-based and representation-based anomaly detection by using compressed global features as conditioning for reconstruction, improving performance and efficiency.

DetailsMotivation: Current anomaly detection methods have limitations: unsupervised representation methods struggle with domain shift, while reconstruction methods suffer from low training efficiency and performance degradation due to insufficient constraints.

Method: CCAD synergizes both paradigms by adapting global features as a new modality condition for reconstruction models, with an adaptive compression mechanism to enhance generalization and training efficiency.

Result: CCAD consistently outperforms state-of-the-art methods in AUC while achieving faster convergence, validated on a reorganized DAGM 2007 dataset with new annotations.

Conclusion: CCAD effectively addresses limitations of existing anomaly detection approaches by combining strengths of reconstruction and representation methods through compressed global feature conditioning.

Abstract: Anomaly detection holds considerable industrial significance, especially in scenarios with limited anomalous data. Currently, reconstruction-based and unsupervised representation-based approaches are the primary focus. However, unsupervised representation-based methods struggle to extract robust features under domain shift, whereas reconstruction-based methods often suffer from low training efficiency and performance degradation due to insufficient constraints. To address these challenges, we propose a novel method named Compressed Global Feature Conditioned Anomaly Detection (CCAD). CCAD synergizes the strengths of both paradigms by adapting global features as a new modality condition for the reconstruction model. Furthermore, we design an adaptive compression mechanism to enhance both generalization and training efficiency. Extensive experiments demonstrate that CCAD consistently outperforms state-of-the-art methods in terms of AUC while achieving faster convergence. In addition, we contribute a reorganized and re-annotated version of the DAGM 2007 dataset with new annotations to further validate our method’s effectiveness. The code for reproducing main results is available at https://github.com/chloeqxq/CCAD.

[72] IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh

Main category: cs.CV

TL;DR: ISIC MultiAnnot++ is the largest publicly available multi-annotator skin lesion segmentation dataset with 17,684 masks across 14,967 dermoscopic images, including annotator metadata.

DetailsMotivation: There is a lack of large-scale publicly available multi-annotator skin lesion segmentation datasets with annotator-level labels for dermoscopic imaging, which is crucial for studying annotator variability and preference modeling.

Method: Created ISIC MultiAnnot++ by collecting and organizing segmentation masks from the ISIC Archive, including 2,394 images with 2-5 segmentations per image, along with annotator metadata (skill level, segmentation tool).

Result: The dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, making it the largest publicly available SLS dataset with comprehensive annotator metadata.

Conclusion: ISIC MultiAnnot++ enables research on annotator-specific preference modeling, metadata analysis, and provides curated data partitions and consensus segmentation masks for the community.

Abstract: Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernable from regular clinical photographs. However, currently there are no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator-labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators’ skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis on the characteristics of this dataset, curated data partitions, and consensus segmentation masks.
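
One simple way to derive consensus masks like those the dataset provides is per-pixel majority vote across annotators. Whether IMA++ used majority vote, STAPLE, or another fusion rule is not stated in the summary; this is only an illustrative baseline.

```python
import numpy as np

def majority_vote_consensus(masks: np.ndarray) -> np.ndarray:
    """masks: (n_annotators, H, W) binary arrays for one image.
    A pixel is foreground if at least half the annotators marked it."""
    return (masks.mean(axis=0) >= 0.5).astype(np.uint8)

annots = np.stack([np.random.rand(64, 64) > 0.5 for _ in range(3)])
consensus = majority_vote_consensus(annots)
```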

[73] GPF-Net: Gated Progressive Fusion Learning for Polyp Re-Identification

Suncheng Xiang, Xiaoyang Wang, Junjie Jiang, Hejia Wang, Dahong Qian

Main category: cs.CV

TL;DR: Proposes Gated Progressive Fusion network for colonoscopic polyp re-identification, using multi-level feature fusion with gating mechanisms to improve matching accuracy.

DetailsMotivation: Colonoscopic polyp re-identification is crucial for colorectal cancer prevention, but existing methods using coarse high-level features perform poorly on small polyps where detailed information matters.

Method: Gated Progressive Fusion network that selectively fuses features from multiple levels using gates in a fully connected way, with layer-wise refinement of semantic information through multi-level feature interactions.

Result: Experiments on standard benchmarks show benefits over state-of-the-art unimodal ReID models, especially when combined with a specialized multimodal fusion strategy.

Conclusion: The proposed architecture effectively addresses the challenge of polyp re-identification by leveraging multi-level feature fusion with gating mechanisms, improving performance particularly for small polyps.

Abstract: Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the coarse resolution of high-level features of a specific polyp often leads to inferior results for small objects where detailed information is important. To address this challenge, we propose a novel architecture, named Gated Progressive Fusion network, to selectively fuse features from multiple levels using gates in a fully connected way for polyp ReID. On the basis of it, a gated progressive fusion strategy is introduced to achieve layer-wise refinement of semantic information through multi-level feature interactions. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.
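
The basic unit behind gated fusion is a learned sigmoid gate that mixes features from two levels, keeping fine detail where it matters for small polyps. A minimal PyTorch sketch (channel counts are illustrative; GPF-Net's full progressive, fully connected wiring across levels is not reproduced):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a fine (high-res) and a coarse (semantic) feature map with a
    learned per-channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, fine, coarse):
        g = self.gate(torch.cat([fine, coarse], dim=1))
        return g * fine + (1 - g) * coarse   # keep detail where gate is high

fuse = GatedFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```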

[74] Generative Multi-Focus Image Fusion

Xinzhe Xie, Buyu Guo, Bolin Li, Shuangyan He, Yanzhen Gu, Qingyan Jiang, Peiliang Li

Main category: cs.CV

TL;DR: GMFF is a two-stage generative multi-focus image fusion framework that first performs deterministic fusion using StackMFF V4, then uses IFControlNet (latent diffusion model) for generative restoration to handle missing focal planes and eliminate edge artifacts.

DetailsMotivation: Existing multi-focus image fusion methods assume at least one input image has each spatial location in focus, and suffer from edge artifacts due to uncertain focus estimation or hard-selection operations in complex real-world scenarios.

Method: Two-stage cascaded framework: 1) Deterministic fusion using StackMFF V4 with focal plane information to produce initial fused image; 2) Generative restoration using IFControlNet (latent diffusion model) to reconstruct missing focal plane content, restore fine details, and eliminate edge artifacts.

Result: GMFF achieves state-of-the-art fusion performance and shows significant potential for practical applications, especially with complex multi-focal content. Implementation is publicly available.

Conclusion: The proposed GMFF framework effectively addresses limitations of existing methods by combining deterministic fusion with generative restoration, handling missing focal planes and eliminating edge artifacts for superior multi-focus image fusion.

Abstract: Multi-focus image fusion aims to generate an all-in-focus image from a sequence of partially focused input images. Existing fusion algorithms generally assume that, for every spatial location in the scene, there is at least one input image in which that location is in focus. Furthermore, current fusion models often suffer from edge artifacts caused by uncertain focus estimation or hard-selection operations in complex real-world scenarios. To address these limitations, we propose a generative multi-focus image fusion framework, termed GMFF, which operates in two sequential stages. In the first stage, deterministic fusion is implemented using StackMFF V4, the latest version of the StackMFF series, and integrates the available focal plane information to produce an initial fused image. The second stage, generative restoration, is realized through IFControlNet, which leverages the generative capabilities of latent diffusion models to reconstruct content from missing focal planes, restore fine details, and eliminate edge artifacts. Each stage is independently developed and functions seamlessly in a cascaded manner. Extensive experiments demonstrate that GMFF achieves state-of-the-art fusion performance and exhibits significant potential for practical applications, particularly in scenarios involving complex multi-focal content. The implementation is publicly available at https://github.com/Xinzhe99/StackMFF-Series.

[75] SVBench: Evaluation of Video Generation Models on Social Reasoning

Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang

Main category: cs.CV

TL;DR: Researchers create the first benchmark for evaluating social reasoning in video generation, revealing major gaps in AI models’ ability to generate socially coherent behavior compared to humans.

DetailsMotivation: Current text-to-video generation models excel in visual realism and motion fidelity but fundamentally lack social reasoning capabilities - they can't infer intentions, beliefs, emotions, or social norms like humans do from visual cues.

Method: Developed a training-free agent-based pipeline that: 1) distills reasoning mechanisms from 30 classic social cognition paradigms across 7 dimensions, 2) synthesizes diverse video scenarios, 3) enforces conceptual neutrality through cue-based critique, and 4) evaluates videos using VLM judges across 5 interpretable social reasoning dimensions.

Result: Large-scale evaluation of 7 state-of-the-art video generation systems shows substantial performance gaps - models excel in surface-level plausibility but systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

Conclusion: There’s a critical need to develop video generation models with social reasoning capabilities, as current systems fundamentally lack the ability to generate socially coherent behavior despite advances in visual quality and motion fidelity.

Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

[76] CAE-Net: Generalized Deepfake Image Detection using Convolution and Attention Mechanisms with Spatial and Frequency Domain Features

Anindya Bhattacharjee, Kaidul Islam, Kafi Anan, Ashir Intesher, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz

Main category: cs.CV

TL;DR: CAE-Net is a deepfake detection model combining spatial and frequency features using EfficientNet, DeiT, and ConvNeXt with wavelet features, trained with a novel disjoint-subset strategy to handle class imbalance, achieving 94.46% accuracy on DF-Wild Cup dataset.

DetailsMotivation: Deepfake detection faces challenges from diverse generation techniques and severe class imbalance in datasets (5:1 fake-to-real ratio), requiring robust and generalized detection methods that can handle these issues effectively.

Method: Proposes CAE-Net: a Convolution- and Attention-based weighted Ensemble network combining spatial (EfficientNet, ConvNeXt) and frequency-domain (wavelet features) representations with attention mechanisms (DeiT). Introduces multistage disjoint-subset training strategy to handle class imbalance by sequentially training on non-overlapping fake subsets while retaining knowledge across stages.

Result: Achieved 94.46% accuracy and 97.60% AUC on the DF-Wild Cup dataset, outperforming conventional class-balancing methods. Visualizations show the network focuses on meaningful facial regions, and ensemble design demonstrates robustness against adversarial attacks.

Conclusion: CAE-Net provides a dependable and generalized deepfake detection framework that effectively addresses class imbalance through innovative training strategies and combines complementary spatial-frequency features for robust performance against diverse deepfake generation techniques.

Abstract: The spread of deepfakes poses significant security concerns, demanding reliable detection methods. However, diverse generation techniques and class imbalance in datasets create challenges. We propose CAE-Net, a Convolution- and Attention-based weighted Ensemble network combining spatial and frequency-domain features for effective deepfake detection. The architecture integrates EfficientNet, Data-Efficient Image Transformer (DeiT), and ConvNeXt with wavelet features to learn complementary representations. We evaluated CAE-Net on the diverse IEEE Signal Processing Cup 2025 (DF-Wild Cup) dataset, which has a 5:1 fake-to-real class imbalance. To address this, we introduce a multistage disjoint-subset training strategy, sequentially training the model on non-overlapping subsets of the fake class while retaining knowledge across stages. Our approach achieved 94.46% accuracy and a 97.60% AUC, outperforming conventional class-balancing methods. Visualizations confirm the network focuses on meaningful facial regions, and our ensemble design demonstrates robustness against adversarial attacks, positioning CAE-Net as a dependable and generalized deepfake detection framework.
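
The multistage disjoint-subset idea can be sketched in a few lines: split the majority (fake) class into non-overlapping chunks and train the same model sequentially on each chunk plus all real samples, so every stage sees a roughly balanced mix. The training call and dataset handling are placeholders, not the paper's implementation.

```python
import numpy as np

def multistage_disjoint_training(model, real_ds, fake_ds, stages=5):
    """Train sequentially on non-overlapping fake subsets instead of
    downsampling fakes once; reusing the same model across stages is
    what retains knowledge between them."""
    fake_chunks = np.array_split(np.arange(len(fake_ds)), stages)
    for chunk in fake_chunks:
        stage_data = list(real_ds) + [fake_ds[i] for i in chunk]
        model.fit(stage_data)   # placeholder training call
    return model

class _StubModel:  # stand-in so the sketch runs
    def fit(self, data): print(f"stage with {len(data)} samples")

multistage_disjoint_training(_StubModel(), real_ds=range(100),
                             fake_ds=list(range(500)), stages=5)
```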

[77] Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification

Md Ashik Khan, Md Nahid Siddique

Main category: cs.CV

TL;DR: Parameter-efficient training methods (frozen encoders, LoRA, adapters) achieve better chest X-ray classification performance with 40x fewer parameters than full fine-tuning, though they require calibration correction for clinical use.

DetailsMotivation: Multimodal chest X-ray analysis typically requires computationally expensive fine-tuning of large vision-language models. The paper aims to explore parameter-efficient training strategies to reduce computational costs while maintaining performance.

Method: The study evaluates various parameter-efficient training (PET) strategies including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on chest X-ray datasets. To prevent data leakage, pathology terms were redacted from reports while retaining clinical context. Experiments were conducted on Indiana University Chest X-Ray dataset (3,851 pairs) and validated on CheXpert (224,316 images).

Result: All PET methods achieved AUROC between 0.892-0.908 with only 2.37M parameters (2.51% of total), outperforming full fine-tuning (0.770 AUROC) which used 94.3M parameters. External validation confirmed scalability, with Adapter achieving best performance (0.7214 AUROC). However, PET methods showed degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049).

Conclusion: Frozen encoder strategies provide superior discrimination at substantially reduced computational cost, but calibration correction is essential for clinical deployment. Improvements come primarily from parameter allocation rather than cross-modal synergy.

Abstract: Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters, a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve >0.69 AUROC with <9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.
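
Of the PET methods compared, LoRA has a particularly compact formulation: keep the pretrained weight frozen and learn a low-rank additive update. A standard sketch (the paper's exact rank, scaling, and layer placement are not given, so those are defaults):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = Wx + (alpha/r) * B(Ax). Only A and B are trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen encoder weight
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # start as a zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))
```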

[78] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang

Main category: cs.CV

TL;DR: DAMP is a unified HSI restoration framework that uses degradation metrics as prompts instead of explicit degradation priors, enabling adaptive restoration under diverse degradations.

DetailsMotivation: Existing unified HSI restoration methods rely on explicit degradation priors/labels as prompts, which are difficult to obtain in real-world scenarios with complex mixed degradations.

Method: Proposes Degradation-Aware Metric Prompting (DAMP) with: 1) Spatial-spectral degradation metrics to quantify degradations as Degradation Prompts, 2) Spatial-Spectral Adaptive Module (SSAM) for dynamic feature modulation, 3) Mixture-of-Experts architecture using DP as gating router.

Result: Extensive experiments on natural and remote sensing HSI datasets show state-of-the-art performance and exceptional generalization capability under diverse, mixed, or unseen degradations.

Conclusion: DAMP provides a practical unified HSI restoration solution that doesn’t require explicit degradation priors, enabling robust performance across various degradation scenarios with strong generalization.

Abstract: Unified hyperspectral image (HSI) restoration aims to recover various degraded HSIs using a single model, offering great practical value. However, existing methods often depend on explicit degradation priors (e.g., degradation labels) as prompts to guide restoration, which are difficult to obtain due to complex and mixed degradations in real-world scenarios. To address this challenge, we propose a Degradation-Aware Metric Prompting (DAMP) framework. Instead of relying on predefined degradation priors, we design spatial-spectral degradation metrics to continuously quantify multi-dimensional degradations, serving as Degradation Prompts (DP). These DP enable the model to capture cross-task similarities in degradation distributions and enhance shared feature learning. Furthermore, we introduce a Spatial-Spectral Adaptive Module (SSAM) that dynamically modulates spatial and spectral feature extraction through learnable parameters. By integrating SSAM as experts within a Mixture-of-Experts architecture, and using DP as the gating router, the framework enables adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations. Extensive experiments on natural and remote sensing HSI datasets show that DAMP achieves state-of-the-art performance and demonstrates exceptional generalization capability. Code is publicly available at https://github.com/MiliLab/DAMP.
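
A sketch of the routing idea: the Mixture-of-Experts gate reads the degradation-prompt vector rather than the image features themselves. All shapes, the expert design (plain convolutions standing in for SSAM), and the soft routing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DegradationGatedMoE(nn.Module):
    """Mixture-of-Experts where the gate is driven by a degradation-prompt
    vector (quantified degradation metrics), not the image itself."""
    def __init__(self, dp_dim=8, channels=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(n_experts)])
        self.router = nn.Linear(dp_dim, n_experts)

    def forward(self, feats, dp):          # feats: (B,C,H,W), dp: (B,dp_dim)
        w = torch.softmax(self.router(dp), dim=-1)                    # (B,E)
        outs = torch.stack([e(feats) for e in self.experts], dim=1)  # (B,E,C,H,W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)

moe = DegradationGatedMoE()
y = moe(torch.randn(2, 64, 16, 16), torch.rand(2, 8))
```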

[79] Fixed-Threshold Evaluation of a Hybrid CNN-ViT for AI-Generated Image Detection Across Photos and Art

Md Ashik Khan, Arafat Alam Jion

Main category: cs.CV

TL;DR: The paper introduces a fixed-threshold evaluation protocol for AI-generated image detectors to prevent misleading robustness estimates, revealing that ViTs maintain performance under compression while CNNs degrade, and semantic patterns provide more reliable detection cues than forensic artifacts.

DetailsMotivation: Existing AI-generated image detection methods produce misleading robustness estimates by retuning decision thresholds for each post-processing condition, which artificially inflates performance metrics and masks deployment failures. There's a need for deployment-relevant evaluation that maintains fixed thresholds across transformations.

Method: Introduced a fixed-threshold evaluation protocol that selects decision thresholds once on clean validation data and holds them fixed across all post-processing transformations. Used a lightweight CNN-ViT hybrid architecture with gated fusion and optional frequency enhancement, evaluated at three operating points (Low-FPR, ROC-optimal, Best-F1) under systematic degradation testing.

Result: Revealed a forensic-semantic spectrum: frequency-aided CNNs excel on pristine photos (93.33%) but collapse under compression (61.49%), while ViTs degrade minimally (92.86% to 88.36%). All architectures achieved 15% higher AUROC on artistic content (0.901-0.907) vs photorealistic images (0.747-0.759). Hybrid approach achieved balanced performance: 91.4% accuracy on photos, 89.7% on art/graphics, and 98.3% on CIFAKE.

Conclusion: Fixed-threshold evaluation eliminates retuning inflation and reveals genuine robustness gaps, providing actionable deployment guidance: prefer CNNs for clean photo verification, ViTs for compressed content, and hybrids for art/graphics screening. Semantic patterns provide fundamentally more reliable detection cues than forensic artifacts.

Abstract: AI image generators create both photorealistic images and stylized art, necessitating robust detectors that maintain performance under common post-processing transformations (JPEG compression, blur, downscaling). Existing methods optimize single metrics without addressing deployment-critical factors such as operating point selection and fixed-threshold robustness. This work addresses misleading robustness estimates by introducing a fixed-threshold evaluation protocol that holds decision thresholds, selected once on clean validation data, fixed across all post-processing transformations. Traditional methods retune thresholds per condition, artificially inflating robustness estimates and masking deployment failures. We report deployment-relevant performance at three operating points (Low-FPR, ROC-optimal, Best-F1) under systematic degradation testing using a lightweight CNN-ViT hybrid with gated fusion and optional frequency enhancement. Our evaluation exposes a statistically validated forensic-semantic spectrum: frequency-aided CNNs excel on pristine photos but collapse under compression (93.33% to 61.49%), whereas ViTs degrade minimally (92.86% to 88.36%) through robust semantic pattern recognition. Multi-seed experiments demonstrate that all architectures achieve 15% higher AUROC on artistic content (0.901-0.907) versus photorealistic images (0.747-0.759), confirming that semantic patterns provide fundamentally more reliable detection cues than forensic artifacts. Our hybrid approach achieves balanced cross-domain performance: 91.4% accuracy on tiny-genimage photos, 89.7% on AiArtData art/graphics, and 98.3% (competitive) on CIFAKE. Fixed-threshold evaluation eliminates retuning inflation, reveals genuine robustness gaps, and yields actionable deployment guidance: prefer CNNs for clean photo verification, ViTs for compressed content, and hybrids for art/graphics screening.
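
The protocol itself is compact: pick the threshold once on clean validation scores (here at a low-FPR operating point) and hold it fixed when scoring degraded inputs, so any accuracy gap reflects genuine robustness. The score distributions below are synthetic stand-ins.

```python
import numpy as np

def pick_threshold(scores_clean, labels_clean, target_fpr=0.05):
    """Choose the threshold once, on clean validation data; never retune."""
    neg = np.sort(scores_clean[labels_clean == 0])
    return neg[int((1 - target_fpr) * len(neg))]

def accuracy_at(threshold, scores, labels):
    return ((scores >= threshold) == labels.astype(bool)).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
clean = rng.normal(loc=labels * 2.0)   # clean-validation detector scores
jpeg = rng.normal(loc=labels * 1.2)    # degraded-test scores (weaker signal)
t = pick_threshold(clean, labels)
print(accuracy_at(t, clean, labels),
      accuracy_at(t, jpeg, labels))    # the gap is the robustness estimate
```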

[80] MuS-Polar3D: A Benchmark Dataset for Computational Polarimetric 3D Imaging under Multi-Scattering Conditions

Puyun Wang, Kaimin Yu, Huayang He, Xianyu Wu

Main category: cs.CV

TL;DR: MuS-Polar3D is a benchmark dataset for polarization-based underwater 3D imaging with controlled scattering conditions, enabling fair comparison of methods; baseline evaluations on it reach a best mean angular error of 15.49°.

DetailsMotivation: Existing polarization-based underwater 3D imaging datasets lack diversity in scattering and observation conditions, hindering fair comparisons between single-view and multi-view methods.

Method: Constructed MuS-Polar3D dataset with 42 objects captured under 7 controlled scattering conditions and 5 viewpoints, including high-precision 3D models, normal maps, and masks. Proposed a two-stage pipeline: descattering followed by 3D reconstruction.

Result: The dataset supports multiple vision tasks, and extensive baseline evaluations achieve a best mean angular error of 15.49 degrees. It is the first publicly available benchmark for polarization-based underwater 3D imaging under quantitatively controlled turbidity.
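
For reference, the mean angular error between predicted and ground-truth normal maps is conventionally computed as below; this is a generic sketch of the metric, not the benchmark's evaluation code.

```python
import numpy as np

def mean_angular_error(pred, gt, mask=None):
    """Mean angle (in degrees) between predicted and ground-truth surface
    normals; the quantity reported as 15.49 degrees above.
    pred, gt: (H, W, 3) normal maps; mask: optional (H, W) foreground mask."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    gt = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip((pred * gt).sum(axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    return float(ang[mask > 0].mean()) if mask is not None else float(ang.mean())
```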

Conclusion: MuS-Polar3D enables accurate 3D reconstruction and fair algorithm evaluation under controllable scattering conditions, advancing polarization-based underwater imaging research.

Abstract: Polarization-based underwater 3D imaging exploits polarization cues to suppress background scattering, exhibiting distinct advantages in turbid water. Although data-driven polarization-based underwater 3D reconstruction methods show great potential, existing public datasets lack sufficient diversity in scattering and observation conditions, hindering fair comparisons among different approaches, including single-view and multi-view polarization imaging methods. To address this limitation, we construct MuS-Polar3D, a benchmark dataset comprising polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high-precision 3D models (+/- 0.05 mm accuracy), normal maps, and foreground masks. The dataset supports multiple vision tasks, including normal estimation, object segmentation, descattering, and 3D reconstruction. Inspired by computational imaging, we further decouple underwater 3D reconstruction under scattering into a two-stage pipeline, namely descattering followed by 3D reconstruction, from an imaging-chain perspective. Extensive evaluations using multiple baseline methods under complex scattering conditions demonstrate the effectiveness of the proposed benchmark, achieving a best mean angular error of 15.49 degrees. To the best of our knowledge, MuS-Polar3D is the first publicly available benchmark dataset for quantitative turbidity underwater polarization-based 3D imaging, enabling accurate reconstruction and fair algorithm evaluation under controllable scattering conditions. The dataset and code are publicly available at https://github.com/WangPuyun/MuS-Polar3D.

[81] DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li, Xiangyang Ji

Main category: cs.CV

TL;DR: Proposes a method to address diversity degradation in GRPO-based image generation by introducing distributional creativity rewards and structure-aware regularization to improve visual diversity while maintaining quality.

DetailsMotivation: Traditional GRPO improves image quality but causes homogenized outputs lacking creativity and visual diversity in later training stages, limiting application scenarios. This stems from single-sample reward signals and misaligned regularization that neglects early-stage denoising's role in preserving diversity.

Method: Two-pronged approach: 1) Distributional creativity bonus using spectral clustering over semantically grouped samples to allocate exploratory rewards based on group sizes, encouraging novel visual modes. 2) Structure-aware regularization that enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency.
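
A minimal sketch of the reward-level idea, assuming per-sample embeddings are available and using scikit-learn's spectral clustering; the bonus form (inverse cluster size) and all names are illustrative, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def creativity_bonus(embeddings, n_clusters=4, scale=1.0):
    """Sketch of a distributional creativity bonus: cluster the samples
    generated from one caption, then give samples in smaller (rarer)
    clusters a larger exploratory reward.
    embeddings: (G, D) array for one GRPO group."""
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors",
        n_neighbors=5, random_state=0,
    ).fit_predict(embeddings)
    sizes = np.bincount(labels, minlength=n_clusters)
    return scale / sizes[labels]  # inverse-frequency: rarer modes score higher

rng = np.random.default_rng(0)
group = rng.normal(size=(16, 64))        # 16 images per caption, 64-d embeddings
rewards = 0.5 + creativity_bonus(group)  # placeholder quality reward + bonus
```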

Result: Achieves 13-18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.

Conclusion: The proposed method effectively addresses diversity degradation in GRPO by tackling both reward modeling and generation dynamics, enabling better quality-diversity trade-offs in image generation.

Abstract: Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, which restricts its application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13%–18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.

[82] Hierarchy-Aware Fine-Tuning of Vision-Language Models

Jiayu Li, Rajesh Gangireddy, Samet Akcay, Wei Cheng, Juhua Hu

Main category: cs.CV

TL;DR: Efficient hierarchy-aware fine-tuning framework for Vision-Language Models that enforces structural consistency in hierarchical classification with minimal parameter updates.

DetailsMotivation: Standard VLM adaptation for hierarchical classification treats labels as flat categories, requires expensive full fine-tuning, and produces inconsistent predictions across taxonomy levels.

Method: Combines Tree-Path KL Divergence (TP-KL) for vertical coherence along ground-truth label paths and Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) for consistency among sibling classes, integrated with lightweight LoRA adaptation in the VLM’s shared embedding space.
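
To make the sibling-smoothing idea concrete, here is a minimal PyTorch sketch of a HiSCE-style loss in which a fraction eps of the label mass is spread over the target's sibling classes; the exact formulation in the paper may differ, and the toy taxonomy is illustrative.

```python
import torch
import torch.nn.functional as F

def hisce_loss(logits, target, siblings, eps=0.1):
    """Sibling-smoothed cross-entropy sketch: replace the one-hot target with
    a distribution that keeps (1 - eps) on the true class and spreads eps
    uniformly over its siblings in the taxonomy.
    logits: (B, C); target: (B,) class ids; siblings: dict class_id -> ids."""
    soft = torch.zeros_like(logits)
    for i, t in enumerate(target.tolist()):
        sib = siblings.get(t, [])
        soft[i, t] = 1.0 - eps if sib else 1.0
        for s in sib:
            soft[i, s] = eps / len(sib)
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Toy taxonomy: classes 0/1 share a parent, as do classes 2/3.
siblings = {0: [1], 1: [0], 2: [3], 3: [2]}
loss = hisce_loss(torch.randn(4, 4), torch.tensor([0, 1, 2, 3]), siblings)
```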

Result: Experiments across multiple benchmarks show consistent gains in Full-Path Accuracy and reductions in Tree-based Inconsistency Error with minimal parameter overhead.

Conclusion: Provides an efficient strategy for adapting VLMs to structured taxonomies while maintaining hierarchical consistency.

Abstract: Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM’s shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.

[83] Vision Transformers are Circulant Attention Learners

Dongchen Han, Tianyu Li, Ziyi Wang, Gao Huang

Main category: cs.CV

TL;DR: Circulant Attention: A novel attention mechanism that models self-attention as Block Circulant matrices to achieve O(N log N) complexity while maintaining model capacity, addressing quadratic complexity issues in vision Transformers.

DetailsMotivation: Self-attention's quadratic complexity creates computational burdens in high-resolution vision tasks. Previous methods using handcrafted patterns (locality/sparsity) compromise model capacity. The paper aims to find an efficient attention mechanism that preserves capacity while reducing complexity.

Method: Identifies that self-attention matrices in vision Transformers approximate Block Circulant matrices with Circulant Blocks (BCCB). Models attention maps as nearest BCCB matrices and develops efficient O(N log N) computation algorithms using structured matrix properties.
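
The enabling fact is that circulant matrices are diagonalized by the Fourier basis, so multiplying by one costs O(N log N). A 1-D NumPy illustration follows; the BCCB case used in the paper is the 2-D block analogue handled with 2-D FFTs.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x in
    O(N log N): circulant matvec equals circular convolution, which the
    FFT turns into pointwise multiplication."""
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

# Verify against the explicit N x N circulant matrix.
rng = np.random.default_rng(0)
N = 8
c, x = rng.normal(size=N), rng.normal(size=N)
C = np.stack([np.roll(c, j) for j in range(N)], axis=1)  # column j = shift of c
assert np.allclose(C @ x, circulant_matvec(c, x))
```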

Result: Achieves O(N log N) computational complexity (vs quadratic), maintains capacity close to standard self-attention, and demonstrates effectiveness across diverse visual tasks through extensive experiments.

Conclusion: Circulant Attention provides an efficient alternative to self-attention for vision Transformers, balancing computational efficiency with model capacity preservation, making it promising for practical high-resolution applications.

Abstract: The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed \textbf{Circulant Attention} by exploiting the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates the Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this interesting pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient computation algorithm for fast calculation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design is inspired by the inherent efficient paradigm, it not only delivers $\mathcal{O}(N\log N)$ computation complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention for vision Transformer architectures. Code is available at https://github.com/LeapLabTHU/Circulant-Attention.

[84] EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal

Sanghyun Jo, Donghwan Lee, Eunji Jung, Seong Je Oh, Kyungsu Kim

Main category: cs.CV

TL;DR: EraseLoRA: A dataset-free framework for object removal that uses background-aware reasoning and test-time adaptation instead of attention manipulation, achieving better results than existing methods.

DetailsMotivation: Object removal requires preventing target reappearance and reconstructing occluded background with structural fidelity, not just plausible hole filling. Current dataset-free methods that manipulate self-attention fail by misinterpreting non-target foregrounds as background (causing unwanted object regeneration) and disrupting fine details through direct attention intervention.

Method: Two-stage approach: 1) Background-aware Foreground Exclusion (BFE) uses multimodal LLMs to separate target foreground, non-target foregrounds, and clean background from single image-mask pair without supervision. 2) Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization treating background subtypes as complementary pieces, enforcing consistent integration through reconstruction and alignment objectives without explicit attention intervention.

Result: Validated as plug-in to pretrained diffusion models across object removal benchmarks, showing consistent improvements over dataset-free baselines and competitive results against dataset-driven methods.

Conclusion: EraseLoRA demonstrates that replacing attention surgery with background-aware reasoning and test-time adaptation enables effective object removal without dataset dependency, addressing key limitations of current approaches while maintaining structural and contextual fidelity.

Abstract: Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.

[85] Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration

Unnati Saraswat, Tarun Rao, Namah Gupta, Shweta Swami, Shikhar Sharma, Prateek Narang, Dhruv Kumar

Main category: cs.CV

TL;DR: The paper introduces two new tasks for advertising/digital media: context-aware object insertion and sponsor-product logo augmentation, with corresponding datasets.

DetailsMotivation: Current image editing methods using VLMs and diffusion models often fail to ensure contextual appropriateness of inserted objects, particularly for advertising and digital media applications.

Method: Proposes two new tasks: (1) context-aware object insertion requiring category prediction, generation, and plausible placement; (2) sponsor-product logo augmentation involving product detection and correct brand logo insertion. Built two new datasets with category annotations, placement regions, and sponsor-product labels.

Result: Two new datasets created to support the proposed tasks, containing comprehensive annotations for category, placement regions, and sponsor-product relationships.

Conclusion: The paper addresses a gap in contextual appropriateness for image editing in advertising/digital media by defining new tasks and providing datasets to enable research in this direction.

Abstract: Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are \emph{contextually appropriate}. We introduce two new tasks for advertising and digital media: (1) \emph{context-aware object insertion}, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) \emph{sponsor-product logo augmentation}, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.

[86] Exploration of Reproducible Generated Image Detection

Yihang Duan

Main category: cs.CV

TL;DR: This paper analyzes reproducibility and generalizability issues in AI-Generated Content (AIGC) image detection, identifying key problems and providing empirical evidence for improvement.

DetailsMotivation: The field of AIGC image detection faces two core problems: poor reproducibility and insufficient generalizability, which hinder practical application. The study aims to address these challenges by examining existing research and identifying root causes.

Method: The researchers reviewed 7 key papers on AIGC detection, constructed a lightweight test dataset, and reproduced a representative detection method. They analyzed reproducibility issues by examining omitted details and overfitting patterns.

Result: Basic performance can be reproduced when strictly following core procedures, but detection performance drops sharply when preprocessing disrupts key features or when testing across different generators. The study identified two root causes: papers omitting implicit details (preprocessing, parameters) and methods overfitting to exclusive generator features rather than learning universal AIGC characteristics.

Conclusion: This research provides empirical evidence for improving AIGC detection reproducibility and offers reference directions for researchers to disclose experimental details more comprehensively and verify method generalizability across different generators.

Abstract: While the technology for detecting AI-Generated Content (AIGC) images has advanced rapidly, the field still faces two core issues: poor reproducibility and insufficient generalizability, which hinder the practical application of such technologies. This study addresses these challenges by reviewing 7 key papers on AIGC detection, constructing a lightweight test dataset, and reproducing a representative detection method. Through this process, we identify the root causes of the reproducibility dilemma in the field: firstly, papers often omit implicit details such as preprocessing steps and parameter settings; secondly, most detection methods overfit to exclusive features of specific generators rather than learning universal intrinsic features of AIGC images. Experimental results show that basic performance can be reproduced when strictly following the core procedures described in the original papers. However, detection performance drops sharply when preprocessing disrupts key features or when testing across different generators. This research provides empirical evidence for improving the reproducibility of AIGC detection technologies and offers reference directions for researchers to disclose experimental details more comprehensively and verify the generalizability of their proposed methods.

[87] Towards Long-window Anchoring in Vision-Language Model Distillation

Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li

Main category: cs.CV

TL;DR: LAid: A knowledge distillation method that transfers long-range attention mechanisms from large to small vision-language models, enabling up to 3.2x longer effective context windows while maintaining performance on standard benchmarks.

DetailsMotivation: Small vision-language models (VLMs) have limited context windows and consequently fail at linguistics-photography alignment, even though large VLMs demonstrate strong long-context understanding. The challenge is to transfer long-range attention capabilities from large models to small ones.

Method: LAid uses two complementary components: (1) progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed.
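
A minimal sketch of component (1), assuming teacher and student attention maps are available; the weighting schedule and loss form here are assumptions, not the paper's specification.

```python
import torch

def distance_weighted_attn_loss(attn_s, attn_t, alpha=1.0):
    """Match student attention to teacher attention, up-weighting entries
    with large position offset |i - j|; alpha can grow during training so
    long-range structure is emphasized progressively.
    attn_s, attn_t: (B, H, N, N) attention maps."""
    N = attn_s.shape[-1]
    idx = torch.arange(N)
    dist = (idx[None, :] - idx[:, None]).abs().float()  # (N, N) offsets
    w = (1.0 + dist) ** alpha
    w = w / w.mean()                                    # keep loss scale stable
    return (w * (attn_s - attn_t) ** 2).mean()

loss = distance_weighted_attn_loss(
    torch.softmax(torch.randn(2, 4, 16, 16), -1),
    torch.softmax(torch.randn(2, 4, 16, 16), -1),
    alpha=2.0,
)
```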

Result: LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis shows LAid preserves crucial low-frequency attention components that conventional methods fail to transfer.

Conclusion: LAid provides practical techniques for building more efficient long-context VLMs and offers theoretical insights into how positional understanding emerges and transfers during distillation, addressing the window size limitations of small VLMs.

Abstract: While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students’ capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.

[88] LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

Shinnosuke Hirano, Yuiga Wada, Kazuki Matsuda, Seitaro Otsuki, Komei Sugiura

Main category: cs.CV

TL;DR: Pearl is an LLM-free supervised metric for image caption evaluation that works in both reference-based and reference-free settings, outperforming existing LLM-free metrics on multiple datasets.

DetailsMotivation: Existing LLM-based metrics show bias by favoring their own generations, while most LLM-free metrics lack high performance despite being more neutral. There's a need for a metric that combines neutrality with strong evaluation capabilities.

Method: Proposes Pearl, an LLM-free supervised metric that learns representations of image-caption and caption-caption similarities. Also constructs a large human-annotated dataset with ~333k judgments from 2,360 annotators across 75k+ images.
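
As a hedged illustration of the LLM-free supervised design, the sketch below regresses a score from image-caption and caption-caption similarity features over precomputed embeddings; the layers and names are placeholders, not Pearl's actual architecture.

```python
import torch
import torch.nn as nn

class SimilarityScoreHead(nn.Module):
    """Toy metric head: predict a caption-quality score from image-candidate
    and candidate-reference similarity features."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim + 2, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, img, cand, ref):
        # img, cand, ref: (B, dim) L2-normalized embeddings; in the
        # reference-free setting ref can be a zero vector.
        s_ic = (img * cand).sum(-1, keepdim=True)  # image-caption cosine
        s_cr = (cand * ref).sum(-1, keepdim=True)  # caption-caption cosine
        return self.mlp(torch.cat([img * cand, cand * ref, s_ic, s_cr], -1))

head = SimilarityScoreHead()
score = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```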

Result: Pearl outperforms other LLM-free metrics on Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings.

Conclusion: Pearl provides a high-performing, neutral alternative to LLM-based metrics for image caption evaluation, with applicability to both reference-based and reference-free scenarios, supported by extensive human annotation data.

Abstract: We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, their neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image–caption and caption–caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics, which comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.

[89] UltraLBM-UNet: Ultralight Bidirectional Mamba-based Model for Skin Lesion Segmentation

Linxuan Fan, Juntao Jiang, Weixuan Liu, Zhucun Xue, Jiajun Lv, Jiangning Zhang, Yong Liu

Main category: cs.CV

TL;DR: UltraLBM-UNet: A lightweight U-Net variant for skin lesion segmentation using bidirectional Mamba-based global modeling with multi-branch local feature perception, achieving SOTA performance with minimal parameters (0.034M) and computational cost (0.060 GFLOPs).

DetailsMotivation: Existing skin lesion segmentation methods have limitations in accuracy, robustness, and computational efficiency, making them unsuitable for point-of-care deployment in clinical settings where resource-efficient yet accurate lesion analysis is crucial.

Method: Proposes UltraLBM-UNet, a lightweight U-Net variant that integrates bidirectional Mamba-based global modeling mechanism with multi-branch local feature perception. Combines efficient local feature injection with bidirectional state-space modeling for rich contextual interaction while maintaining computational compactness. Also introduces hybrid knowledge distillation to train an ultra-compact student model (UltraLBM-UNet-T).
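
A generic sketch of a hybrid distillation objective for segmentation (soft teacher targets plus hard ground-truth labels); the temperature, weighting, and normalization are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_kd_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Per-pixel KL against the teacher's softened logits, combined with
    cross-entropy against ground-truth masks.
    logits: (B, C, H, W); labels: (B, H, W) class ids."""
    h, w = student_logits.shape[-2:]
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T) / (h * w)
    ce = F.cross_entropy(student_logits, labels)
    return lam * kd + (1 - lam) * ce

loss = hybrid_kd_loss(torch.randn(2, 2, 32, 32), torch.randn(2, 2, 32, 32),
                      torch.randint(0, 2, (2, 32, 32)))
```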

Result: Extensive experiments on ISIC 2017, ISIC 2018, and PH2 datasets show state-of-the-art segmentation accuracy, outperforming existing lightweight and Mamba counterparts with only 0.034M parameters and 0.060 GFLOPs. The distilled variant UltraLBM-UNet-T achieves competitive performance with only 0.011M parameters and 0.019 GFLOPs.

Conclusion: UltraLBM-UNet demonstrates suitability for point-of-care deployment by achieving accurate and robust lesion segmentation with minimal computational resources, addressing the need for efficient clinical decision-making tools in dermatology.

Abstract: Skin lesion segmentation is a crucial step in dermatology for guiding clinical decision-making. However, existing methods for accurate, robust, and resource-efficient lesion analysis have limitations, including low performance and high computational complexity. To address these limitations, we propose UltraLBM-UNet, a lightweight U-Net variant that integrates a bidirectional Mamba-based global modeling mechanism with multi-branch local feature perception. The proposed architecture integrates efficient local feature injection with bidirectional state-space modeling, enabling richer contextual interaction across spatial dimensions while maintaining computational compactness suitable for point-of-care deployment. Extensive experiments on the ISIC 2017, ISIC 2018, and PH2 datasets demonstrate that our model consistently achieves state-of-the-art segmentation accuracy, outperforming existing lightweight and Mamba counterparts with only 0.034M parameters and 0.060 GFLOPs. In addition, we introduce a hybrid knowledge distillation strategy to train an ultra-compact student model, where the distilled variant UltraLBM-UNet-T, with only 0.011M parameters and 0.019 GFLOPs, achieves competitive segmentation performance. These results highlight the suitability of UltraLBM-UNet for point-of-care deployment, where accurate and robust lesion analyses are essential. The source code is publicly available at https://github.com/LinLinLin-X/UltraLBM-UNet.

[90] From Shallow Humor to Metaphor: Towards Label-Free Harmful Meme Detection via LMM Agent Self-Improvement

Jian Lang, Rongpei Hong, Ting Zhong, Leiting Chen, Qiang Gao, Fan Zhou

Main category: cs.CV

TL;DR: ALARM is a label-free harmful meme detection framework using LMM agent self-improvement that leverages explicit memes to iteratively learn to detect complex harmful content without manual annotation.

DetailsMotivation: Current harmful meme detection methods require large-scale labeled data with substantial manual annotation efforts, limiting adaptability to continually evolving harmful content in dynamic online environments.

Method: ALARM uses Confidence-based Explicit Meme Identification to isolate explicit memes with pseudo-labels, then employs Pairwise Learning Guided Agent Self-Improvement where explicit memes are organized into contrastive pairs to refine a learner LMM agent that autonomously derives detection cues.
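
The confidence-gating step might look like the sketch below, where classify is an assumed callable wrapping the LMM; the interface, threshold, and label strings are illustrative only.

```python
def identify_explicit_memes(memes, classify, threshold=0.9):
    """Confidence-based Explicit Meme Identification (sketch): memes the LMM
    judges with high confidence get pseudo-labels; the rest stay unlabeled
    until later self-improvement rounds."""
    explicit, implicit = [], []
    for meme in memes:
        label, conf = classify(meme)  # assumed LMM wrapper: (label, confidence)
        if conf >= threshold:
            explicit.append((meme, label))  # pseudo-label assigned
        else:
            implicit.append(meme)
    return explicit, implicit

def make_contrastive_pairs(explicit):
    """Reorganize explicit memes into (harmful, harmless) pairs for the
    pairwise self-improvement step."""
    pos = [m for m, y in explicit if y == "harmful"]
    neg = [m for m, y in explicit if y == "harmless"]
    return list(zip(pos, neg))
```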

Result: Experiments on three diverse datasets show superior performance and strong adaptability to newly evolved memes, even outperforming label-driven methods.

Conclusion: ALARM demonstrates the potential of label-free frameworks as scalable solutions for adapting to novel forms and topics of harmful memes in dynamic online environments.

Abstract: The proliferation of harmful memes on online media poses significant risks to public health and stability. Existing detection methods heavily rely on large-scale labeled data for training, which necessitates substantial manual annotation efforts and limits their adaptability to the continually evolving nature of harmful content. To address these challenges, we present ALARM, the first lAbeL-free hARmful Meme detection framework powered by Large Multimodal Model (LMM) agent self-improvement. The core innovation of ALARM lies in exploiting the expressive information from “shallow” memes to iteratively enhance its ability to tackle more complex and subtle ones. ALARM consists of a novel Confidence-based Explicit Meme Identification mechanism that isolates the explicit memes from the original dataset and assigns them pseudo-labels. Besides, a new Pairwise Learning Guided Agent Self-Improvement paradigm is introduced, where the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. This agent autonomously derives high-level detection cues from these pairs, which in turn empower the agent itself to handle complex and challenging memes effectively. Experiments on three diverse datasets demonstrate the superior performance and strong adaptability of ALARM to newly evolved memes. Notably, our method even outperforms label-driven methods. These results highlight the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments.

[91] GaussianEM: Model compositional and conformational heterogeneity using 3D Gaussians

Bintao He, Yiran Cheng, Hongjia Li, Xiang Gao, Xin Gao, Fa Zhang, Renmin Han

Main category: cs.CV

TL;DR: GaussianEM is a cryo-EM analysis framework using Gaussian pseudo-atomic models to simultaneously handle compositional and conformational heterogeneity, bridging density-based and atomic models.

DetailsMotivation: Understanding protein flexibility and dynamics is crucial for protein function study. Cryo-EM enables observation of macromolecular dynamics, but analyzing datasets with both continuous motions and discrete states remains challenging.

Method: GaussianEM uses a Gaussian pseudo-atomic framework with a two-encoder-one-decoder architecture to map cryo-EM images to individual Gaussian components, representing structural variability through changes in Gaussian parameters.

Result: The approach provides intuitive description of conformational changes, preserves local structural consistency along transition trajectories, and bridges gap between density-based and atomic models.

Conclusion: GaussianEM demonstrates effectiveness on both simulated and experimental datasets, offering a powerful tool for analyzing cryo-EM data with complex heterogeneity.

Abstract: Understanding protein flexibility and its dynamic interactions with other molecules is essential for protein function study. Cryogenic electron microscopy (cryo-EM) provides an opportunity to directly observe macromolecular dynamics. However, analyzing datasets that contain both continuous motions and discrete states remains highly challenging. Here we present GaussianEM, a Gaussian pseudo-atomic framework that simultaneously models compositional and conformational heterogeneity from experimental cryo-EM images. GaussianEM employs a two-encoder-one-decoder architecture to map an image to its individual Gaussian components, and represent structural variability through changes in Gaussian parameters. This approach provides an intuitive and interpretable description of conformational changes, preserves local structural consistency along the transition trajectories, and naturally bridges the gap between density-based models and corresponding atomic models. We demonstrate the effectiveness of GaussianEM on both simulated and experimental datasets.

[92] TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant

Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou

Main category: cs.CV

TL;DR: LCMP introduces the first Long-Context MLLM Personalization benchmark to evaluate multimodal LLMs on long-context personalized dialogues, with TAME as a training-free baseline using double memories and RA2G for context-aware responses.

DetailsMotivation: Existing MLLM personalization methods focus only on simple visual identification and textual replacement of personalized concepts, lacking support for long-context conversations where an ideal assistant should engage in extended dialogues and learn from past histories.

Method: Proposes LCMP benchmark for evaluating long-context personalization, and TAME framework with double memories (temporal and persistent) to manage concept variations, plus RA2G (Retrieve-then-Align Augmented Generation) for extracting context-fitted information from retrieved knowledge.

Result: Experiments on LCMP show TAME achieves best performance, demonstrating remarkable and evolving interaction experiences in long-context scenarios compared to existing methods.

Conclusion: LCMP addresses the gap in long-context MLLM personalization evaluation, and TAME provides an effective training-free solution with state-aware memory management and context alignment for improved personalized dialogue experiences.

Abstract: Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., “A yellow puppy” -> “Your puppy Mochi”), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.

[93] CausalFSFG: Rethinking Few-Shot Fine-Grained Visual Categorization from Causal Perspective

Zhiwen Yang, Jinglin Xu, Yuxin Peng

Main category: cs.CV

TL;DR: A causal inference approach for few-shot fine-grained visual categorization that addresses biased data distributions through causal intervention to eliminate spurious correlations.

DetailsMotivation: Existing FS-FGVC methods overlook that support samples act as confounding variables, introducing biased data distribution and misleading discriminative feature extraction, which hampers performance.

Method: Proposes CausalFSFG with two key components: 1) Interventional multi-scale encoder (IMSE) for sample-level interventions, and 2) Interventional masked feature reconstruction (IMFR) for feature-level interventions, both based on structural causal modeling to reveal real causalities.

Result: Achieves new state-of-the-art performance on widely-used datasets including CUB-200-2011, Stanford Dogs, and Stanford Cars through extensive experiments and thorough analyses.

Conclusion: The causal inference approach effectively addresses biased data distributions in FS-FGVC by eliminating spurious correlations through causal intervention, leading to superior classification performance.

Abstract: Few-shot fine-grained visual categorization (FS-FGVC) focuses on identifying various subcategories within a common superclass given just one or few support examples. Most existing methods aim to boost classification accuracy by enriching the extracted features with discriminative part-level details. However, they often overlook the fact that the set of support samples acts as a confounding variable, which hampers the FS-FGVC performance by introducing biased data distribution and misguiding the extraction of discriminative features. To address this issue, we propose a new causal FS-FGVC (CausalFSFG) approach inspired by causal inference for addressing biased data distributions through causal intervention. Specifically, based on the structural causal model (SCM), we argue that FS-FGVC infers the subcategories (i.e., effect) from the inputs (i.e., cause), whereas both the few-shot condition disturbance and the inherent fine-grained nature (i.e., large intra-class variance and small inter-class variance) lead to unobservable variables that bring spurious correlations, compromising the final classification performance. To further eliminate the spurious correlations, our CausalFSFG approach incorporates two key components: (1) Interventional multi-scale encoder (IMSE) conducts sample-level interventions, (2) Interventional masked feature reconstruction (IMFR) conducts feature-level interventions, which together reveal real causalities from inputs to subcategories. Extensive experiments and thorough analyses on the widely-used public datasets, including CUB-200-2011, Stanford Dogs, and Stanford Cars, demonstrate that our CausalFSFG achieves new state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/CausalFSFG_TMM.

[94] SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Zhiyuan Liu, Daocheng Fu, Pinlong Cai, Lening Wang, Ying Liu, Yilong Ren, Botian Shi, Jianqiang Wang

Main category: cs.CV

TL;DR: SymDrive: A unified diffusion-based framework for high-fidelity 3D simulation in autonomous driving that enables both photorealistic novel view rendering and interactive traffic editing without geometric or lighting artifacts.

DetailsMotivation: Existing methods for 3D simulation in autonomous driving fail to simultaneously achieve photorealistic rendering and interactive traffic editing, struggling with large-angle novel view synthesis and suffering from geometric/lighting artifacts during asset manipulation.

Method: Introduces Symmetric Auto-regressive Online Restoration paradigm that constructs paired symmetric views to recover fine-grained details via ground-truth-guided dual-view formulation, and uses auto-regressive strategy for consistent lateral view generation. Also includes training-free harmonization mechanism that treats vehicle insertion as context-aware inpainting for seamless lighting and shadow consistency.

Result: Extensive experiments demonstrate state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion, achieving high-quality rendering and scene editing simultaneously.

Conclusion: SymDrive successfully addresses the limitations of existing methods by providing a unified framework for joint high-quality rendering and scene editing in autonomous driving simulation, enabling both photorealistic novel view synthesis and realistic traffic manipulation.

Abstract: High-fidelity and controllable 3D simulation is essential for addressing the long-tail data scarcity in Autonomous Driving (AD), yet existing methods struggle to simultaneously achieve photorealistic rendering and interactive traffic editing. Current approaches often falter in large-angle novel view synthesis and suffer from geometric or lighting artifacts during asset manipulation. To address these challenges, we propose SymDrive, a unified diffusion-based framework capable of joint high-quality rendering and scene editing. We introduce a Symmetric Auto-regressive Online Restoration paradigm, which constructs paired symmetric views to recover fine-grained details via a ground-truth-guided dual-view formulation and utilizes an auto-regressive strategy for consistent lateral view generation. Furthermore, we leverage this restoration capability to enable a training-free harmonization mechanism, treating vehicle insertion as context-aware inpainting to ensure seamless lighting and shadow consistency. Extensive experiments demonstrate that SymDrive achieves state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.

[95] Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints

Mutiara Shabrina, Nova Kurnia Putri, Jefri Satria Ferdiansyah, Sabita Khansa Dewi, Novanto Yudistira

Main category: cs.CV

TL;DR: The paper analyzes the PPE framework for disentangled image editing, identifies limitations in its regularization strategy, and proposes L1 sparsity constraints to reduce semantic leakage while preserving identity.

DetailsMotivation: Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute unintentionally alters other semantic properties like identity or appearance. The PPE framework addresses this but has limitations in its regularization approach.

Method: Analyzes the PPE framework’s BERT-based attribute prediction and StyleGAN2-based image generation on CelebA-HQ. Identifies limitation in original regularization where latent updates remain dense. Proposes sparsity-based constraint using L1 regularization on latent space manipulation.
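
A minimal sketch of the L1-constrained manipulation, with edit_loss standing in for the PPE attribute objective; the optimizer, step count, and toy objective are assumptions.

```python
import torch

def sparse_latent_edit(w, edit_loss, l1_weight=0.01, steps=200, lr=0.05):
    """Sketch of L1-constrained latent manipulation: optimize an offset dw on
    a StyleGAN-like latent w so the edit objective decreases while ||dw||_1
    keeps the update sparse, leaving non-target semantics untouched."""
    dw = torch.zeros_like(w, requires_grad=True)
    opt = torch.optim.Adam([dw], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = edit_loss(w + dw) + l1_weight * dw.abs().sum()
        loss.backward()
        opt.step()
    return (w + dw).detach()

# Toy stand-in for the PPE attribute objective: push coordinate 0 toward 3.
w0 = torch.randn(1, 512)
edited = sparse_latent_edit(w0, lambda w: ((w[:, 0] - 3.0) ** 2).mean())
```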

Result: Experimental results show the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.

Conclusion: L1 sparsity regularization improves the PPE framework by enabling more disentangled image editing with reduced semantic leakage, achieving better attribute-specific modifications while maintaining identity preservation.

Abstract: Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.

[96] TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

Jiahong Yu, Ziqi Wang, Hailiang Zhao, Wei Zhai, Xueqiang Yan, Shuiguang Deng

Main category: cs.CV

TL;DR: TrackTeller is a temporal multimodal framework for 3D object grounding in driving scenes that uses LiDAR-image fusion and temporal reasoning to understand language references involving motion and interactions.

DetailsMotivation: Many referring expressions in driving scenes describe objects through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. Current methods lack temporal reasoning capabilities for understanding dynamic language references.

Method: TrackTeller integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. It constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics.

Result: On the NuPrompt benchmark, TrackTeller consistently improves language-grounded tracking performance, achieving 70% relative improvement in Average Multi-Object Tracking Accuracy and 3.15-3.4 times reduction in False Alarm Frequency compared to strong baselines.

Conclusion: Temporal reasoning is crucial for understanding natural language references in dynamic 3D scenes, and TrackTeller’s unified multimodal approach effectively addresses the challenge of grounding language expressions involving motion and interactions.

Abstract: Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.

[97] Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding

Zhiwang Zhou, Yuandong Pu, Xuming He, Yidi Liu, Yixin Chen, Junchao Gong, Xiang Zhuang, Wanghan Xu, Qinglong Cao, Shixiang Tang, Yihao Liu, Wenlong Zhang, Lei Bai

Main category: cs.CV

TL;DR: Omni-Weather is a multimodal foundation model that unifies weather generation and understanding in a single architecture, achieving SOTA performance in both tasks through shared processing and causal reasoning.

DetailsMotivation: Existing weather modeling methods treat accurate prediction and mechanistic interpretation in isolation, separating generation from understanding. There's a need for a unified approach that addresses both goals simultaneously.

Method: Omni-Weather integrates a radar encoder for weather generation tasks with unified processing using shared self-attention mechanisms. The authors construct a Chain-of-Thought dataset for causal reasoning in weather generation to enable interpretable outputs and improved perceptual quality.

Result: Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. The findings indicate that generative and understanding tasks in weather domain can mutually enhance each other.

Conclusion: Omni-Weather demonstrates the feasibility and value of unifying weather generation and understanding within a single multimodal foundation model architecture.

Abstract: Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.

[98] The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds

Subramanyam Sahoo, Jared Junkin

Main category: cs.CV

TL;DR: A mechanistic interpretability framework combining sparse autoencoder analysis and forensic manifold analysis to understand deepfake detection models’ decision processes.

DetailsMotivation: Deepfake detection models achieve high accuracy but remain opaque "black boxes" - their decision processes are not understood, limiting interpretability and robustness.

Method: Combines sparse autoencoder (SAE) analysis of internal network representations with novel forensic manifold analysis that probes feature responses to controlled forensic artifact manipulations.
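
A generic SAE probe of this kind can be sketched as follows; dimensions, penalty weight, and the active-feature statistic are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder trained with an L1 sparsity penalty,
    used to decompose one layer's activations into interpretable features."""
    def __init__(self, d_model=768, d_latent=8 * 768):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))
        return self.dec(z), z

def sae_loss(recon, h, z, l1=1e-3):
    return ((recon - h) ** 2).mean() + l1 * z.abs().mean()

def active_fraction(z, eps=1e-6):
    """Fraction of latent features that fire at least once on a batch; the
    paper reports this stays small in every layer."""
    return float((z.abs().max(dim=0).values > eps).float().mean())

sae = SparseAutoencoder()
h = torch.randn(256, 768)  # activations from one detector layer (placeholder)
recon, z = sae(h)
loss = sae_loss(recon, h, z)
print(active_fraction(z))
```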

Result: Only a small fraction of latent features are actively used in each layer, and geometric properties of the feature manifold (intrinsic dimensionality, curvature, feature selectivity) vary systematically with different deepfake artifacts.

Conclusion: Provides first step toward opening the “black box” of deepfake detectors, enabling identification of which learned features correspond to specific forensic artifacts and guiding development of more interpretable and robust models.

Abstract: Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model’s features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model’s feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the “black box” of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.

[99] Comparative Analysis of Deep Learning Models for Perception in Autonomous Vehicles

Jalal Khan

Main category: cs.CV

TL;DR: YOLOv8s outperforms YOLO-NAS for autonomous vehicle perception, cutting training time by 75% and achieving higher object detection accuracy (83% vs. 81%).

DetailsMotivation: To compare performance of emerging deep learning models (YOLO-NAS and YOLOv8) for object detection in autonomous vehicle perception systems, helping researchers understand real-world performance trade-offs.

Method: Captured custom dataset for autonomous vehicle perception tasks, then experimentally compared YOLO-NAS and YOLOv8 models using this dataset, focusing on training time and detection accuracy metrics.
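
For orientation, fine-tuning YOLOv8s on a custom dataset with the ultralytics package looks like the snippet below; the dataset YAML and hyperparameters are placeholders, not the paper's setup.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                  # COCO-pretrained small variant
model.train(data="custom_av_dataset.yaml",  # hypothetical dataset config
            epochs=100, imgsz=640)
metrics = model.val()                       # mAP and related metrics
```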

Result: YOLOv8s reduced training time by 75% compared to YOLO-NAS and achieved higher object detection accuracy (83% vs. 81%) when targeting maximum accuracy.

Conclusion: YOLOv8s demonstrates superior efficiency and accuracy for autonomous vehicle perception tasks compared to YOLO-NAS, providing valuable comparative insights for the research community working on real-world AV applications.

Abstract: Recently, a plethora of machine learning (ML) and deep learning (DL) algorithms have been proposed to achieve the efficiency, safety, and reliability of autonomous vehicles (AVs). The AVs use a perception system to detect, localize, and identify other vehicles, pedestrians, and road signs to perform safe navigation and decision-making. In this paper, we compare the performance of DL models, including YOLO-NAS and YOLOv8, for a detection-based perception task. We capture a custom dataset and experiment with both DL models using our custom dataset. Our analysis reveals that the YOLOv8s model saves 75% of training time compared to the YOLO-NAS model. In addition, the YOLOv8s model (83%) outperforms the YOLO-NAS model (81%) when the target is to achieve the highest object detection accuracy. These comparative analyses of these new emerging DL models will allow the relevant research community to understand the models’ performance under real-world use case scenarios.

[100] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu

Main category: cs.CV

TL;DR: UniPercept-Bench is a unified framework for evaluating perceptual-level image understanding across aesthetics, quality, structure, and texture domains, with a strong baseline model that outperforms existing MLLMs.

DetailsMotivation: Current multimodal large language models (MLLMs) have made significant progress in visual understanding tasks but lack capabilities in perceiving perceptual-level image features like aesthetics, quality, structure, and texture.

Method: The authors establish a hierarchical definition system, construct large-scale datasets, and develop UniPercept model trained via Domain-Adaptive Pre-Training and Task-Aligned Reinforcement Learning for both Visual Rating and Visual Question Answering tasks.

Result: UniPercept outperforms existing MLLMs on perceptual-level image understanding tasks and can serve as a plug-and-play reward model for text-to-image generation.

Conclusion: This work defines Perceptual-Level Image Understanding in the MLLM era and provides a comprehensive benchmark with strong baseline to advance perceptual-level multimodal image understanding.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.

[101] Contrastive Graph Modeling for Cross-Domain Few-Shot Medical Image Segmentation

Yuntian Bo, Tao Zhou, Zechao Li, Haofeng Zhang, Ling Shao

Main category: cs.CV

TL;DR: C-Graph: Contrastive Graph Modeling framework for cross-domain few-shot medical image segmentation that leverages structural consistency as domain-transferable prior, outperforming existing methods while preserving source-domain accuracy.

DetailsMotivation: Existing cross-domain few-shot medical image segmentation methods filter out domain-specific information to improve generalization, but this limits cross-domain performance and degrades source-domain accuracy. There's a need for a method that can effectively transfer knowledge across domains without sacrificing performance on either domain.

Method: Proposes Contrastive Graph Modeling (C-Graph) framework: 1) Represents image features as graphs with pixels as nodes and semantic affinities as edges; 2) Structural Prior Graph (SPG) layer captures and transfers target-category node dependencies; 3) Subgraph Matching Decoding (SMD) mechanism exploits semantic relations among nodes to guide prediction; 4) Confusion-minimizing Node Contrast (CNC) loss mitigates node ambiguity and subgraph heterogeneity through contrastive learning.
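
The graph construction and the contrastive node objective are straightforward to prototype. Below is a minimal PyTorch sketch, not the authors' code: pixels of a feature map become nodes, cosine similarities become edges, and an InfoNCE-style loss pulls same-class nodes together, loosely in the spirit of the CNC loss; shapes and the temperature are assumptions.

```python
# Minimal sketch (not the authors' code): pixel-level graph from a CNN feature
# map plus an InfoNCE-style node contrast, loosely in the spirit of C-Graph.
import torch
import torch.nn.functional as F

def build_graph(feats):                      # feats: (B, C, H, W)
    B, C, H, W = feats.shape
    nodes = feats.flatten(2).transpose(1, 2) # (B, HW, C), pixels as nodes
    nodes = F.normalize(nodes, dim=-1)
    edges = nodes @ nodes.transpose(1, 2)    # (B, HW, HW) semantic affinities
    return nodes, edges

def node_contrast_loss(nodes, labels, tau=0.1):
    # nodes: (N, C) normalized; labels: (N,) per-pixel class ids.
    sim = nodes @ nodes.T / tau              # pairwise similarities
    pos = labels[:, None] == labels[None, :] # same-class pairs are positives
    pos.fill_diagonal_(False)
    logit = sim - sim.max(dim=1, keepdim=True).values
    log_prob = logit - torch.logsumexp(logit, dim=1, keepdim=True)
    denom = pos.sum(1).clamp(min=1)          # avoid division by zero
    return -(log_prob * pos).sum(1).div(denom).mean()

feats = torch.randn(2, 64, 16, 16)
nodes, edges = build_graph(feats)
loss = node_contrast_loss(nodes[0], torch.randint(0, 3, (256,)))
```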

Result: Significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.

Conclusion: C-Graph effectively leverages structural consistency of medical images as a reliable domain-transferable prior, addressing limitations of existing methods by maintaining both cross-domain performance and source-domain accuracy through graph-based modeling and contrastive learning techniques.

Abstract: Cross-domain few-shot medical image segmentation (CD-FSMIS) offers a promising and data-efficient solution for medical applications where annotations are severely scarce and multimodal analysis is required. However, existing methods typically filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. To address this, we present Contrastive Graph Modeling (C-Graph), a framework that leverages the structural consistency of medical images as a reliable domain-transferable prior. We represent image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph (SPG) layer is proposed to capture and transfer target-category node dependencies and enable global structure modeling through explicit node interactions. Building upon SPG layers, we introduce a Subgraph Matching Decoding (SMD) mechanism that exploits semantic relations among nodes to guide prediction. Furthermore, we design a Confusion-minimizing Node Contrast (CNC) loss to mitigate node ambiguity and subgraph heterogeneity by contrastively enhancing node discriminability in the graph space. Our method significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.

[102] BeHGAN: Bengali Handwritten Word Generation from Plain Text Using Generative Adversarial Networks

Md. Rakibul Islam, Md. Kamrozzaman Bhuiyan, Safwan Muntasir, Arifur Rahman Jawad, Most. Sharmin Sultana Samu

Main category: cs.CV

TL;DR: Proposes a method for generating Bengali handwritten words using a self-collected dataset from ~500 individuals, addressing the lack of Bengali HTG research despite it being the 5th most spoken language.

DetailsMotivation: Handwritten Text Generation (HTG) is an emerging field with significant potential, but challenging due to handwriting style variations. While HTR is well-established, Bengali HTG has received little attention despite Bengali being the 5th most spoken language. Large diverse datasets needed for realistic generation are difficult to collect and not readily available for Bengali.

Method: Developed and used a self-collected dataset of Bengali handwriting samples from approximately 500 individuals across different ages and genders. All images were pre-processed for consistency and quality. Proposed a method for generating Bengali handwritten words from input plain text.

Result: The approach demonstrates the ability to produce diverse handwritten outputs from input plain text. The work contributes to advancing Bengali handwriting generation and can support further research in this area.

Conclusion: This work addresses the research gap in Bengali handwritten text generation by proposing a method using a self-collected diverse dataset, contributing to the advancement of Bengali HTG and supporting future research in this emerging field.

Abstract: Handwritten Text Recognition (HTR) is a well-established research area. In contrast, Handwritten Text Generation (HTG) is an emerging field with significant potential. This task is challenging due to the variation in individual handwriting styles. A large and diverse dataset is required to generate realistic handwritten text. However, such datasets are difficult to collect and are not readily available. Bengali is the fifth most spoken language in the world. While several studies exist for languages such as English and Arabic, Bengali handwritten text generation has received little attention. To address this gap, we propose a method for generating Bengali handwritten words. We developed and used a self-collected dataset of Bengali handwriting samples. The dataset includes contributions from approximately five hundred individuals across different ages and genders. All images were pre-processed to ensure consistency and quality. Our approach demonstrates the ability to produce diverse handwritten outputs from input plain text. We believe this work contributes to the advancement of Bengali handwriting generation and can support further research in this area.

[103] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration

Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang

Main category: cs.CV

TL;DR: SlideChain: A blockchain framework for verifiable integrity of multimodal semantic extraction from educational slides, addressing VLM inconsistency issues in STEM domains.

DetailsMotivation: VLMs are increasingly used for educational content but suffer from inconsistencies across models, settings, and environments, undermining reliability in high-stakes STEM domains where verifiable outputs are crucial.

Method: Introduces SlideChain - a blockchain-backed provenance framework that extracts concepts and relational triples from medical imaging lecture slides using 4 state-of-the-art VLMs, constructs structured provenance records, and anchors cryptographic hashes on a local EVM-compatible blockchain.
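
The provenance-anchoring step reduces to deterministic serialization plus hashing; the chain only stores digests. A minimal sketch follows, with the on-chain write stubbed out since the paper's contract interface isn't given; identifiers are hypothetical.

```python
# Minimal sketch (assumptions throughout): serialize a per-slide provenance
# record deterministically, hash it, and later re-verify it. The on-chain step
# is stubbed; SlideChain anchors such hashes on an EVM-compatible chain.
import hashlib, json

def record_hash(record: dict) -> str:
    blob = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

record = {
    "slide_id": "lecture03_slide17",               # hypothetical identifier
    "model": "vlm-A",                              # which VLM produced the output
    "concepts": ["CT", "attenuation"],
    "triples": [["CT", "measures", "attenuation"]],
}
anchored = record_hash(record)                     # this digest would go on-chain

# Tamper detection: any edit to the record changes the digest.
record["concepts"].append("MRI")
assert record_hash(record) != anchored
```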

Result: Reveals pronounced cross-model discrepancies (low concept overlap, near-zero agreement in relational triples), demonstrates perfect tamper detection, deterministic reproducibility, and evaluates gas usage/throughput/scalability under simulated deployment.

Conclusion: SlideChain provides a practical, scalable solution for trustworthy, verifiable multimodal educational pipelines with long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.

Abstract: Modern vision–language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.

[104] Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective

Huan Li, Longjun Luo, Yuling Shi, Xiaodong Gu

Main category: cs.CV

TL;DR: VGGT’s global self-attention suffers from rank collapse with long sequences; analysis shows it’s a degenerate diffusion process converging to Dirac measure at O(1/L) rate, explaining token-merging remedy.

DetailsMotivation: To understand why VGGT's global self-attention layer experiences drastic collapse when processing long sequences (hundreds of frames), where attention matrices become near rank-one and reconstruction error accumulates super-linearly.

Method: Mathematical analysis viewing global-attention iteration as a degenerate diffusion process, proving token-feature flow converges toward Dirac-type measure at O(1/L) rate, deriving closed-form mean-field PDE that predicts rank profile.
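
The collapse mechanism is easy to see numerically: iterating any fixed row-stochastic (attention-like) matrix averages token features toward a single point, so the feature matrix approaches rank one. The demo below is an illustration of that picture, not the paper's experiment.

```python
# Numerical illustration (not from the paper): iterating a fixed row-stochastic
# attention matrix averages token features toward a single point, so the token
# matrix collapses toward rank one -- the degenerate-diffusion picture.
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 128, 32, 64                      # tokens, feature dim, "layers"
A = rng.random((T, T))
A /= A.sum(axis=1, keepdims=True)          # row-stochastic attention matrix
X = rng.standard_normal((T, d))

def effective_rank(M, tol=1e-6):
    s = np.linalg.svd(M, compute_uv=False)
    return int((s / s[0] > tol).sum())

for layer in range(L):
    X = A @ X                              # one global-attention smoothing step
print(effective_rank(X))                   # -> 1 for large L
```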

Result: Theory quantitatively matches attention-heat-map evolution and experimental outcomes, explains why token-merging (removing redundant tokens) slows diffusion coefficient and delays collapse without additional training.

Conclusion: Analysis provides principled framework for interpreting scalable 3D-vision transformers, highlights potential for multi-modal generalization, and explains collapse phenomenon in VGGT’s attention mechanism.

Abstract: Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at an $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and experimental outcomes reported in related works, and explains why its token-merging remedy, which periodically removes redundant tokens, slows the effective diffusion coefficient and thereby delays collapse without additional training. We believe the analysis provides a principled lens for interpreting future scalable 3D-vision transformers, and we highlight its potential for multi-modal generalization.

[105] ShinyNeRF: Digitizing Anisotropic Appearance in Neural Radiance Fields

Albert Barreiro, Roger Marí, Rafael Redondo, Gloria Haro, Carles Bosch

Main category: cs.CV

TL;DR: ShinyNeRF is a novel framework that improves 3D digitization of cultural heritage by accurately modeling both isotropic and anisotropic specular reflections, addressing limitations of existing NeRF methods with brushed metals and similar surfaces.

DetailsMotivation: Existing NeRF methods struggle with accurately modeling anisotropic specular surfaces like brushed metals, which is crucial for high-quality 3D digitization and preservation of cultural heritage artifacts with complex material properties.

Method: ShinyNeRF jointly estimates surface normals, tangents, specular concentration, and anisotropy magnitudes using an Anisotropic Spherical Gaussian (ASG) distribution, approximating outgoing radiance as an encoded mixture of isotropic von Mises-Fisher distributions.
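
The vMF primitive is compact enough to show directly. Below is a minimal sketch, with made-up lobe weights and axes, of evaluating a mixture of isotropic von Mises-Fisher lobes on the sphere; the actual model learns an encoded mixture per spatial location.

```python
# Minimal sketch (assumed parameterization): evaluate a mixture of isotropic
# von Mises-Fisher (vMF) lobes on the sphere, the primitive ShinyNeRF uses to
# approximate outgoing radiance. Weights and lobe axes here are made up.
import numpy as np

def vmf_pdf(w, mu, kappa):
    # w, mu: unit 3-vectors; isotropic vMF density on the sphere S^2.
    return kappa / (4 * np.pi * np.sinh(kappa)) * np.exp(kappa * np.dot(mu, w))

def radiance(w, lobes):
    # lobes: list of (weight, axis, kappa); a crude stand-in for the learned
    # mixture (per-color weights omitted for brevity).
    return sum(a * vmf_pdf(w, mu, k) for a, mu, k in lobes)

z = np.array([0.0, 0.0, 1.0])
lobes = [(0.8, z, 50.0),                        # sharp specular lobe
         (0.2, np.array([0.0, 1.0, 0.0]), 2.0)] # broad diffuse-ish lobe
print(radiance(z, lobes))
```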

Result: ShinyNeRF achieves state-of-the-art performance in digitizing anisotropic specular reflections while providing plausible physical interpretations and enabling material property editing capabilities.

Conclusion: ShinyNeRF represents a significant advancement in 3D digitization technology, offering improved handling of complex material reflections for cultural heritage preservation and enabling new editing capabilities for material properties.

Abstract: Recent advances in digitization technologies have transformed the preservation and dissemination of cultural heritage. In this vein, Neural Radiance Fields (NeRF) have emerged as a leading technology for 3D digitization, delivering representations with exceptional realism. However, existing methods struggle to accurately model anisotropic specular surfaces, typically observed, for example, on brushed metals. In this work, we introduce ShinyNeRF, a novel framework capable of handling both isotropic and anisotropic reflections. Our method is capable of jointly estimating surface normals, tangents, specular concentration, and anisotropy magnitudes of an Anisotropic Spherical Gaussian (ASG) distribution, by learning an approximation of the outgoing radiance as an encoded mixture of isotropic von Mises-Fisher (vMF) distributions. Experimental results show that ShinyNeRF not only achieves state-of-the-art performance on digitizing anisotropic specular reflections, but also offers plausible physical interpretations and editing of material properties compared to existing methods.

[106] Prior-AttUNet: Retinal OCT Fluid Segmentation Based on Normal Anatomical Priors and Attention Gating

Li Yang, Yuting Liu

Main category: cs.CV

TL;DR: Prior-AttUNet: A hybrid dual-path segmentation model using generative anatomical priors and triple-attention mechanism for accurate macular edema segmentation in OCT images across multiple devices.

DetailsMotivation: Accurate segmentation of macular edema is essential for clinical diagnosis and management of vision-threatening conditions like age-related macular degeneration and diabetic macular edema. Challenges include ambiguous boundaries and cross-device heterogeneity in OCT images.

Method: Introduces Prior-AttUNet with hybrid dual-path architecture: 1) Variational autoencoder provides multi-scale normative anatomical priors, 2) Segmentation backbone with densely connected blocks and spatial pyramid pooling, 3) Novel triple-attention mechanism guided by anatomical priors to dynamically modulate feature importance across decoding stages.
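
An attention gate driven by an external prior can be sketched in a few lines. The module below is an interpretation, not the paper's exact design: a VAE-derived prior feature map produces a per-pixel gate that modulates decoder features; channel sizes are placeholders.

```python
# Minimal sketch (not the paper's exact module): an attention gate in which a
# VAE-derived anatomical prior modulates decoder features, in the spirit of
# Prior-AttUNet's prior-guided gating. Channel sizes are placeholders.
import torch
import torch.nn as nn

class PriorAttentionGate(nn.Module):
    def __init__(self, dec_ch, prior_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(dec_ch, inter_ch, 1)    # project decoder feats
        self.phi = nn.Conv2d(prior_ch, inter_ch, 1)    # project prior feats
        self.psi = nn.Conv2d(inter_ch, 1, 1)           # scalar gate per pixel

    def forward(self, dec, prior):
        g = torch.sigmoid(self.psi(torch.relu(self.theta(dec) + self.phi(prior))))
        return dec * g                                 # suppress off-prior regions

gate = PriorAttentionGate(dec_ch=64, prior_ch=32, inter_ch=16)
out = gate(torch.randn(1, 64, 48, 48), torch.randn(1, 32, 48, 48))
```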

Result: Achieved excellent performance on RETOUCH benchmark across three OCT devices: mean Dice similarity coefficients of 93.93% (Cirrus), 95.18% (Spectralis), and 93.47% (Topcon). Maintains low computational cost of 0.37 TFLOPs.

Conclusion: Prior-AttUNet demonstrates potential as a reliable tool for automated clinical analysis, effectively balancing segmentation precision and inference efficiency while handling cross-device heterogeneity in OCT imaging.

Abstract: Accurate segmentation of macular edema, a hallmark pathological feature in vision-threatening conditions such as age-related macular degeneration and diabetic macular edema, is essential for clinical diagnosis and management. To overcome the challenges of segmenting fluid regions in optical coherence tomography (OCT) images-notably ambiguous boundaries and cross-device heterogeneity-this study introduces Prior-AttUNet, a segmentation model augmented with generative anatomical priors. The framework adopts a hybrid dual-path architecture that integrates a generative prior pathway with a segmentation network. A variational autoencoder supplies multi-scale normative anatomical priors, while the segmentation backbone incorporates densely connected blocks and spatial pyramid pooling modules to capture richer contextual information. Additionally, a novel triple-attention mechanism, guided by anatomical priors, dynamically modulates feature importance across decoding stages, substantially enhancing boundary delineation. Evaluated on the public RETOUCH benchmark, Prior-AttUNet achieves excellent performance across three OCT imaging devices (Cirrus, Spectralis, and Topcon), with mean Dice similarity coefficients of 93.93%, 95.18%, and 93.47%, respectively. The model maintains a low computational cost of 0.37 TFLOPs, striking an effective balance between segmentation precision and inference efficiency. These results demonstrate its potential as a reliable tool for automated clinical analysis.

[107] A-QCF-Net: An Adaptive Quaternion Cross-Fusion Network for Multimodal Liver Tumor Segmentation from Unpaired Datasets

Arunkumar V, Firos V M, Senthilkumar S, Gangadharan G R

Main category: cs.CV

TL;DR: A-QCF-Net learns unified segmentation from unpaired CT/MRI using quaternion networks and adaptive cross-fusion for bidirectional knowledge transfer, outperforming unimodal baselines.

DetailsMotivation: Multimodal medical imaging provides complementary information but deep learning is limited by scarcity of large paired/aligned datasets. Need to leverage separate unpaired CT and MRI cohorts that are common in healthcare archives.

Method: Proposes Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) with quaternion neural networks for parameter efficiency and expressive shared feature space. Core is Adaptive Quaternion Cross-Fusion (A-QCF) block - a data-driven attention module enabling bidirectional knowledge transfer between CT and MRI streams, dynamically modulating information flow to exchange modality-specific expertise.
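
The quaternion building block underlying the shared feature space is a linear layer that applies the Hamilton product, reusing four weight matrices across all four components. A minimal sketch (assumed layer, not the released code) is shown below; feature sizes are placeholders.

```python
# Minimal sketch (assumed layer, not the authors' code): a quaternion linear
# layer via the Hamilton product, the parameter-sharing primitive behind
# Quaternion Neural Networks such as A-QCF-Net.
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    def __init__(self, in_q, out_q):                   # sizes per component
        super().__init__()
        mk = lambda: nn.Parameter(0.1 * torch.randn(out_q, in_q))
        self.wr, self.wx, self.wy, self.wz = mk(), mk(), mk(), mk()

    def forward(self, q):                              # q: (..., 4 * in_q)
        r, x, y, z = q.chunk(4, dim=-1)
        lin = lambda w, v: v @ w.T
        # Hamilton product of the weight quaternion with the input quaternion.
        out_r = lin(self.wr, r) - lin(self.wx, x) - lin(self.wy, y) - lin(self.wz, z)
        out_x = lin(self.wr, x) + lin(self.wx, r) + lin(self.wy, z) - lin(self.wz, y)
        out_y = lin(self.wr, y) - lin(self.wx, z) + lin(self.wy, r) + lin(self.wz, x)
        out_z = lin(self.wr, z) + lin(self.wx, y) - lin(self.wy, x) + lin(self.wz, r)
        return torch.cat([out_r, out_x, out_y, out_z], dim=-1)

layer = QuaternionLinear(in_q=16, out_q=8)             # ~4x fewer free weights
out = layer(torch.randn(2, 64))                        # -> (2, 32)
```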

Result: Joint training on unpaired LiTS (CT) and ATLAS (MRI) datasets achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, exceeding unimodal nnU-Net baseline by 5.4% and 4.7% respectively. Explainability analysis with Grad-CAM/Grad-CAM++ confirms model focuses on relevant pathological structures.

Conclusion: Provides robust clinically viable paradigm for leveraging large unpaired imaging archives common in healthcare, enabling unified segmentation model learning from completely separate CT and MRI cohorts through adaptive cross-fusion.

Abstract: Multimodal medical imaging provides complementary information that is crucial for accurate delineation of pathology, but the development of deep learning models is limited by the scarcity of large datasets in which different modalities are paired and spatially aligned. This paper addresses this fundamental limitation by proposing an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified segmentation model from completely separate and unpaired CT and MRI cohorts. The architecture exploits the parameter efficiency and expressive power of Quaternion Neural Networks to construct a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data-driven attention module that enables bidirectional knowledge transfer between the two streams. By learning to modulate the flow of information dynamically, the A-QCF block allows the network to exchange abstract modality-specific expertise, such as the sharp anatomical boundary information available in CT and the subtle soft tissue contrast provided by MRI. This mutual exchange regularizes and enriches the feature representations of both streams. We validate the framework by jointly training a single model on the unpaired LiTS (CT) and ATLAS (MRI) datasets. The jointly trained model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, significantly exceeding the strong unimodal nnU-Net baseline by margins of 5.4% and 4.7% respectively. Furthermore, comprehensive explainability analysis using Grad-CAM and Grad-CAM++ confirms that the model correctly focuses on relevant pathological structures, ensuring the learned representations are clinically meaningful. This provides a robust and clinically viable paradigm for unlocking the large unpaired imaging archives that are common in healthcare.

[108] FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection

Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman, Md. Rakibul Islam

Main category: cs.CV

TL;DR: FUSE: A hybrid AI-generated image detection system combining spectral (FFT) and semantic (CLIP) features with progressive two-stage training, achieving state-of-the-art performance across multiple datasets.

DetailsMotivation: The rapid advancement of generative AI models has created an urgent need for reliable detection of AI-generated images, as existing methods often fail on high-fidelity images and lack generalization across different generators.

Method: Hybrid system combining spectral features from Fast Fourier Transform with semantic features from CLIP’s Vision encoder. Features are fused into joint representation and trained progressively in two stages.
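
The fusion itself is simple to prototype: a radially averaged FFT log-magnitude spectrum concatenated with a semantic embedding. In the sketch below, `semantic_embed` is a random stand-in for a CLIP vision encoder, and the bin count is an assumption.

```python
# Minimal sketch (structure assumed): fuse a radial FFT-magnitude descriptor
# with a semantic embedding for real/fake classification, echoing FUSE's
# spectral + CLIP design. `semantic_embed` stands in for a CLIP encoder.
import numpy as np

def spectral_features(img, n_bins=32):
    # img: (H, W) grayscale; radially averaged log-magnitude spectrum.
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.log1p(np.abs(f))
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    return np.array([mag[bins == b].mean() for b in range(n_bins)])

def semantic_embed(img):
    return np.random.randn(512)            # placeholder for CLIP features

img = np.random.rand(224, 224)
fused = np.concatenate([spectral_features(img), semantic_embed(img)])
# `fused` would feed a small classifier head trained in two stages.
```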

Result: State-of-the-art results on Chameleon benchmark, 91.36% mean accuracy on GenImage, 88.71% accuracy across all tested generators, and 94.96% mean Average Precision. Stage 2 training further improves performance for most generators.

Conclusion: Integrating spectral and semantic features enables robust, generalized detection of AI-generated images across diverse generators, outperforming existing methods particularly on high-fidelity images.

Abstract: The fast evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through Fast Fourier Transform with semantic features obtained from the CLIP’s Vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on GenImage, WildFake, DiTFake, GPT-ImgEval and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model demonstrates state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of images generated by AI.

[109] Inference-based GAN Video Generation

Jingbo Yang, Adrian G. Bors

Main category: cs.CV

TL;DR: A novel VAE-GAN hybrid video generator with Markov chain recall mechanism for generating long, temporally consistent videos of hundreds/thousands of frames.

DetailsMotivation: Existing video generation models (GANs, VAEs, Diffusion Networks) struggle with temporal scaling - they can only generate short sequences (up to 16 frames) and quality degrades when trying to generate longer videos. There's a need for models that can generate meaningful, coherent long video sequences while maintaining quality.

Method: 1. Propose a VAE-GAN hybrid structure with variational encoder for inference capabilities. 2. Use two processing branches (content and movement). 3. Extend with Markov chain framework with recall mechanism, where each state represents a VAE-GAN short-length video generator. 4. Sequentially connect generated video sub-sequences to enable temporal dependencies.

Result: The approach enables generation of long videos composed of hundreds or thousands of frames while ensuring temporal continuity, consistency, and dynamics - overcoming the temporal scaling limitations of existing methods.

Conclusion: The proposed memory-efficient Markov chain approach with VAE-GAN generators successfully addresses the challenge of generating high-quality, temporally consistent long video sequences, representing a significant advancement in video generation capabilities.

Abstract: Video generation has seen remarkable progress thanks to advancements in generative deep learning. Generated videos should display not only coherent and continuous movement but also meaningful movement across successions of scenes. Generative models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), and more recently Diffusion Networks, have been used for generating short video sequences, usually of up to 16 frames. In this paper, we first propose a new type of video generator by equipping adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure, in order to endow the generation process with inference capabilities. The proposed model, as in other deep learning-based video processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. In classical approaches, the resulting video quality degrades as the generated video length increases, particularly for significantly long sequences. To overcome this limitation, our research extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames while ensuring their temporal continuity, consistency, and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, with each state representing a VAE-GAN short-length video generator. This setup allows for the sequential connection of generated video sub-sequences, enabling temporal dependencies and resulting in meaningful long video sequences.

[110] Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction

Zheng Yin, Chengjian Li, Xiangbo Shu, Meiqi Cao, Rui Yan, Jinhui Tang

Main category: cs.CV

TL;DR: ST-MoE: A spatiotemporal mixture-of-experts model for multi-person motion prediction that uses bidirectional Mamba experts to capture complex dependencies while reducing parameters by 41.38% and achieving 3.6x training speedup.

DetailsMotivation: Existing methods have two main limitations: 1) Inflexible spatiotemporal representation due to reliance on positional encodings, and 2) High computational costs from quadratic time complexity of conventional attention mechanisms.

Method: Proposes Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE) with four distinct spatiotemporal experts, each specializing in different spatial or temporal dependencies. Uses bidirectional spatiotemporal Mamba as experts, sharing bidirectional temporal and spatial Mamba in distinct combinations for efficiency.
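
The dispatch pattern is standard soft mixture-of-experts routing; the novelty is in what the experts are. The sketch below uses `nn.Linear` stand-ins where the paper uses bidirectional spatial/temporal Mamba blocks.

```python
# Minimal sketch (router and experts are stand-ins): soft mixture-of-experts
# dispatch over four spatiotemporal experts, as in ST-MoE; the real model uses
# bidirectional spatial/temporal Mamba blocks where `nn.Linear` appears here.
import torch
import torch.nn as nn

class TinySTMoE(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                                    # x: (B, tokens, dim)
        gates = torch.softmax(self.router(x), dim=-1)        # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1) # (B, T, D, E)
        return (outs * gates.unsqueeze(2)).sum(-1)           # gated combination

moe = TinySTMoE(dim=64)
y = moe(torch.randn(2, 50, 64))
```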

Result: Outperforms state-of-the-art in accuracy on four multi-person benchmark datasets, reduces model parameters by 41.38%, and achieves 3.6x speedup in training.

Conclusion: ST-MoE effectively captures complex spatio-temporal dependencies in human motion while significantly improving computational efficiency through the mixture-of-experts approach with bidirectional Mamba architecture.

Abstract: Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: i) Inflexible spatiotemporal representation due to reliance on positional encodings for capturing spatiotemporal information. ii) High computational costs stemming from the quadratic time complexity of conventional attention mechanisms. To overcome these limitations, we propose the Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE), which flexibly explores complex spatio-temporal dependencies in human motion and significantly reduces computational cost. To adaptively mine complex spatio-temporal patterns from human motion, our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies. To reduce the potential computational overhead while integrating multiple experts, we introduce bidirectional spatiotemporal Mamba as experts, each sharing bidirectional temporal and spatial Mamba in distinct combinations to achieve model efficiency and parameter economy. Extensive experiments on four multi-person benchmark datasets demonstrate that our approach not only outperforms the state of the art in accuracy but also reduces model parameters by 41.38% and achieves a 3.6x speedup in training. The code is available at https://github.com/alanyz106/ST-MoE.

[111] InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

Jinqi Xiao, Qing Yan, Liming Jiang, Zichuan Liu, Hao Kang, Shen Sang, Tiancheng Zhi, Jing Liu, Cheng Yang, Xin Lu, Bo Yuan

Main category: cs.CV

TL;DR: InstructMoLE: A novel framework using Instruction-Guided Mixture of Low-Rank Experts for parameter-efficient fine-tuning of Diffusion Transformers, addressing task interference and semantic drift through global routing based on user instructions.

DetailsMotivation: Existing parameter-efficient fine-tuning methods for Diffusion Transformers (DiTs) suffer from task interference when using monolithic adapters like LoRA. Mixture of Low-rank Experts (MoLE) architectures have potential but are limited by token-level routing that conflicts with global user instructions, causing spatial fragmentation and semantic drift in complex image generation tasks.

Method: InstructMoLE introduces Instruction-Guided Routing (IGR) that uses global routing signals derived from comprehensive user instructions to select a single, coherent expert council applied uniformly across all input tokens. Additionally, an output-space orthogonality loss is introduced to promote expert functional diversity and prevent representational collapse.
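
The two ingredients, per-sample global routing and output-space orthogonality, can be sketched compactly. Everything below is a hedged approximation with hypothetical dimensions: one gate vector per sample (from a pooled instruction embedding) reused for every token, plus a penalty on the Gram matrix of expert outputs.

```python
# Minimal sketch (all names hypothetical): one global routing decision per
# sample from an instruction embedding, applied uniformly to all tokens, plus
# an output-space orthogonality penalty between experts, as InstructMoLE does.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_experts = 64, 4
router = nn.Linear(dim, n_experts)
experts = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(n_experts))

tokens = torch.randn(2, 100, dim)             # all tokens of a sample
instr = torch.randn(2, dim)                   # pooled instruction embedding

gates = torch.softmax(router(instr), dim=-1)  # (B, E): one decision per sample
outs = torch.stack([e(tokens) for e in experts], dim=-1)  # (B, T, D, E)
y = (outs * gates[:, None, None, :]).sum(-1)  # same expert mix for every token

# Orthogonality loss: discourage experts from producing collinear outputs.
flat = F.normalize(outs.flatten(0, 2).T, dim=-1)          # (E, B*T*D)
gram = flat @ flat.T
ortho_loss = (gram - torch.eye(n_experts)).pow(2).sum()
```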

Result: Extensive experiments show InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks, demonstrating superior compositional control and fidelity to user intent.

Conclusion: InstructMoLE presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, effectively addressing limitations of previous methods by preserving global semantics and structural integrity through instruction-guided global routing.

Abstract: Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user’s comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.

[112] RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention

Zhan Chen, Zile Guo, Enze Zhu, Peirong Zhang, Xiaoxuan Liu, Lei Wang, Yidan Zhang

Main category: cs.CV

TL;DR: RAPTOR is a real-time high-resolution video prediction architecture that breaks the traditional trade-off between quality and speed using efficient spatiotemporal factorization.

DetailsMotivation: Video prediction faces a fundamental trilemma: high-resolution and perceptual quality come at the cost of real-time speed, which is critical for latency-sensitive applications like autonomous UAV navigation in dense urban environments where safety depends on foreseeing events from high-resolution imagery.

Method: RAPTOR uses a single-pass design to avoid iterative generation errors and latency. Its core innovation is Efficient Video Attention (EVA), a translator module that factorizes spatiotemporal modeling by alternating operations along spatial (S) and temporal (T) axes, reducing complexity from O((ST)²) to O(S+T) and memory to O(max(S,T)). This enables global context modeling at 512²+ resolution with patch-free design. A 3-stage training curriculum progressively refines predictions from coarse structure to sharp details.
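
The factorization amounts to axial attention: attend within each frame, then across time at each spatial location, so cost grows with S + T rather than (S·T)². The sketch below is a generic axial block, not RAPTOR's exact EVA module.

```python
# Minimal sketch (a generic axial factorization, not RAPTOR's exact EVA
# module): alternate attention along the spatial axis and the temporal axis
# instead of attending over all S*T tokens jointly.
import torch
import torch.nn as nn

class AxialSTAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, S, D)
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)            # attend within each frame
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        xt, _ = self.temporal(x, x, x)         # attend across time per location
        return xt.reshape(B, S, T, D).transpose(1, 2)

attn = AxialSTAttention(dim=64)
y = attn(torch.randn(1, 8, 256, 64))           # 8 frames, 16x16 locations
```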

Result: RAPTOR is the first predictor to exceed 30 FPS on Jetson AGX Orin for 512² video, achieving state-of-the-art on UAVid, KTH, and custom high-resolution datasets in PSNR, SSIM, and LPIPS metrics. Critically, it boosts mission success rate in real-world UAV navigation by 18%.

Conclusion: RAPTOR breaks the long-standing trade-off between video prediction quality and speed, enabling real-time high-resolution prediction that significantly improves safety and performance for latency-critical applications like autonomous UAV navigation, paving the way for safer anticipatory embodied agents.

Abstract: Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR’s single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.

[113] AstraNav-World: World Model for Foresight Control and Consistency

Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, Xiaolong Wu, Mu Xu, Shanghang Zhang

Main category: cs.CV

TL;DR: AstraNav-World: An end-to-end world model that jointly predicts future visual states and action sequences in a unified probabilistic framework for embodied navigation.

DetailsMotivation: Embodied navigation in open, dynamic environments requires accurate foresight of world evolution and action sequences. Current approaches often use decoupled "envision-then-plan" pipelines that suffer from cumulative errors and lack tight coupling between visual predictions and action planning.

Method: Integrates a diffusion-based video generator with a vision-language policy in a unified probabilistic framework. Uses synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals.

Result: Improved trajectory accuracy and higher success rates across diverse embodied navigation benchmarks. Demonstrates exceptional zero-shot capabilities in real-world testing, adapting to previously unseen scenarios without fine-tuning. Ablations confirm necessity of tight vision-action coupling and unified training.

Conclusion: AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics rather than overfitting to simulation data. Unifying foresight vision and control within a single generative model moves toward reliable, interpretable, and general-purpose embodied agents for open-ended real-world settings.

Abstract: Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled “envision-then-plan” pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.

[114] CellMamba: Adaptive Mamba for Accurate and Efficient Cell Detection

Ruochen Liu, Yi Tian, Jiahao Wang, Hongbin Liu, Xianxu Hou, Jingxin Liu

Main category: cs.CV

TL;DR: CellMamba is a lightweight one-stage detector for biomedical cell detection that combines Mamba blocks with a novel Triple-Mapping Adaptive Coupling module, achieving superior accuracy with reduced model size and faster inference compared to CNN, Transformer, and Mamba baselines.

DetailsMotivation: Cell detection in pathological images faces challenges including densely packed objects, subtle inter-class differences, and severe background clutter, requiring specialized solutions beyond general object detection methods.

Method: Built on VSSD backbone with CellMamba Blocks that couple either NC-Mamba or Multi-Head Self-Attention with Triple-Mapping Adaptive Coupling (TMAC) module. TMAC splits channels into parallel branches with dual idiosyncratic and one consensus attention maps. Also includes Adaptive Mamba Head for multi-scale feature fusion.
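
TMAC's split-and-fuse pattern can be approximated as follows; this is one interpretation, not the released module: each channel branch gets its own spatial attention map, a consensus map is computed from the full features, and learnable weights blend them.

```python
# Minimal sketch (an interpretation, not the released module): split channels
# into two branches, give each its own ("idiosyncratic") spatial attention map
# plus one shared ("consensus") map, and fuse adaptively -- the TMAC idea.
import torch
import torch.nn as nn

class TMACSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.att1 = nn.Conv2d(half, 1, 7, padding=3)   # idiosyncratic map 1
        self.att2 = nn.Conv2d(half, 1, 7, padding=3)   # idiosyncratic map 2
        self.att_c = nn.Conv2d(ch, 1, 7, padding=3)    # consensus map
        self.alpha = nn.Parameter(torch.zeros(2))      # adaptive fusion weights

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        c = torch.sigmoid(self.att_c(x))
        w = torch.sigmoid(self.alpha)
        m1 = w[0] * torch.sigmoid(self.att1(x1)) + (1 - w[0]) * c
        m2 = w[1] * torch.sigmoid(self.att2(x2)) + (1 - w[1]) * c
        return torch.cat([x1 * m1, x2 * m2], dim=1)

block = TMACSketch(ch=64)
y = block(torch.randn(1, 64, 32, 32))
```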

Result: Outperforms CNN-based, Transformer-based, and Mamba-based baselines on CoNSeP and CytoDArk0 datasets in accuracy while significantly reducing model size and inference latency.

Conclusion: CellMamba provides an efficient and effective solution for high-resolution cell detection, offering superior performance with lightweight architecture suitable for biomedical applications.

Abstract: Cell detection in pathological images presents unique challenges due to densely packed objects, subtle inter-class differences, and severe background clutter. In this paper, we propose CellMamba, a lightweight and accurate one-stage detector tailored for fine-grained biomedical instance detection. Built upon a VSSD backbone, CellMamba integrates CellMamba Blocks, which couple either NC-Mamba or Multi-Head Self-Attention (MSA) with a novel Triple-Mapping Adaptive Coupling (TMAC) module. TMAC enhances spatial discriminability by splitting channels into two parallel branches, equipped with dual idiosyncratic and one consensus attention map, adaptively fused to preserve local sensitivity and global consistency. Furthermore, we design an Adaptive Mamba Head that fuses multi-scale features via learnable weights for robust detection under varying object sizes. Extensive experiments on two public datasets-CoNSeP and CytoDArk0-demonstrate that CellMamba outperforms both CNN-based, Transformer-based, and Mamba-based baselines in accuracy, while significantly reducing model size and inference latency. Our results validate CellMamba as an efficient and effective solution for high-resolution cell detection.

[115] Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Steven Xiao, Xindi Zhang, Dechao Meng, Qi Wang, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: Knot Forcing is a streaming framework for real-time portrait animation that addresses error accumulation and motion discontinuities in autoregressive models through chunk-wise generation with identity preservation, temporal knot smoothing, and dynamic reference updating.

DetailsMotivation: Real-time portrait animation needs high visual quality, temporal coherence, ultra-low latency, and responsive control for interactive applications like virtual assistants. Existing diffusion models aren't causal enough for streaming, while autoregressive approaches suffer from error accumulation and motion discontinuities at chunk boundaries.

Method: Three key designs: 1) Chunk-wise generation with global identity preservation via cached KV states from reference images and local temporal modeling using sliding window attention; 2) Temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth transitions; 3) “Running ahead” mechanism that dynamically updates the reference frame’s temporal coordinate during inference to maintain semantic context ahead of current rollout.
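
The inter-chunk stitching can be illustrated with a stub generator: adjacent chunks share a few "knot" frames that are regenerated and cross-faded into the previous chunk's tail. Everything below, including the toy generator, is an assumption-laden sketch of that mechanism.

```python
# Minimal sketch (generator is a stub): chunk-wise rollout in which adjacent
# chunks overlap by a few "knot" frames that are cross-faded into the previous
# chunk's tail, the basic stitching idea behind the temporal knot module.
import numpy as np

def gen_chunk(cond_frames, length=16):
    # Stand-in for the autoregressive video generator, conditioned on the
    # overlapping frames of the previous chunk.
    last = cond_frames[-1]
    return last + 0.01 * np.cumsum(np.random.randn(length, *last.shape), axis=0)

def rollout(first_frame, n_chunks=4, overlap=4):
    video = np.repeat(first_frame[None], overlap, axis=0)  # seed frames
    for _ in range(n_chunks):
        chunk = gen_chunk(video[-overlap:])                # regenerates the knot
        w = np.linspace(0.0, 1.0, overlap)[:, None, None]  # cross-fade weights
        video[-overlap:] = (1 - w) * video[-overlap:] + w * chunk[:overlap]
        video = np.concatenate([video, chunk[overlap:]])   # append new frames
    return video

frames = rollout(np.zeros((32, 32)))
```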

Result: Enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences with real-time performance and strong visual stability on consumer-grade GPUs.

Conclusion: Knot Forcing successfully addresses challenges in streaming portrait animation by combining chunk-wise generation with identity preservation, temporal smoothing mechanisms, and dynamic reference updating, achieving the requirements for real-time interactive applications.

Abstract: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a “running ahead” mechanism that dynamically updates the reference frame’s temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

[116] S&P 500 Stock’s Movement Prediction using CNN

Rahul Gupta

Main category: cs.CV

TL;DR: This paper applies Convolutional Neural Networks (CNN) to predict S&P 500 stock movements using multivariate raw market data including stock splits and dividends, treating historical data matrices as images for analysis.

DetailsMotivation: Traditional mathematical approaches dominate algorithmic trading, but recent neural network successes create opportunities for improved prediction. Most existing research uses simplified single-dimensional data without addressing real-world financial complexities like stock splits and dividends.

Method: Applies a CNN (typically used for image classification) to multivariate raw market data, including stock split/dividend events, treating windows of historical data matrices as images. Predictions can be made at the individual-stock, sector, or portfolio level.
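
The data framing is the interesting part: each training example is a trailing window of multivariate market data treated as a one-channel image. The sketch below uses synthetic data and an arbitrary toy CNN; window size and feature count are assumptions.

```python
# Minimal sketch (assumed framing): turn sliding windows of multivariate
# market data (e.g., OHLCV plus split/dividend flags) into image-like matrices
# and classify next-day movement with a small CNN. Data here is synthetic.
import torch
import torch.nn as nn

n_days, n_feats, window = 500, 7, 30
series = torch.randn(n_days, n_feats)                  # stand-in market data

# One "image" per day: the trailing `window` x `n_feats` matrix.
X = torch.stack([series[t - window:t] for t in range(window, n_days)])
y = (torch.randn(len(X)) > 0).long()                   # up/down labels (fake)

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
logits = model(X.unsqueeze(1))                         # (N, 1, 30, 7) -> (N, 2)
loss = nn.functional.cross_entropy(logits, y)
```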

Result: The model achieves promising results in predicting stock movements using this novel approach of treating financial data as images for CNN analysis.

Conclusion: CNN can effectively analyze multivariate raw financial data when treated as image-like structures, offering a promising approach for stock prediction that handles real-world market complexities better than traditional single-dimensional methods.

Abstract: This paper addresses predicting the movement of the stocks that constitute the S&P 500 index. Historically, many approaches using various methods have been tried to predict stock movement, and traditional mathematical approaches are currently used in the market for algorithmic trading and alpha-generating systems [1, 2]. The recent success of artificial neural networks has created considerable interest and paved the way for prediction using cutting-edge research in machine learning and deep learning. Several of these papers have done a great job of implementing these new technologies and explaining their benefits. Although most of these papers do not engage with the complexity of financial data and mostly utilize single-dimensional data, they were still successful in laying the groundwork for future research in this comparatively new phenomenon. In this paper, I use multivariate raw data, including stock split/dividend events as they appear (as-is) in real-world market data, instead of engineered financial features. A Convolutional Neural Network (CNN), the best-known tool so far for image classification, is applied to multi-dimensional stock numbers taken from the market, treating them as vectors of historical data matrices (read: images), and the model achieves promising results. Predictions can be made stock by stock (i.e., for a single stock), sector-wise, or for a portfolio of stocks.

[117] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: SyncAnyone: A two-stage diffusion framework for high-quality AI video dubbing that achieves accurate lip-sync while preserving facial structure and background consistency.

DetailsMotivation: Existing mask-based training methods for video dubbing disrupt spatiotemporal context, causing instability in facial structure and background consistency despite achieving lip-sync accuracy.

Method: Two-stage learning framework: Stage 1 trains diffusion-based video transformer for masked mouth inpainting; Stage 2 uses mask-free tuning with synthetic pseudo-paired data to correct mask-induced artifacts.

Result: Achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in wild lip-syncing scenarios.

Conclusion: SyncAnyone overcomes limitations of mask-based approaches by combining accurate motion modeling with high visual fidelity through a novel two-stage framework.

Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.

[118] Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky

Main category: cs.CV

TL;DR: Scene-VLM: A fine-tuned vision-language model for video scene segmentation that jointly processes visual and textual cues, achieves SOTA performance with significant improvements over previous methods.

DetailsMotivation: Existing encoder-based methods for video scene segmentation have limitations: visual-centric biases, isolated shot classification without sequential dependencies, lack of narrative understanding, and poor explainability.

Method: Scene-VLM is a fine-tuned VLM framework that processes visual frames, transcriptions, and optional metadata for multimodal reasoning. It uses sequential prediction with causal dependencies, context-focus windows for temporal context, extracts confidence scores from token-level logits, and can generate natural-language rationales through targeted supervision.
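
Extracting a confidence score from token-level logits reduces to renormalizing the decision tokens. A minimal sketch, with hypothetical token ids, is shown below; thresholding the resulting probability gives the controllable precision-recall trade-off the summary mentions.

```python
# Minimal sketch (token ids and framing are assumptions): turn the logits of
# the boundary-decision token into a probability, the way Scene-VLM derives
# confidence scores for precision-recall control.
import torch

def boundary_confidence(logits, yes_id, no_id):
    # logits: (vocab,) next-token logits at the position where the model
    # answers whether a shot starts a new scene.
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()   # P("yes" | yes-or-no)

vocab = 32000
logits = torch.randn(vocab)
p = boundary_confidence(logits, yes_id=9891, no_id=1939)  # hypothetical ids
is_boundary = p > 0.6                  # threshold sets the P/R trade-off
```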

Result: Achieves state-of-the-art performance on standard benchmarks. On MovieNet: +6 AP and +13.7 F1 improvements over previous leading method.

Conclusion: Scene-VLM successfully addresses limitations of previous methods by leveraging multimodal reasoning, sequential dependencies, and explainable decision-making, establishing a new SOTA for video scene segmentation.

Abstract: Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.

[119] Balancing Accuracy and Efficiency: CNN Fusion Models for Diabetic Retinopathy Screening

Md Rafid Islam, Rafsan Jany, Akib Ahmed, Mohammad Ashrafuzzaman Khan

Main category: cs.CV

TL;DR: Feature-level fusion of CNN backbones improves diabetic retinopathy screening accuracy on heterogeneous fundus images, with EfficientNet-B0 + DenseNet121 fusion achieving best performance (82.89% accuracy) while maintaining good computational efficiency.

DetailsMotivation: Diabetic retinopathy screening faces challenges due to limited specialist availability and variable image quality across different devices and populations. There's a need for accurate, efficient, and scalable screening methods that can generalize across diverse datasets.

Method: The study uses 11,156 fundus images from five public datasets (APTOS, EyePACS, IDRiD, Messidor, ODIR) for binary DR classification. It compares three pretrained CNN models (ResNet50, EfficientNet-B0, DenseNet121) against their pairwise and tri-fusion variants through feature-level fusion, evaluated across five independent runs.
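
Feature-level fusion of the two backbones is easy to reproduce with torchvision. The sketch below mirrors the Eff+Den variant in structure only; the head and pooling are assumptions, and pretrained weights are omitted so the snippet runs offline.

```python
# Minimal sketch (head and pooling assumed): feature-level fusion of
# EfficientNet-B0 and DenseNet121 for binary DR screening, mirroring the
# paper's best Eff+Den variant. Pass pretrained weights in practice.
import torch
import torch.nn as nn
from torchvision import models

class EffDenFusion(nn.Module):
    def __init__(self):
        super().__init__()
        eff = models.efficientnet_b0(weights=None)   # use pretrained in practice
        den = models.densenet121(weights=None)
        self.eff = nn.Sequential(eff.features, nn.AdaptiveAvgPool2d(1))  # 1280-d
        self.den = nn.Sequential(den.features, nn.AdaptiveAvgPool2d(1))  # 1024-d
        self.head = nn.Linear(1280 + 1024, 2)        # normal vs diabetic

    def forward(self, x):
        f = torch.cat([self.eff(x).flatten(1), self.den(x).flatten(1)], dim=1)
        return self.head(f)

logits = EffDenFusion()(torch.randn(1, 3, 224, 224))
```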

Result: Fusion models consistently outperform single backbones. EfficientNet-B0 + DenseNet121 fusion achieves best overall performance (82.89% accuracy) with balanced F1-scores (normal: 83.60%, diabetic: 82.60%). EfficientNet-B0 is fastest (1.16 ms/image), while Eff+Den fusion offers optimal accuracy-latency balance. Tri-fusion is competitive but computationally expensive.

Conclusion: Lightweight feature fusion enhances generalization across heterogeneous datasets and supports scalable binary DR screening workflows where both accuracy and throughput are critical, offering a practical solution for large-scale screening applications.

Abstract: Diabetic retinopathy (DR) remains a leading cause of preventable blindness, yet large-scale screening is constrained by limited specialist availability and variable image quality across devices and populations. This work investigates whether feature-level fusion of complementary convolutional neural network (CNN) backbones can deliver accurate and efficient binary DR screening on globally sourced fundus images. Using 11,156 images pooled from five public datasets (APTOS, EyePACS, IDRiD, Messidor, and ODIR), we frame DR detection as a binary classification task and compare three pretrained models (ResNet50, EfficientNet-B0, and DenseNet121) against pairwise and tri-fusion variants. Across five independent runs, fusion consistently outperforms single backbones. The EfficientNet-B0 + DenseNet121 (Eff+Den) fusion model achieves the best overall mean performance (accuracy: 82.89%) with balanced class-wise F1-scores for normal (83.60%) and diabetic (82.60%) cases. While the tri-fusion is competitive, it incurs a substantially higher computational cost. Inference profiling highlights a practical trade-off: EfficientNet-B0 is the fastest (approximately 1.16 ms/image at batch size 1000), whereas the Eff+Den fusion offers a favorable accuracy–latency balance. These findings indicate that lightweight feature fusion can enhance generalization across heterogeneous datasets, supporting scalable binary DR screening workflows where both accuracy and throughput are critical.

[120] AI for Mycetoma Diagnosis in Histopathological Images: The MICCAI 2024 Challenge

Hyam Omar Ali, Sahar Alhesseen, Lamis Elkhair, Adrian Galdran, Ming Feng, Zhixiang Xiong, Zengming Lin, Kele Xu, Liang Hu, Benjamin Keel, Oliver Mills, James Battye, Akshay Kumar, Asra Aslam, Prasad Dutande, Ujjwal Baid, Bhakti Baheti, Suhas Gajre, Aravind Shrenivas Murali, Eung-Joo Lee, Ahmed Fahal, Rachid Jennane

Main category: cs.CV

TL;DR: The paper presents the mAIcetoma challenge which aimed to advance mycetoma diagnosis through AI solutions for segmenting grains and classifying types from histopathological images, with five teams achieving high accuracy results.

DetailsMotivation: Mycetoma is a neglected tropical disease causing severe tissue damage, primarily affecting poor rural communities. Diagnosis is challenging in low-resource settings where expert pathologists are limited, creating a need for automated AI solutions to improve diagnosis.

Method: Organized the Mycetoma MicroImage: Detect and Classify Challenge (mAIcetoma) to develop automated models for segmenting mycetoma grains and classifying mycetoma types from histopathological images. Provided the Mycetoma database (MyData) as a standardized dataset for participants to develop and test deep learning architectures.

Result: Five finalist teams participated and proposed various deep learning models. All models achieved high segmentation accuracy, demonstrating the importance of grain detection in diagnosis. The top-performing models also classified mycetoma types with strong accuracy.

Conclusion: The mAIcetoma challenge successfully advanced mycetoma diagnosis through AI solutions, showing that automated models can effectively segment grains and classify mycetoma types, potentially addressing diagnostic challenges in low-resource settings.

Abstract: Mycetoma is a neglected tropical disease caused by fungi or bacteria, leading to severe tissue damage and disabilities. It affects poor and rural communities and presents medical challenges and socioeconomic burdens on patients and healthcare systems in endemic regions worldwide. Mycetoma diagnosis is a major challenge in mycetoma management, particularly in low-resource settings where expert pathologists are limited. To address this challenge, this paper presents an overview of the Mycetoma MicroImage: Detect and Classify Challenge (mAIcetoma), which was organized to advance mycetoma diagnosis through AI solutions. mAIcetoma focused on developing automated models for segmenting mycetoma grains and classifying mycetoma types from histopathological images. The challenge attracted teams worldwide, and five finalist teams fulfilled the challenge objectives. The teams proposed various deep learning architectures toward the ultimate goal of the challenge. The Mycetoma database (MyData) was provided to participants as a standardized dataset for running the proposed models, which were then assessed with common evaluation metrics. Results showed that all the models achieved high segmentation accuracy, emphasizing the necessity of grain detection as a critical step in mycetoma diagnosis. In addition, the top-performing models showed strong performance in classifying mycetoma types.

[121] Diffusion Posterior Sampling for Super-Resolution under Gaussian Measurement Noise

Abu Hanif Muhammad Syarubany

Main category: cs.CV

TL;DR: This paper studies diffusion posterior sampling (DPS) for single-image super-resolution, implementing a likelihood-guided sampling procedure that combines unconditional diffusion priors with gradient-based conditioning for 4× super-resolution with Gaussian noise.

DetailsMotivation: The motivation is to develop effective posterior sampling methods for single-image super-resolution without needing to retrain diffusion models for each specific degradation operator, balancing diffusion priors with measurement consistency.

Method: The method implements a likelihood-guided sampling procedure combining unconditional diffusion priors with gradient-based conditioning to enforce measurement consistency for 4× super-resolution with additive Gaussian noise. It evaluates posterior sampling conditioning across guidance scales and noise levels.
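
A minimal sketch of a DPS-style guided update follows. The callables `predict_x0` and `ddpm_update` are hypothetical placeholders for a trained diffusion prior's Tweedie estimate and unconditional ancestral step; only the guidance arithmetic is the point here.

```python
import torch
import torch.nn.functional as F

def dps_step(x_t, predict_x0, ddpm_update, A, y, zeta=0.95):
    """One DPS update for the measurement model y = A(x) + Gaussian noise.
    predict_x0: Tweedie estimate of the clean image from x_t (placeholder here)
    ddpm_update: unconditional ancestral step x_t -> x_{t-1} (placeholder here)
    A: known degradation operator, e.g. 4x downsampling
    """
    x_t = x_t.detach().requires_grad_(True)
    residual = torch.linalg.vector_norm(y - A(predict_x0(x_t)))
    grad = torch.autograd.grad(residual, x_t)[0]
    # Guided update: prior step minus a gradient pull toward measurement consistency.
    return ddpm_update(x_t).detach() - zeta * grad

# Toy demo with stand-in callables; a real sampler plugs in a trained diffusion prior.
A = lambda x: F.avg_pool2d(x, 4)
y = A(torch.rand(1, 3, 64, 64)) + 0.01 * torch.randn(1, 3, 16, 16)
x_prev = dps_step(torch.randn(1, 3, 64, 64),
                  predict_x0=lambda x: x, ddpm_update=lambda x: 0.99 * x, A=A, y=y)
```

The guidance scale `zeta` plays the role of the PS scale swept in the paper's ablation.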

Result: Moderate guidance improves reconstruction quality, with best configuration at PS scale 0.95 and noise standard deviation σ=0.01 (score 1.45231). The selected PS setting restores sharper edges and more coherent facial details compared to downsampled inputs, outperforming alternative conditioning strategies like MCG and PS-annealed.

Conclusion: The findings highlight the importance of balancing diffusion priors and measurement-gradient strength to obtain stable, high-quality reconstructions without retraining the diffusion model for each operator, demonstrating effective posterior sampling for super-resolution tasks.

Abstract: This report studies diffusion posterior sampling (DPS) for single-image super-resolution (SISR) under a known degradation model. We implement a likelihood-guided sampling procedure that combines an unconditional diffusion prior with gradient-based conditioning to enforce measurement consistency for $4\times$ super-resolution with additive Gaussian noise. We evaluate posterior sampling (PS) conditioning across guidance scales and noise levels, using PSNR and SSIM as fidelity metrics and a combined selection score $(\mathrm{PSNR}/40)+\mathrm{SSIM}$. Our ablation shows that moderate guidance improves reconstruction quality, with the best configuration achieved at PS scale $0.95$ and noise standard deviation $\sigma=0.01$ (score $1.45231$). Qualitative results confirm that the selected PS setting restores sharper edges and more coherent facial details compared to the downsampled inputs, while alternative conditioning strategies (e.g., MCG and PS-annealed) exhibit different texture fidelity trade-offs. These findings highlight the importance of balancing diffusion priors and measurement-gradient strength to obtain stable, high-quality reconstructions without retraining the diffusion model for each operator.

[122] Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

Main category: cs.CV

TL;DR: Selective adversarial attacks targeting only high-entropy tokens (critical decision points) in VLMs achieve comparable semantic degradation to global attacks with smaller budgets, while exposing significant safety risks by converting benign outputs to harmful ones.

DetailsMotivation: Current entropy-based attacks maximize uncertainty at all decoding steps, assuming equal token contribution to instability. The authors argue that only a small fraction of high-entropy tokens (critical decision points) disproportionately govern output trajectories, suggesting more efficient and dangerous attacks can be developed by focusing on these positions.

Method: Propose Entropy-bank Guided Adversarial attacks (EGA) that concentrate adversarial perturbations on critical high-entropy tokens (about 20% of tokens) rather than all positions. This selective approach targets the vulnerable decision points that most influence output trajectories in autoregressive generation.
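
A minimal sketch of the selective objective, assuming the attack maximizes entropy only at the top ~20% of decoding positions (shapes and the top fraction are illustrative, not EGA's exact loss):

```python
import torch
import torch.nn.functional as F

def selective_entropy_loss(logits: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    """Adversarial objective concentrated on high-entropy decoding steps.
    logits: (seq_len, vocab) token logits from the autoregressive decode.
    Returns mean entropy over the top-`top_frac` highest-entropy positions;
    an attacker maximizes this w.r.t. the image perturbation."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)     # (seq_len,)
    k = max(1, int(top_frac * entropy.numel()))
    top_entropy, _ = torch.topk(entropy, k)        # the critical decision points
    return top_entropy.mean()

logits = torch.randn(50, 32000, requires_grad=True)
selective_entropy_loss(logits).backward()   # gradients flow only through the selected forks
```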

Result: Selective attacks achieve semantic degradation comparable to global methods with substantially smaller budgets. More critically, they convert 35-49% of benign outputs into harmful ones across multiple VLMs. These vulnerable high-entropy forks recur across diverse architectures, enabling 17-26% transferability to unseen targets. EGA achieves 93-95% attack success rates with high harmful conversion.

Conclusion: A small fraction of high-entropy tokens disproportionately controls VLM output trajectories, enabling efficient selective attacks that expose critical safety vulnerabilities. The recurrence of these vulnerable decision points across architectures suggests fundamental weaknesses in current VLM safety mechanisms that need addressing.

Abstract: Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with VLM reliability. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.

[123] End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration

Zhenwei Yang, Yibo Ai, Weidong Zhang

Main category: cs.CV

TL;DR: XET-V2X is a multi-modal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing in a shared spatiotemporal representation, using dual-layer spatial cross-attention for efficient viewpoint and modality alignment.

DetailsMotivation: Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios.

Method: XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead.

Result: Experiments on V2X-Seq-SPD, V2X-Sim-V2V, and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate robust and temporally stable perception in complex traffic scenarios.

Conclusion: XET-V2X achieves robust and temporally stable perception in complex traffic scenarios through its unified multi-modal fusion approach and efficient cross-modal interaction mechanism.

Abstract: Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multi-modal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.

[124] Unsupervised Anomaly Detection in Brain MRI via Disentangled Anatomy Learning

Tao Yang, Xiuying Wang, Hao Liu, Guanzhong Gong, Lian-Ming Wu, Yu-Ping Wang, Lisheng Wang

Main category: cs.CV

TL;DR: Novel unsupervised brain lesion detection framework using disentangled representation and edge-to-image restoration to improve generalizability across multi-modality/multi-center MRIs and suppress abnormal residuals in pseudo-healthy image reconstruction.

DetailsMotivation: Current unsupervised anomaly detection methods for brain MRI have limited generalizability across different imaging modalities and centers due to reliance on specific imaging information in training data, and suffer from performance constraints due to abnormal residuals propagating to reconstructed pseudo-healthy images.

Method: Two novel modules: 1) Disentangled representation module that decouples brain MRI into imaging information and essential imaging-invariant anatomical images using brain anatomical priors and differentiable one-hot encoding; 2) Edge-to-image restoration module that reconstructs high-quality pseudo-healthy images by restoring anatomical representation from high-frequency edge information and recoupling with disentangled imaging information.
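
The "differentiable one-hot encoding" mentioned in module 1 has a standard realization via straight-through Gumbel-softmax; the paper's exact operator may differ, so the sketch below is only one plausible instantiation:

```python
import torch
import torch.nn.functional as F

def differentiable_one_hot(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-softmax: the forward pass emits hard one-hot
    anatomical class maps, while the backward pass uses the soft relaxation,
    keeping the disentanglement end-to-end trainable.
    logits: (B, C, H, W) per-voxel class scores."""
    return F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)

maps = differentiable_one_hot(torch.randn(1, 4, 64, 64))
assert torch.all(maps.sum(dim=1) == 1.0)   # exactly one class per location
```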

Result: Outperformed 17 state-of-the-art methods on nine public datasets (4,443 patients’ MRIs from multiple centers), achieving absolute improvements of +18.32% in AP (Average Precision) and +13.64% in DSC (Dice Similarity Coefficient).

Conclusion: The proposed framework effectively addresses limitations of current unsupervised methods by improving generalizability across diverse imaging conditions and suppressing abnormal residuals, leading to superior lesion detection performance in multi-modality, multi-center brain MRI analysis.

Abstract: Detection of various lesions in brain MRI is clinically critical, but challenging due to the diversity of lesions and variability in imaging conditions. Current unsupervised learning methods detect anomalies mainly through reconstructing abnormal images into pseudo-healthy images (PHIs) by normal samples learning and then analyzing differences between images. However, these unsupervised models face two significant limitations: restricted generalizability to multi-modality and multi-center MRIs due to their reliance on the specific imaging information in normal training data, and constrained performance due to abnormal residuals propagated from input images to reconstructed PHIs. To address these limitations, two novel modules are proposed, forming a new PHI reconstruction framework. Firstly, the disentangled representation module is proposed to improve generalizability by decoupling brain MRI into imaging information and essential imaging-invariant anatomical images, ensuring that the reconstruction focuses on the anatomy. Specifically, brain anatomical priors and a differentiable one-hot encoding operator are introduced to constrain the disentanglement results and enhance the disentanglement stability. Secondly, the edge-to-image restoration module is designed to reconstruct high-quality PHIs by restoring the anatomical representation from the high-frequency edge information of anatomical images, and then recoupling the disentangled imaging information. This module not only suppresses abnormal residuals in PHI by reducing abnormal pixels input through edge-only input, but also effectively reconstructs normal regions using the preserved structural details in the edges. Evaluated on nine public datasets (4,443 patients’ MRIs from multiple centers), our method outperforms 17 SOTA methods, achieving absolute improvements of +18.32% in AP and +13.64% in DSC.

[125] Scalable Class-Incremental Learning Based on Parametric Neural Collapse

Chuangxin Zhang, Guangfeng Lin, Enhui Zhao, Kaiyang Liao, Yajun Chen

Main category: cs.CV

TL;DR: SCL-PNC is a scalable class-incremental learning method that uses parametric neural collapse to enable demand-driven backbone expansion and dynamic ETF classifiers to handle evolving class distributions while maintaining feature consistency.

DetailsMotivation: Existing incremental learning methods freeze old model parameters but ignore structural efficiency issues, leading to feature differences between modules and class misalignment due to evolving class distributions in real-world scenarios.

Method: Proposes SCL-PNC with: 1) adapt-layer for demand-driven minimal-cost backbone expansion, 2) dynamic parametric ETF framework that evolves with incremental classes, 3) parallel expansion framework with knowledge distillation to align features across modules and prevent feature drift.
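
As background, the static simplex ETF that SCL-PNC's dynamic parametric variant extends can be constructed in a few lines; this is standard neural-collapse geometry, not the paper's code:

```python
import torch

def simplex_etf(num_classes: int, feat_dim: int) -> torch.Tensor:
    """(feat_dim, num_classes) simplex equiangular tight frame: unit-norm columns
    with pairwise cosine -1/(K-1), the classifier geometry predicted by neural collapse."""
    K = num_classes
    U, _ = torch.linalg.qr(torch.randn(feat_dim, K))   # random orthonormal basis
    center = torch.eye(K) - torch.ones(K, K) / K
    return (K / (K - 1)) ** 0.5 * U @ center

W = simplex_etf(num_classes=10, feat_dim=512)
gram = W.T @ W   # diagonal ~= 1, off-diagonal ~= -1/9
```

Making this frame grow and reparameterize as classes arrive is the "dynamic parametric" piece the method adds.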

Result: Experiments on standard benchmarks demonstrate the method’s effectiveness and efficiency in handling model expansion with increasing categories while addressing class misalignment and feature consistency issues.

Conclusion: SCL-PNC successfully addresses incremental learning challenges by combining expandable backbone, adapt-layer, and parametric ETF classifier through neural collapse principles, providing a scalable solution for real-world class-incremental learning scenarios.

Abstract: Incremental learning often encounters challenges such as overfitting to new data and catastrophic forgetting of old data. Existing methods can effectively extend the model for new tasks while freezing the parameters of the old model, but they ignore structural efficiency, which leads to feature differences between modules and to class misalignment under evolving class distributions. To address these issues, we propose scalable class-incremental learning based on parametric neural collapse (SCL-PNC), which enables demand-driven, minimal-cost backbone expansion via an adapt-layer and refines the static Equiangular Tight Frame (ETF) into a dynamic parametric ETF framework that grows with the incremental classes. This method can efficiently handle model expansion as the number of categories increases in real-world scenarios. Additionally, to counteract feature drift in serial expansion models, a parallel expansion framework is presented with a knowledge distillation algorithm to align features across expansion modules. Therefore, SCL-PNC can not only design a dynamic and extensible ETF classifier to address class misalignment due to evolving class distributions, but also ensure feature consistency via an adapt-layer with knowledge distillation between extended modules. By leveraging neural collapse, SCL-PNC induces the convergence of the incremental expansion model through a structured combination of the expandable backbone, adapt-layer, and parametric ETF classifier. Experiments on standard benchmarks demonstrate the effectiveness and efficiency of our proposed method. Our code is available at https://github.com/zhangchuangxin71-cyber/dynamic_ETF2. Keywords: Class incremental learning; Catastrophic forgetting; Neural collapse; Knowledge distillation; Expanded model.

[126] LVLM-Aided Alignment of Task-Specific Vision Models

Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner

Main category: cs.CV

TL;DR: LVLM-VA: A method using Large Vision Language Models to align small vision models with human domain knowledge, reducing reliance on spurious correlations and biases without fine-grained feedback.

DetailsMotivation: Small task-specific vision models are computationally efficient and explainable, but their explanations often reveal they rely on spurious correlations rather than aligning with human domain knowledge, leading to brittle real-world performance.

Method: LVLM-Aided Visual Alignment (LVLM-VA) uses a Large Vision Language Model to create a bidirectional interface: translating model behavior to natural language and mapping human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model.

Result: The method demonstrates substantial improvement in aligning model behavior with human specifications on both synthetic and real-world datasets, effectively reducing dependence on spurious features and group-specific biases.

Conclusion: LVLM-VA provides an efficient way to align small vision models with human domain knowledge using LVLMs, addressing the problem of models relying on spurious correlations while maintaining computational efficiency and explainability.

Abstract: In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real-world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model’s dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.

[127] Breaking Alignment Barriers: TPS-Driven Semantic Correlation Learning for Alignment-Free RGB-T Salient Object Detection

Lupiao Hu, Fasheng Wang, Fangmei Chen, Fuming Sun, Haojie Li

Main category: cs.CV

TL;DR: Proposes TPS-SCL, an efficient RGB-T salient object detection method for real-world unaligned image pairs using MobileViT encoder, Mamba scanning, and thin-plate spline alignment.

DetailsMotivation: Existing RGB-T SOD methods rely on manually aligned datasets and struggle with real-world unaligned image pairs due to spatial misalignment, scale variations, and viewpoint shifts, causing performance degradation.

Method: Uses dual-stream MobileViT encoder with Mamba scanning mechanisms, Semantic Correlation Constraint Module (SCCM) to suppress background interference, Thin-Plate Spline Alignment Module (TPSAM) for spatial alignment, and Cross-Modal Correlation Module (CMCM) for inter-modal integration.

Result: Extensive experiments show TPS-SCL achieves state-of-the-art performance among lightweight SOD methods and outperforms mainstream RGB-T SOD approaches on various datasets.

Conclusion: TPS-SCL effectively addresses the challenges of real-world unaligned RGB-T image pairs through efficient architecture design and specialized modules for alignment and correlation learning.

Abstract: Existing RGB-T salient object detection methods predominantly rely on manually aligned and annotated datasets, struggling to handle real-world scenarios with raw, unaligned RGB-T image pairs. In practical applications, due to significant cross-modal disparities such as spatial misalignment, scale variations, and viewpoint shifts, the performance of current methods drastically deteriorates on unaligned datasets. To address this issue, we propose an efficient RGB-T SOD method for real-world unaligned image pairs, termed Thin-Plate Spline-driven Semantic Correlation Learning Network (TPS-SCL). We employ a dual-stream MobileViT as the encoder, combined with efficient Mamba scanning mechanisms, to effectively model correlations between the two modalities while maintaining low parameter counts and computational overhead. To suppress interference from redundant background information during alignment, we design a Semantic Correlation Constraint Module (SCCM) to hierarchically constrain salient features. Furthermore, we introduce a Thin-Plate Spline Alignment Module (TPSAM) to mitigate spatial discrepancies between modalities. Additionally, a Cross-Modal Correlation Module (CMCM) is incorporated to fully explore and integrate inter-modal dependencies, enhancing detection performance. Extensive experiments on various datasets demonstrate that TPS-SCL attains state-of-the-art (SOTA) performance among existing lightweight SOD methods and outperforms mainstream RGB-T SOD approaches.

[128] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

Wen Jiang, Li Wang, Kangyao Huang, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji

Main category: cs.CV

TL;DR: LongFly is a spatiotemporal context modeling framework for long-horizon UAV vision-and-language navigation that improves semantic alignment and path planning by transforming fragmented historical data into structured representations.

DetailsMotivation: Current UAV vision-and-language navigation methods struggle with modeling long-horizon spatiotemporal context in complex post-disaster environments, leading to inaccurate semantic alignment and unstable path planning due to high information density, rapid viewpoint changes, and dynamic structures.

Method: LongFly uses a history-aware spatiotemporal modeling strategy with three modules: 1) slot-based historical image compression to distill multi-view observations into fixed-length contextual representations, 2) spatiotemporal trajectory encoding to capture temporal dynamics and spatial structure, and 3) prompt-guided multimodal integration to combine spatiotemporal context with current observations for time-based reasoning and waypoint prediction.
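
One plausible realization of the slot-based compression in module 1 is cross-attention with a fixed set of learnable slot queries; the dimensions and slot count below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SlotCompressor(nn.Module):
    """Compress a variable number of historical view tokens into `num_slots`
    fixed-length context vectors via cross-attention with learnable slot queries."""
    def __init__(self, dim: int = 256, num_slots: int = 8, heads: int = 4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, N, dim), where N varies with trajectory length
        q = self.slots.unsqueeze(0).expand(history.size(0), -1, -1)
        out, _ = self.attn(q, history, history)
        return out   # (B, num_slots, dim), fixed length regardless of N

ctx = SlotCompressor()(torch.randn(2, 100, 256))   # (2, 8, 256)
```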

Result: LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, showing consistent performance improvements across both seen and unseen environments.

Conclusion: The proposed LongFly framework effectively addresses the challenges of long-horizon spatiotemporal context modeling in UAV vision-and-language navigation, demonstrating significant improvements in navigation performance for post-disaster search and rescue scenarios.

Abstract: Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.

[129] Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees

Haodong Lei, Hongsong Wang, Xin Geng, Liang Wang, Pan Zhou

Main category: cs.CV

TL;DR: ADT-Tree accelerates autoregressive image generation using adjacency-adaptive dynamic draft trees that adjust depth/width based on regional prediction difficulty, achieving 3x speedup.

DetailsMotivation: Autoregressive image models match diffusion quality but suffer from slow sequential inference (~2000 steps for 576x576 images). Existing speculative decoding with draft trees underperforms on visual AR models due to spatially varying token prediction difficulty and inconsistent acceptance rates across different image regions.

Method: Proposes Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree) that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. Initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, creating deeper trees in simple regions and wider trees in complex ones.

Result: Achieves speedups of 3.13x on MS-COCO 2017 and 3.05x on PartiPrompts. Integrates seamlessly with relaxed sampling methods like LANTERN for further acceleration.

Conclusion: ADT-Tree effectively addresses the spatial prediction difficulty variation in visual AR models, enabling significant acceleration while maintaining quality, with potential for further speedup through integration with relaxed sampling techniques.

Abstract: Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative decoding to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive dynamic draft tree that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones. The empirical evaluations on MS-COCO 2017 and PartiPrompts demonstrate that ADT-Tree achieves speedups of 3.13x and 3.05x, respectively. Moreover, it integrates seamlessly with relaxed sampling methods such as LANTERN, enabling further acceleration. Code is available at https://github.com/Haodong-Lei-Ray/ADT-Tree.

[130] Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue

Main category: cs.CV

TL;DR: DIOR is a training-free method that uses large vision-language models to generate conditional image embeddings by prompting them to describe images with single words related to given conditions, outperforming existing approaches including CLIP.

DetailsMotivation: Current vision foundation models like CLIP provide rich image representations but aren't designed to focus on specific conditional aspects (e.g., color, genre) indicated by textual conditions. There's a need for methods that can generate embeddings focusing on particular image attributes without requiring additional training.

Method: DIOR leverages large vision-language models (LVLMs) in a training-free approach: 1) Prompt the LVLM to describe an image with a single word related to a given condition, 2) Extract the hidden state vector of the LVLM’s last token as the conditional image embedding. This works for any image and condition without task-specific priors.
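
A minimal sketch of the extraction step, assuming a generic chat-style LVLM from Hugging Face; the checkpoint name and prompt template below are illustrative stand-ins, not necessarily DIOR's actual setup:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Hypothetical checkpoint; any chat-style LVLM exposing hidden states would do.
name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(name)
model = AutoModelForVision2Seq.from_pretrained(name)

def conditional_embedding(image, condition: str) -> torch.Tensor:
    """Prompt the LVLM to describe `image` in one word with respect to `condition`,
    then take the last layer's hidden state at the final prompt token."""
    prompt = f"USER: <image>\nDescribe the {condition} of this image in one word. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]   # the conditional image embedding
```

Because the last token's hidden state is about to emit the one-word answer, it concentrates exactly the condition-relevant information, which is what makes it usable as an embedding.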

Result: DIOR outperforms existing training-free baselines including CLIP on conditional image similarity tasks. It also achieves superior performance compared to methods that require additional training across multiple settings.

Conclusion: DIOR provides a versatile, training-free solution for generating conditional image embeddings that focus on specific aspects indicated by textual conditions, demonstrating strong performance across various settings without requiring additional training or task-specific priors.

Abstract: Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), which has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM’s last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.

[131] StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars

Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, Yong-Jin Liu

Main category: cs.CV

TL;DR: Two-stage autoregressive framework adapts diffusion models for real-time interactive avatars with full-body gestures and natural talking/listening behaviors.

DetailsMotivation: Current diffusion-based avatar methods are non-causal and computationally expensive for streaming, while existing interactive approaches are limited to head-and-shoulder regions without full-body gestures.

Method: Two-stage autoregressive adaptation with autoregressive distillation and adversarial refinement, featuring Reference Sink, Reference-Anchored Positional Re-encoding (RAPR), and Consistency-Aware Discriminator for stability.

Result: Achieves state-of-the-art performance in generation quality, real-time efficiency, and interaction naturalness, enabling one-shot interactive human avatars with coherent gestures.

Conclusion: Proposed framework successfully enables real-time streaming interactive avatars with full-body gestures and natural behaviors, overcoming limitations of existing diffusion-based and interactive approaches.

Abstract: Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io .

[132] EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

Yihan Hu, Xuelin Chen, Xiaodong Cun

Main category: cs.CV

TL;DR: EasyOmnimatte: A unified, end-to-end video omnimatte method that uses dual-expert diffusion model finetuning for high-quality foreground layer decomposition with associated effects.

DetailsMotivation: Existing video omnimatte methods rely on slow, multi-stage pipelines that don't fully exploit generative priors, producing suboptimal decompositions. The authors recognize that if video inpainting models can remove foreground effects, they should also be able to decompose them.

Method: Finetune a pretrained video inpainting diffusion model with two complementary experts: 1) Effect Expert with LoRA applied only to effect-sensitive DiT blocks to capture foreground and effects, and 2) Quality Expert with full LoRA finetuning to refine alpha mattes. During sampling, Effect Expert handles early high-noise steps while Quality Expert handles later low-noise steps.
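
The expert hand-off over the noise schedule reduces to a simple control flow; the sketch below uses stand-in denoisers and an assumed 50/50 switch point purely for illustration:

```python
import torch

def dual_expert_denoise(x, timesteps, effect_expert, quality_expert, switch_frac=0.5):
    """Sample with two LoRA experts sharing one backbone: the Effect Expert runs
    the early high-noise steps (coarse foreground + effects) and the Quality
    Expert takes over at low noise to refine the alpha matte."""
    n_switch = int(switch_frac * len(timesteps))
    for i, t in enumerate(timesteps):       # timesteps ordered high -> low noise
        expert = effect_expert if i < n_switch else quality_expert
        x = expert(x, t)                    # one denoising update
    return x

# Toy demo with stand-in denoisers; the switch fraction is a tunable assumption.
out = dual_expert_denoise(torch.randn(1, 4, 32, 32), list(range(49, -1, -1)),
                          effect_expert=lambda x, t: 0.98 * x,
                          quality_expert=lambda x, t: 0.99 * x)
```

Since both experts are LoRA adapters over the same frozen backbone, switching between them costs one adapter swap rather than a second full diffusion pass.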

Result: EasyOmnimatte sets new state-of-the-art for video omnimatte, significantly outperforming baselines in both quality and efficiency. The Dual-Expert strategy is validated through ablation studies, and the method enables various downstream tasks.

Conclusion: The proposed unified, end-to-end approach eliminates the need for multiple diffusion passes, reduces computational cost while maintaining output quality, and demonstrates that targeted LoRA application to specific DiT blocks is crucial for capturing associated effects in video omnimatte decomposition.

Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.

[133] DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

Divyansh Srivastava, Akshay Mehra, Pranav Maneriker, Debopam Sanyal, Vishnu Raj, Vijay Kamarshi, Fan Du, Joshua Kimball

Main category: cs.CV

TL;DR: DPAR is a decoder-only autoregressive model that dynamically aggregates image tokens into variable-sized patches based on information content, reducing computational costs while improving image generation quality.

DetailsMotivation: Standard decoder-only autoregressive image generation suffers from quadratic growth in token counts with resolution, increasing computational and memory demands. Fixed-length tokenization schemes are inefficient as they treat all image regions equally regardless of information content.

Method: DPAR uses next-token prediction entropy from a lightweight unsupervised autoregressive model as a criterion to dynamically merge tokens into larger patches based on information content. It makes minimal modifications to standard decoder architecture and trains with dynamically sized patches to create boundary-robust representations.
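
To make the idea concrete, here is a toy entropy-thresholded aggregation over a 1-D token sequence. The greedy rule and the `tau` budget are illustrative assumptions; DPAR's actual merging criterion is defined over 2-D patches and is not reproduced here:

```python
import torch

def merge_by_entropy(tokens: torch.Tensor, entropy: torch.Tensor, tau: float):
    """Greedily grow a patch while its accumulated entropy stays below `tau`,
    so low-information regions become large patches and high-information
    regions stay fine-grained.
    tokens: (N, d) token embeddings; entropy: (N,) next-token prediction entropy."""
    patches, cur, budget = [], [tokens[0]], entropy[0].item()
    for tok, h in zip(tokens[1:], entropy[1:]):
        if budget + h.item() <= tau:
            cur.append(tok)
            budget += h.item()
        else:
            patches.append(torch.stack(cur).mean(0))
            cur, budget = [tok], h.item()
    patches.append(torch.stack(cur).mean(0))
    return torch.stack(patches)   # (M, d) with M <= N

patches = merge_by_entropy(torch.randn(16, 8), torch.rand(16) * 2.0, tau=2.5)
```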

Result: DPAR reduces token count by 1.81x (Imagenet 256) and 2.06x (Imagenet 384), with up to 40% FLOPs reduction in training costs. It shows faster convergence and improves FID by up to 27.1% relative to baseline models.

Conclusion: DPAR demonstrates that dynamic token aggregation based on information content is an effective approach for efficient decoder-only autoregressive image generation, maintaining compatibility with multimodal frameworks while allocating more compute to high-information regions.

Abstract: Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on Imagenet 256 and 384 generation resolution respectively, leading to a reduction of up to 40% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.

[134] SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis

Mo Wang, Junfeng Xia, Wenhao Ye, Enyu Liu, Kaining Peng, Jianfeng Feng, Quanying Liu, Hongkai Wen

Main category: cs.CV

TL;DR: SLIM-Brain is a new fMRI foundation model that improves both data- and training-efficiency through a two-stage adaptive design with temporal saliency ranking and hierarchical 4D encoding, achieving SOTA performance with only 4k pre-training sessions and 30% GPU memory usage.

DetailsMotivation: Current fMRI foundation models face a dual bottleneck: atlas-based methods lose spatial details and need huge datasets, while atlas-free methods preserve spatial fidelity but are too computationally expensive for large-scale pre-training.

Method: Two-stage adaptive design: (1) lightweight temporal extractor ranks data windows by saliency, (2) 4D hierarchical encoder (Hiera-JEPA) learns voxel-level representations only from top-k selected windows while masking ~70% of patches.
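
The stage-1 selection logic amounts to scoring temporal windows and keeping the top-k; the window length, stride, and saliency proxy below are assumptions for a toy demo, not SLIM-Brain's trained extractor:

```python
import torch

def select_salient_windows(fmri: torch.Tensor, saliency_net, win_len: int, k: int):
    """Score every temporal window with a lightweight extractor and keep only the
    top-k for the expensive 4D encoder. fmri: (T, X, Y, Z) voxel time series."""
    starts = range(0, fmri.shape[0] - win_len + 1, win_len)
    windows = torch.stack([fmri[s:s + win_len] for s in starts])   # (W, win_len, X, Y, Z)
    scores = torch.stack([saliency_net(w) for w in windows])       # (W,)
    top = torch.topk(scores, k=min(k, len(windows))).indices
    return windows[top]

# Toy demo: the saliency proxy is just mean absolute signal in the window.
picked = select_salient_windows(torch.randn(120, 8, 8, 8),
                                lambda w: w.abs().mean(), win_len=20, k=3)
```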

Result: Establishes new state-of-the-art performance across seven public benchmarks while requiring only 4,000 pre-training sessions and approximately 30% of GPU memory compared to traditional voxel-level methods.

Conclusion: SLIM-Brain successfully addresses the data- and training-efficiency bottlenecks in fMRI foundation models, enabling atlas-free modeling with practical computational requirements while maintaining spatial fidelity and achieving superior performance.

Abstract: Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models. Atlas-free methods, on the other hand, operate directly on voxel-level information - preserving spatial fidelity but are prohibitively memory- and compute-intensive, making large-scale pre-training infeasible. We introduce SLIM-Brain (Sample-efficient, Low-memory fMRI Foundation Model for Human Brain), a new atlas-free foundation model that simultaneously improves both data- and training-efficiency. SLIM-Brain adopts a two-stage adaptive design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency, and (ii) a 4D hierarchical encoder (Hiera-JEPA) learns fine-grained voxel-level representations only from the top-$k$ selected windows, while discarding about 70% of masked patches. Extensive experiments across seven public benchmarks show that SLIM-Brain establishes new state-of-the-art performance on diverse tasks, while requiring only 4 thousand pre-training sessions and approximately 30% of GPU memory compared with traditional voxel-level methods.

[135] Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer

Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, Hesheng Wang

Main category: cs.CV

TL;DR: Reloc-VGGT: A multi-view visual localization framework using early-fusion spatial integration with VGGT backbone, pose tokenizer, and sparse mask attention for real-time robust pose estimation.

DetailsMotivation: Traditional visual localization uses pair-wise pose regression with late-fusion strategies that insufficiently integrate spatial information and degrade in complex environments. There's a need for more effective multi-view spatial integration.

Method: 1) Uses VGGT backbone to encode multi-view 3D geometry; 2) Introduces pose tokenizer and projection module to exploit spatial relationships from multiple database views; 3) Proposes a sparse mask attention strategy that avoids the quadratic cost of global attention, enabling real-time performance.

Result: Trained on ~8M posed image pairs, demonstrates strong accuracy and remarkable generalization. Extensive experiments across diverse public datasets validate effectiveness and efficiency, delivering high-quality camera pose estimates in real-time with robustness to unseen environments.

Conclusion: First visual localization framework with multi-view spatial integration through early-fusion mechanism, enabling robust operation in both structured and unstructured environments with real-time performance at scale.

Abstract: Visual localization has traditionally been formulated as a pair-wise pose regression problem. Existing approaches mainly estimate relative poses between two images and employ a late-fusion strategy to obtain absolute pose estimates. However, the late motion average is often insufficient for effectively integrating spatial information, and its accuracy degrades in complex environments. In this paper, we present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism, enabling robust operation in both structured and unstructured environments. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry, and we introduce a pose tokenizer and projection module to more effectively exploit spatial relationships from multiple database views. Furthermore, we propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention, thereby enabling real-time performance at scale. Trained on approximately eight million posed image pairs, Reloc-VGGT demonstrates strong accuracy and remarkable generalization ability. Extensive experiments across diverse public datasets consistently validate the effectiveness and efficiency of our approach, delivering high-quality camera pose estimates in real time while maintaining robustness to unseen environments. Our code and models will be publicly released upon acceptance: https://github.com/dtc111111/Reloc-VGGT.

[136] CrownGen: Patient-customized Crown Generation via Point Diffusion Model

Juyoung Bae, Moo Hyun Son, Jiale Peng, Wanting Qu, Wener Chen, Zelin Qiu, Kaixin Li, Xiaojuan Chen, Yifan Lin, Hao Chen

Main category: cs.CV

TL;DR: CrownGen is a generative AI framework that automates patient-customized dental crown design using diffusion models on tooth-level point clouds, reducing manual labor while maintaining clinical quality.

DetailsMotivation: Digital crown design is labor-intensive and creates a bottleneck in restorative dentistry, limiting scalability and increasing costs. There's a need for automated solutions that can produce patient-customized crowns efficiently while maintaining clinical quality standards.

Method: CrownGen uses a denoising diffusion model on a novel tooth-level point cloud representation. It has two core components: a boundary prediction module to establish spatial priors, and a diffusion-based generative module that can synthesize high-fidelity morphology for multiple teeth in a single inference pass.

Result: The system was validated on 496 external scans and 26 clinical restoration cases. CrownGen surpassed state-of-the-art models in geometric fidelity and significantly reduced active design time. Clinical assessments by trained dentists confirmed that CrownGen-assisted crowns are statistically non-inferior in quality to those produced by expert technicians using manual workflows.

Conclusion: CrownGen offers a scalable solution to automate complex prosthetic modeling, potentially lowering costs, shortening turnaround times, and enhancing patient access to high-quality dental care through automated, clinically-validated crown design.

Abstract: Digital crown design remains a labor-intensive bottleneck in restorative dentistry. We present CrownGen, a generative framework that automates patient-customized crown design using a denoising diffusion model on a novel tooth-level point cloud representation. The system employs two core components: a boundary prediction module to establish spatial priors and a diffusion-based generative module to synthesize high-fidelity morphology for multiple teeth in a single inference pass. We validated CrownGen through a quantitative benchmark on 496 external scans and a clinical study of 26 restoration cases. Results demonstrate that CrownGen surpasses state-of-the-art models in geometric fidelity and significantly reduces active design time. Clinical assessments by trained dentists confirmed that CrownGen-assisted crowns are statistically non-inferior in quality to those produced by expert technicians using manual workflows. By automating complex prosthetic modeling, CrownGen offers a scalable solution to lower costs, shorten turnaround times, and enhance patient access to high-quality dental care.

[137] High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer

Shen Zheng, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Xingpei Ma, Junjie Cao, Hanfeng Zhao, Qiang Zhang, Shunsi Zhang, Xiao-Ping Zhang

Main category: cs.CV

TL;DR: A DiT-based framework for high-fidelity, long-duration human animation with improved facial/hand details and arbitrary-length video generation.

DetailsMotivation: Existing diffusion models struggle with long-duration video generation and lack fine-grained facial/hand details, limiting real-world high-quality applications.

Method: 1) Hybrid implicit guidance signals + sharpness guidance factor for facial/hand details; 2) Position Shift Adaptive Module for arbitrary-length videos; 3) Data augmentation + skeleton alignment for identity shape variations.

Result: Outperforms state-of-the-art approaches in both high-fidelity and long-duration human image animation.

Conclusion: Proposed framework successfully addresses limitations in long-duration video generation and fine-grained detail synthesis for human animation applications.

Abstract: Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, significant challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module, which enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving superior performance in both high-fidelity and long-duration human image animation.

[138] Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition

Zeyu Liang, Hailun Xia, Naichuan Zheng

Main category: cs.CV

TL;DR: PAN: Human-centric graph representation learning framework for multimodal action recognition that fuses RGB and skeleton modalities using spatiotemporal graphs of token embeddings around human joints.

DetailsMotivation: Existing multimodal methods fusing RGB and skeleton modalities suffer from inherent heterogeneity and fail to fully exploit complementary potential between modalities. Need for more effective and semantically coherent fusion approach.

Method: Proposes PAN framework with human-centric graph modeling: token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. Includes attention-based post calibration to reduce dependency on high-quality skeletal data. Two variants: PAN-Ensemble (dual-path GCNs with late fusion) and PAN-Unified (unified graph representation learning in single network).

Result: Both PAN-Ensemble and PAN-Unified achieve state-of-the-art performance on three widely used multimodal action recognition datasets in their respective settings (separate and unified modeling).

Conclusion: Human-centric graph representation learning effectively addresses heterogeneity between RGB and skeleton modalities, enabling more effective multimodal fusion for action recognition with state-of-the-art results.

Abstract: While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective multimodal fusion settings: separate and unified modeling.

[139] AutoPP: Towards Automated Product Poster Generation and Optimization

Jiahao Fan, Yuxin Qin, Wei Feng, Yanyin Chen, Yaoyu Li, Ao Ma, Yixiu Li, Li Zhuang, Haoyi Bian, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law

Main category: cs.CV

TL;DR: AutoPP is an automated pipeline for product poster generation and optimization that eliminates human intervention by combining AI-generated posters with online CTR optimization using feedback data.

DetailsMotivation: Manual creation and optimization of product posters is labor-intensive and resource-consuming, requiring designers to craft appealing visuals and then manually optimize based on online performance metrics.

Method: Two-stage approach: 1) Generator uses unified design module to integrate background, text, and layout elements, then element rendering module encodes these into condition tokens for controllable generation; 2) Optimizer enhances CTR using online feedback, systematically replacing elements for fine-grained comparisons and using Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements.
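
The isolated-preference idea can be pictured with a standard DPO-style objective applied to poster pairs that differ in a single element; this is a generic sketch, not AutoPP's actual IDPO formulation:

```python
import torch.nn.functional as F

def idpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Preference loss over (winner, loser) poster pairs. In an IDPO-style
    setup each pair differs in exactly one element (background, text, or
    layout) and the winner is the variant with the higher observed CTR, so
    the gradient signal is attributable to that element alone.
    This is the standard DPO objective; the paper's variant may differ."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```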

Result: AutoPP achieves state-of-the-art results in both offline and online settings, supported by AutoPP1M dataset containing one million high-quality posters and feedback from over one million users.

Conclusion: AutoPP provides an effective automated solution for product poster generation and optimization, eliminating human intervention while improving performance through data-driven optimization based on real user feedback.

Abstract: Product posters blend striking visuals with informative text to highlight the product and capture customer attention. However, crafting appealing posters and manually optimizing them based on online performance is laborious and resource-consuming. To address this, we introduce AutoPP, an automated pipeline for product poster generation and optimization that eliminates the need for human intervention. Specifically, the generator, relying solely on basic product information, first uses a unified design module to integrate the three key elements of a poster (background, text, and layout) into a cohesive output. Then, an element rendering module encodes these elements into condition tokens, efficiently and controllably generating the product poster. Based on the generated poster, the optimizer enhances its Click-Through Rate (CTR) by leveraging online feedback. It systematically replaces elements to gather fine-grained CTR comparisons and utilizes Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements. Our work is supported by AutoPP1M, the largest dataset specifically designed for product poster generation and optimization, which contains one million high-quality posters and feedback collected from over one million users. Experiments demonstrate that AutoPP achieves state-of-the-art results in both offline and online settings. Our code and dataset are publicly available at: https://github.com/JD-GenX/AutoPP

[140] Automated Discovery of Parsimonious Spectral Indices via Normalized Difference Polynomials

Ali Lotfi, Adam Carter, Thuan Ha, Mohammad Meysami, Kwabena Nketia, Steve Shirtliffe

Main category: cs.CV

TL;DR: Automated method finds compact spectral indices for vegetation classification by generating polynomial combinations of normalized band differences and selecting optimal subsets via feature selection.

DetailsMotivation: Remote sensing applications need automated discovery of compact, interpretable spectral indices that maintain illumination invariance for vegetation classification.

Method: Generate all pairwise normalized differences from spectral bands, create polynomial combinations up to fixed degree, then apply feature selection methods (ANOVA filtering, recursive elimination, L1-regularized SVM) to select optimal small sets of indices.
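
A self-contained sketch of the search space and the L1-based selection step using scikit-learn on toy data (the band count and feature counts follow the abstract; the regularization strength C and the random data are placeholders):

```python
import numpy as np
from itertools import combinations, combinations_with_replacement
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

def normalized_differences(X):
    """All pairwise normalized differences (b_i - b_j) / (b_i + b_j).

    X: (n_samples, n_bands) reflectances; returns (n_samples, C(n_bands, 2)).
    """
    pairs = combinations(range(X.shape[1]), 2)
    nd = [(X[:, i] - X[:, j]) / (X[:, i] + X[:, j] + 1e-12) for i, j in pairs]
    return np.stack(nd, axis=1)

def degree2_expansion(ND):
    """Base indices plus all pairwise products: 45 + 1035 = 1080 candidate
    features for a 10-band sensor, matching the count quoted in the abstract."""
    prods = [ND[:, a] * ND[:, b]
             for a, b in combinations_with_replacement(range(ND.shape[1]), 2)]
    return np.concatenate([ND, np.stack(prods, axis=1)], axis=1)

X = np.random.rand(200, 10)            # toy 10-band reflectance samples
y = np.random.randint(0, 2, 200)       # toy labels (e.g., Kochia vs. other)
feats = degree2_expansion(normalized_differences(X))   # (200, 1080)
svm = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000)
selector = SelectFromModel(svm).fit(feats, y)
print(selector.get_support().sum(), "indices selected")
```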

Result: For Kochia detection with Sentinel-2, a single degree-2 index (product of two normalized differences from red-edge bands) achieved 96.26% accuracy; eight indices reached 97.70%. Selected features consistently involved degree-2 products from bands b4-b8, indicating spectral interactions are key.

Conclusion: The automated framework successfully finds compact, interpretable spectral indices that capture discriminative spectral interactions, can be deployed in platforms like Google Earth Engine, and is generalizable to other sensors and classification tasks.

Abstract: We introduce an automated way to find compact spectral indices for vegetation classification. The idea is to take all pairwise normalized differences from the spectral bands and then build polynomial combinations up to a fixed degree, which gives a structured search space that still keeps the illumination invariance needed in remote sensing. For a sensor with n bands this produces C(n, 2) base normalized differences, and the degree-2 polynomial expansion gives 1,080 candidate features for the 10-band Sentinel-2 configuration we use here. Feature selection methods (ANOVA filtering, recursive elimination, and L1-regularized SVM) then pick out small sets of indices that reach the desired accuracy, so the final models stay simple and easy to interpret. We test the framework on Kochia (Bassia scoparia) detection using Sentinel-2 imagery from Saskatchewan, Canada (N = 2,318 samples, 2022–2024). A single degree-2 index, the product of two normalized differences from the red-edge bands, already reaches 96.26% accuracy, and using eight indices only raises this to 97.70%. In every case the chosen features are degree-2 products built from bands b4 through b8, which suggests that the discriminative signal comes from spectral interactions rather than individual band ratios. Because the indices involve only simple arithmetic, they can be deployed directly in platforms like Google Earth Engine. The same approach works for other sensors and classification tasks, and an open-source implementation (ndindex) is available.

[141] Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng

Main category: cs.CV

TL;DR: Medical MLLMs are vulnerable to real-world noise like imaging artifacts and text errors, limiting clinical use. This paper introduces a training-free framework (IMC) that leverages MLLMs’ inherent denoising capabilities to enhance robustness across visual and textual modalities.

DetailsMotivation: Medical MLLMs show promising clinical performance but are sensitive to real-world input perturbations like imaging artifacts and textual errors, which undermines their clinical applicability. Existing robustness studies focus mainly on text modality and require costly fine-tuning, making them inadequate for medical settings with complex noise patterns and strict safety standards.

Method: Proposes a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs’ inherent denoising capabilities. For visual modality: Perturbation-aware Denoising Calibration (PDC) uses MLLMs’ vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising: Self-instantiated Multi-agent System (SMS) exploits MLLMs’ self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents.
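
One way to picture the prototype-guided calibration step is as a similarity-weighted pull toward clean feature prototypes; this is our simplified reading of PDC, with tau and alpha as assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def prototype_calibrate(feat, prototypes, tau=0.5, alpha=0.5):
    """Pull a (possibly noise-corrupted) visual feature toward a convex
    combination of clean prototypes, weighted by cosine similarity.
    A minimal reading of prototype-guided calibration; the paper's exact
    formulation may differ.

    feat:       (B, D) image features from the vision encoder
    prototypes: (K, D) clean prototype features
    """
    sim = F.cosine_similarity(feat.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    w = (sim / tau).softmax(dim=-1)               # (B, K) prototype weights
    proto_feat = w @ prototypes                   # (B, D) prototype mixture
    return alpha * feat + (1 - alpha) * proto_feat
```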

Result: Constructed a benchmark with 11 types of noise across image and text modalities on 2 datasets. Experimental results demonstrate state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs’ robustness in real clinical scenarios.

Conclusion: The proposed IMC framework effectively enhances medical MLLMs’ robustness to real-world perturbations without requiring fine-tuning, addressing critical safety concerns for clinical deployment by leveraging the models’ inherent denoising capabilities across both visual and textual modalities.

Abstract: Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs’ robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs’ inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs’ own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs’ self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs’ robustness in real clinical scenarios.

[142] A Lightweight Multi-Scale Attention Framework for Real-Time Spinal Endoscopic Instance Segmentation

Qi Lai, JunYan Li, Qiang Cai, Lei Wang, Tao Yan, XiaoKun Liang

Main category: cs.CV

TL;DR: Lightweight multi-scale attention framework (LMSF-A) for real-time spinal endoscopy instance segmentation that balances accuracy and speed with only 1.8M parameters and 8.8 GFLOPs, using novel architecture co-design across backbone, neck, and head.

DetailsMotivation: Real-time instance segmentation in spinal endoscopy is challenging due to narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is constrained by limited surgical hardware requiring models to balance accuracy and speed while remaining stable under small-batch training.

Method: LMSF-A framework with three key components: 1) Backbone uses C2f-Pro module combining RepViT-style re-parameterized convolution (RVB) with efficient multi-scale attention (EMA) for multi-branch training that collapses to single fast path for inference; 2) Neck improves cross-scale consistency and boundary detail using Scale-Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE); 3) Head adopts Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm for parameter reduction and batch-1 stability.
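
The re-parameterization trick the backbone builds on can be shown in isolation: a 1x1 branch trained alongside a 3x3 conv folds exactly into a single 3x3 kernel at inference. This is the generic RepVGG-style identity with BatchNorm fusion omitted for brevity, not the full C2f-Pro module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConv(nn.Module):
    """Multi-branch 3x3 + 1x1 conv during training that collapses into a
    single 3x3 conv for inference."""

    def __init__(self, c):
        super().__init__()
        self.k3 = nn.Conv2d(c, c, 3, padding=1)
        self.k1 = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.k3(x) + self.k1(x)

    def fuse(self):
        """Fold the 1x1 branch into the 3x3 kernel: zero-pad it to 3x3, add."""
        fused = nn.Conv2d(self.k3.in_channels, self.k3.out_channels, 3, padding=1)
        fused.weight.data.copy_(self.k3.weight + F.pad(self.k1.weight, [1, 1, 1, 1]))
        fused.bias.data.copy_(self.k3.bias + self.k1.bias)
        return fused

x = torch.randn(1, 8, 16, 16)
m = RepConv(8)
assert torch.allclose(m(x), m.fuse()(x), atol=1e-5)   # identical outputs
```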

Result: LMSF-A achieves highly competitive or better performance across all evaluation metrics while being much lighter than most instance segmentation methods (1.8M parameters, 8.8 GFLOPs). The model generalizes well to a public teeth benchmark. The paper also releases the clinically reviewed PELD dataset with 61 patients and 610 images with instance masks for adipose tissue, bone, ligamentum flavum, and nerve.

Conclusion: The proposed LMSF-A framework effectively addresses the challenges of real-time instance segmentation in spinal endoscopy through lightweight multi-scale attention co-design, achieving excellent accuracy-speed balance with minimal computational requirements and good generalization capability.

Abstract: Real-time instance segmentation for spinal endoscopy is important for identifying and protecting critical anatomy during surgery, but it is difficult because of the narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is also constrained by limited surgical hardware, so the model must balance accuracy and speed and remain stable under small-batch (even batch-1) training. We propose LMSF-A, a lightweight multi-scale attention framework co-designed across backbone, neck, and head. The backbone uses a C2f-Pro module that combines RepViT-style re-parameterized convolution (RVB) with efficient multi-scale attention (EMA), enabling multi-branch training while collapsing into a single fast path for inference. The neck improves cross-scale consistency and boundary detail using Scale-Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE), which strengthens high-resolution features. The head adopts a Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm to reduce parameters and support batch-1 stability. We also release the clinically reviewed PELD dataset (61 patients, 610 images) with instance masks for adipose tissue, bone, ligamentum flavum, and nerve. Experiments show that LMSF-A is highly competitive with (or better than) existing methods on all evaluation metrics while being much lighter than most instance segmentation methods, requiring only 1.8M parameters and 8.8 GFLOPs, and it generalizes well to a public teeth benchmark. Code and dataset: https://github.com/hhwmortal/PELD-Instance-segmentation.

[143] Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs

Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He

Main category: cs.CV

TL;DR: ALEAHallu is an adversarial parametric editing framework that mitigates hallucinations in Vision-Language Models by identifying and fine-tuning hallucination-prone parameter clusters using adversarial optimization.

DetailsMotivation: VLMs suffer from persistent hallucination issues due to over-reliance on linguistic priors and insufficient visual feature integration. Existing heuristic decoding calibration strategies are non-trainable, limiting their optimization potential.

Method: Proposes an Activate-Locate-Edit Adversarially paradigm: 1) Construct activation dataset with grounded vs. hallucinatory responses, 2) Identify critical hallucination-prone parameter clusters via differential hidden state analysis, 3) Fine-tune clusters using adversarial prompts optimized to maximize visual neglect, forcing visual evidence prioritization.
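
The locate step can be sketched as ranking layers by how much their hidden states diverge between grounded and hallucinatory responses; the granularity (whole layers rather than parameter clusters) and the scoring rule are simplifications of the paper's analysis:

```python
import torch

def locate_hallucination_layers(diff_by_layer, top_frac=0.05):
    """Rank layers by the relative divergence of their hidden states on
    grounded vs. hallucinatory responses and return the most divergent ones
    as candidate editing targets.

    diff_by_layer: list of (hidden_grounded, hidden_hallu) tensor pairs
    """
    scores = torch.stack([
        (h_g - h_h).norm() / (h_g.norm() + 1e-12) for h_g, h_h in diff_by_layer
    ])
    k = max(1, int(len(scores) * top_frac))
    return scores.topk(k).indices.tolist()
```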

Result: Evaluations on both generative and discriminative VLM tasks demonstrate significant effectiveness in alleviating hallucinations.

Conclusion: ALEAHallu provides a trainable framework for hallucination mitigation that forces VLMs to prioritize visual evidence over parametric biases, outperforming non-trainable heuristic approaches.

Abstract: While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs’ over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.

[144] iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception

Sarthak Mehrotra, Sairam V C Rebbapragada, Mani Hemanth Reddy Bonthu, Vineeth N Balasubramanian

Main category: cs.CV

TL;DR: iSHIFT is a lightweight 2.5B parameter GUI agent that uses implicit chain-of-thought reasoning with perception control tokens to switch between slow (detailed visual grounding) and fast (global cues) modes for efficient and precise interface interactions.

DetailsMotivation: Current MLLMs for GUI interaction struggle with balancing efficiency for routine tasks and precision for fine-grained interactions requiring exact visual grounding. They are also large and inflexible in adapting reasoning depth to task requirements.

Method: iSHIFT integrates latent thinking (implicit chain-of-thought) with a perception control module. It uses special perception tokens to guide attention to relevant screen regions, enabling the model to switch between slow mode (detailed visual grounding) and fast mode (global cues) based on task needs.
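
A toy version of the mode decision: a linear head over a dedicated perception token's hidden state picks the fast or slow path. The module boundaries here are hypothetical, since iSHIFT's control tokens live inside the MLLM itself:

```python
import torch
import torch.nn as nn

class PerceptionGate(nn.Module):
    """Binary gate over a perception token's hidden state: 0 = fast mode
    (global cues), 1 = slow mode (detailed visual grounding)."""

    def __init__(self, d_model):
        super().__init__()
        self.head = nn.Linear(d_model, 2)

    def forward(self, perception_token):           # (B, d_model)
        return self.head(perception_token).argmax(dim=-1)

gate = PerceptionGate(512)
mode = gate(torch.randn(1, 512)).item()
print("slow visual grounding" if mode == 1 else "fast global reasoning")
```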

Result: Despite its compact 2.5B parameter size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets for GUI interaction tasks.

Conclusion: iSHIFT demonstrates that lightweight agents can achieve both efficiency and precision in GUI interactions through flexible reasoning modes and attention control, addressing key limitations of current MLLM approaches.

Abstract: Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.

[145] Patch-Discontinuity Mining for Generalized Deepfake Detection

Huanhuan Yuan, Yang Ping, Zhengqin Xu, Junyi Cao, Shuai Jia, Chao Ma

Main category: cs.CV

TL;DR: GenDF is a lightweight deepfake detection framework that transfers large vision models to detect fake facial images with only 0.28M parameters, achieving state-of-the-art cross-domain generalization.

DetailsMotivation: Existing deepfake detection methods struggle with generalization to unseen forgery patterns despite strong intra-domain performance. They often rely on handcrafted forensic cues and complex architectures that degrade when facing new manipulation techniques.

Method: GenDF transfers a large-scale vision model to deepfake detection with a compact design. It incorporates: 1) deepfake-specific representation learning to capture discriminative patterns, 2) feature space redistribution to mitigate distribution mismatch, and 3) classification-invariant feature augmentation for better generalization without extra parameters.

Result: Extensive experiments show GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, demonstrating both effectiveness and efficiency.

Conclusion: GenDF provides a simple yet effective solution for deepfake detection that addresses generalization challenges through transfer learning and specialized feature processing, offering strong performance with minimal parameters.

Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of highly realistic fake facial images, posing serious threats to personal privacy and the integrity of online information. Existing deepfake detection methods often rely on handcrafted forensic cues and complex architectures, achieving strong performance in intra-domain settings but suffering significant degradation when confronted with unseen forgery patterns. In this paper, we propose GenDF, a simple yet effective framework that transfers a powerful large-scale vision model to the deepfake detection task with a compact and neat network design. GenDF incorporates deepfake-specific representation learning to capture discriminative patterns between real and fake facial images, feature space redistribution to mitigate distribution mismatch, and a classification-invariant feature augmentation strategy to enhance generalization without introducing additional trainable parameters. Extensive experiments demonstrate that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, validating the effectiveness and efficiency of the proposed framework.

[146] Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang

Main category: cs.CV

TL;DR: BadVSFM is the first effective backdoor attack framework for prompt-driven video segmentation foundation models, overcoming the ineffectiveness of traditional backdoor attacks by using a two-stage strategy to separate triggered and clean representations.

DetailsMotivation: Prompt-driven Video Segmentation Foundation Models (VSFMs) are increasingly deployed in safety-critical applications, raising concerns about backdoor threats. However, traditional backdoor attacks are ineffective on VSFMs (ASR < 5%), creating a need for specialized attack methods.

Method: BadVSFM uses a two-stage strategy: (1) Steer the image encoder so triggered frames map to a target embedding while clean frames align with a clean reference encoder; (2) Train the mask decoder so triggered frame-prompt pairs produce a shared target mask across prompt types, while clean outputs stay close to a reference decoder.
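
Stage (1) can be written as a two-term objective: an attack term mapping triggered frames to the target embedding plus a stealth term keeping clean frames aligned with a frozen reference encoder. The distance choices and the weighting lam are assumptions:

```python
import torch
import torch.nn.functional as F

def stage1_encoder_loss(enc, ref_enc, clean, triggered, target_emb, lam=1.0):
    """Simplified stage-1 objective: triggered frames are pulled toward a
    fixed target embedding while clean frames stay aligned with a frozen
    reference encoder (our reading of the two-stage attack)."""
    z_trig = enc(triggered)
    z_clean = enc(clean)
    with torch.no_grad():
        z_ref = ref_enc(clean)                       # frozen clean reference
    attack = F.mse_loss(z_trig, target_emb.expand_as(z_trig))
    stealth = F.mse_loss(z_clean, z_ref)
    return attack + lam * stealth
```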

Result: Extensive experiments on two datasets and five VSFMs show BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. The attack remains effective against four representative defenses.

Conclusion: BadVSFM reveals a previously underexplored vulnerability in current VSFMs, demonstrating that specialized backdoor attacks can effectively compromise these models despite the ineffectiveness of traditional methods, highlighting security concerns for safety-critical applications.

Abstract: Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.

[147] Pre-training Vision Transformers with Formula-driven Supervised Learning

Hirokatsu Kataoka, Sora Takashima, Ryo Hayamizu, Ryosuke Yamada, Kodai Nakashima, Xinyu Zhang, Edgar Josafat Martinez-Noriega, Nakamasa Inoue, Rio Yokota

Main category: cs.CV

TL;DR: Formula-driven supervised learning (FDSL) using synthetic images matches/exceeds ImageNet-21k performance and approaches JFT-300M without real images, human supervision, or self-supervision for ViT pre-training.

DetailsMotivation: To overcome privacy/copyright issues, labeling costs/errors, and biases associated with real images in pre-training vision models, while achieving competitive performance with synthetic data.

Method: Pre-trained Vision Transformers using formula-generated synthetic images (FDSL), specifically ExFractalDB-21k, with comparable hyperparameters and epochs to real-image baselines. Tested hypotheses about object contours and label creation complexity.
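
For intuition, formula-driven images of the fractal family can be rendered in a few lines via the chaos game on a random iterated function system; the parameters below are illustrative and not the ExFractalDB recipe:

```python
import numpy as np

def ifs_fractal(n_points=20000, n_maps=4, seed=0):
    """Render a random iterated-function-system fractal as a binary image,
    the kind of formula-generated pattern FDSL-style pre-training uses."""
    rng = np.random.default_rng(seed)
    maps = rng.uniform(-0.8, 0.8, size=(n_maps, 6))  # affine params a..f
    x = np.zeros(2)
    img = np.zeros((256, 256), dtype=np.uint8)
    for _ in range(n_points):
        a, b, c, d, e, f = maps[rng.integers(n_maps)]
        x = np.array([a * x[0] + b * x[1] + e, c * x[0] + d * x[1] + f])
        x = np.clip(x, -2.0, 2.0)                    # keep the orbit bounded
        px = ((x + 2.0) / 4.0 * 255).astype(int)
        img[px[1], px[0]] = 255
    return img

fractal = ifs_fractal()                              # (256, 256) uint8 image
```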

Result: FDSL achieved 83.8% top-1 accuracy on ImageNet-1k fine-tuning, outperforming ImageNet-21k (83.0%) and approaching JFT-300M (84.1%), while using 14.2x fewer images than JFT-300M. Simple contour datasets matched fractal performance, and increased pre-training difficulty improved fine-tuning accuracy.

Conclusion: Synthetic formula-generated images offer tremendous potential for pre-training general vision models, avoiding real-image limitations while achieving competitive performance, with object contours being key and task difficulty correlating with downstream accuracy.

Abstract: In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k and can approach that of the JFT-300M dataset without the use of real images, human supervision, or self-supervision during the pre-training of vision transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k and JFT-300M showed 83.0 and 84.1% top-1 accuracy when fine-tuned on ImageNet-1k, and FDSL showed 83.8% top-1 accuracy when pre-trained under comparable conditions (hyperparameters and number of epochs). Notably, ExFractalDB-21k pre-training used 14.2x fewer images than JFT-300M. Images generated by formulas avoid privacy and copyright issues, labeling costs and errors, and biases that real images suffer from, and thus have tremendous potential for pre-training general models. To understand the performance of the synthetic images, we tested two hypotheses, namely (i) object contours are what matter in FDSL datasets and (ii) an increased number of parameters for label creation improves performance in FDSL pre-training. To test the former hypothesis, we constructed a dataset that consisted of simple object contour combinations. We found that this dataset matched the performance of fractal databases. For the latter hypothesis, we found that increasing the difficulty of the pre-training task generally leads to better fine-tuning accuracy.

[148] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi

Main category: cs.CV

TL;DR: MAI-UI is a family of foundation GUI agents (2B-235B variants) that addresses key deployment challenges through self-evolving data, device-cloud collaboration, and online RL, achieving SOTA results on GUI grounding and mobile navigation benchmarks.

DetailsMotivation: GUI agents could revolutionize human-computer interaction, but face four key deployment challenges: lack of native agent-user interaction, UI-only operation limits, absence of practical deployment architecture, and brittleness in dynamic environments.

Method: Unified methodology with: 1) self-evolving data pipeline expanding navigation data to include user interaction and MCP tool calls, 2) native device-cloud collaboration system routing execution by task state, and 3) online RL framework with advanced optimizations for scaling parallel environments and context length.
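
The device-cloud routing component can be caricatured as a small policy over the current task state; the signals and thresholds below are invented for illustration, as the report does not spell out its routing rules at this level:

```python
from dataclasses import dataclass

@dataclass
class StepState:
    requires_login: bool      # sensitive credentials should stay on device
    grounding_failed: bool    # local model could not locate the target element
    steps_taken: int

def route_step(state: StepState, max_local_steps: int = 10) -> str:
    """Illustrative routing policy: keep execution on device for routine or
    privacy-sensitive steps, escalate to the cloud model when the local
    agent stalls."""
    if state.requires_login:
        return "device"                  # privacy: never ship credentials
    if state.grounding_failed or state.steps_taken > max_local_steps:
        return "cloud"                   # escalate hard cases
    return "device"

print(route_step(StepState(False, True, 3)))   # -> cloud
```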

Result: SOTA across benchmarks: 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, 49.2% on UI-Vision; 76.7% on AndroidWorld; 41.7% on MobileWorld. Online RL scaling shows +5.2 points from 32→512 parallel environments and +4.3 points from 15→50 environment steps. Device-cloud collaboration improves on-device performance by 33% and reduces cloud calls by 40%.

Conclusion: MAI-UI successfully addresses key GUI agent deployment challenges through its unified methodology, establishing new SOTA performance while enabling practical deployment with privacy-preserving device-cloud collaboration and scalable online RL optimization.

Abstract: The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.

[149] Yume-1.5: A Text-Controlled Interactive World Generation Model

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang

Main category: cs.CV

TL;DR: Yume-1.5 is a framework that generates interactive, continuous worlds from single images or text prompts using diffusion models with real-time performance and text control.

DetailsMotivation: Existing diffusion-based world generation methods suffer from large parameter sizes, slow inference, growing historical context issues, and lack text control, limiting real-time performance.

Method: Three core components: 1) Long-video generation with unified context compression and linear attention, 2) Real-time streaming acceleration via bidirectional attention distillation and enhanced text embedding, 3) Text-controlled world event generation.

Result: The framework enables realistic, interactive, continuous world generation from single images or text prompts with keyboard-based exploration capabilities.

Conclusion: Yume-1.5 addresses key limitations of current diffusion-based world generation methods by providing real-time performance, text control, and interactive exploration in a unified framework.

Abstract: Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.

[150] Learning Association via Track-Detection Matching for Multi-Object Tracking

Momir Adžemović

Main category: cs.CV

TL;DR: TDLP is a tracking-by-detection method that learns association via link prediction between tracks and detections, outperforming both heuristic-based and end-to-end approaches while maintaining computational efficiency.

DetailsMotivation: Current multi-object tracking has two problematic paradigms: tracking-by-detection methods rely on handcrafted heuristics, while end-to-end approaches are computationally expensive. There's a need for a method that learns association from data while remaining efficient.

Method: TDLP performs per-frame association via link prediction between tracks and detections, predicting the correct continuation of each track at every frame. It’s designed primarily for geometric features like bounding boxes, with optional incorporation of pose and appearance cues.
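
Per-frame link prediction reduces to scoring every track-detection pair and solving a bipartite matching; in this sketch a cosine similarity stands in for TDLP's learned link predictor, whose interface is assumed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_emb, det_emb, score_fn, min_score=0.5):
    """Per-frame association as link prediction: score every track-detection
    pair, then solve the resulting bipartite matching.

    track_emb: (T, D), det_emb: (N, D); returns list of (track, det) links.
    """
    S = np.array([[score_fn(t, d) for d in det_emb] for t in track_emb])
    rows, cols = linear_sum_assignment(-S)       # maximize total link score
    return [(r, c) for r, c in zip(rows, cols) if S[r, c] >= min_score]

# toy usage with cosine similarity standing in for the learned predictor
rng = np.random.default_rng(0)
tracks, dets = rng.normal(size=(3, 8)), rng.normal(size=(4, 8))
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(associate(tracks, dets, cos, min_score=-1.0))
```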

Result: Extensive experiments show TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods on multiple benchmarks.

Conclusion: Link prediction-based association is more effective than metric learning-based approaches, especially for handling heterogeneous features like detection bounding boxes. TDLP provides an efficient, modular solution that learns association from data without handcrafted rules.

Abstract: Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.

[151] ProEdit: Inversion-based Editing From Prompts Done Right

Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng

Main category: cs.CV

TL;DR: ProEdit improves inversion-based visual editing by addressing over-reliance on source information through KV-mix (attention mixing) and Latents-Shift (latent perturbation), achieving better attribute changes while maintaining consistency.

DetailsMotivation: Existing inversion-based editing methods overly rely on source image information during sampling, which negatively affects target edits - they often fail to properly change attributes like pose, number, or color as instructed.

Method: Two-pronged approach: 1) KV-mix mixes KV features of source and target in edited regions to mitigate source influence while maintaining background consistency; 2) Latents-Shift perturbs the edited region of source latent to eliminate inverted latent influence on sampling.
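
KV-mix can be sketched as a masked blend of source and target key/value tokens, assuming a binary edit-region mask and a mixing weight beta (the per-layer, per-timestep schedule is omitted):

```python
import torch

def kv_mix(kv_src, kv_tgt, edit_mask, beta=0.7):
    """Blend source and target attention KV features only inside the edited
    region, keeping the background purely source-driven for consistency.

    kv_src, kv_tgt: (B, N, D) key or value tokens
    edit_mask:      (B, N, 1) binary mask, 1 inside the edited region
    """
    mixed = beta * kv_tgt + (1 - beta) * kv_src   # weaken source in edit region
    return edit_mask * mixed + (1 - edit_mask) * kv_src
```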

Result: Achieves state-of-the-art performance on several image and video editing benchmarks; plug-and-play design seamlessly integrates with existing inversion and editing methods like RF-Solver, FireFlow and UniEdit.

Conclusion: ProEdit effectively addresses the over-reliance problem in inversion-based editing through attention and latent modifications, enabling better attribute changes while maintaining editing consistency.

Abstract: Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject’s attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue both in the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.

[152] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang

Main category: cs.CV

TL;DR: Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using bidirectional where-to-look signals to shape perception during training, enhancing visual reasoning without external tools.

DetailsMotivation: Current VLMs rely on intermediate visual cues via external tools or latent tokens, but these approaches miss fine-grained visual evidence, have poor cross-domain generalization, and incur high inference costs.

Method: BiPS transforms question-conditioned masked views into bidirectional signals: 1) KL-consistency between original image and evidence-preserving view for complete pixel coverage, 2) KL-separation between original and evidence-ablated view to prevent text-only shortcuts.
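
A minimal rendering of the two shaping terms: a KL consistency pull on the evidence-preserving view and a hinged KL separation push on the evidence-ablated view. The margin and the equal weighting are our assumptions:

```python
import torch
import torch.nn.functional as F

def bips_losses(logits_full, logits_keep, logits_ablate, margin=1.0):
    """Consistency: predictions on the evidence-preserving view should match
    the full image. Separation: predictions on the evidence-ablated view
    should diverge from the full image by at least `margin`."""
    p_full = F.log_softmax(logits_full, dim=-1)
    kl_keep = F.kl_div(F.log_softmax(logits_keep, dim=-1), p_full,
                       log_target=True, reduction="batchmean")
    kl_abl = F.kl_div(F.log_softmax(logits_ablate, dim=-1), p_full,
                      log_target=True, reduction="batchmean")
    return kl_keep + F.relu(margin - kl_abl)
```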

Result: Boosts Qwen2.5-VL-7B by 8.2% average across eight benchmarks, shows strong out-of-domain generalization to unseen datasets and image types.

Conclusion: BiPS effectively shapes visual perception during training, improving fine-grained visual reasoning and generalization without external tools or high inference costs.

Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.

[153] Co-Teaching for Unsupervised Domain Adaptation and Expansion

Hailan Lin, Qijie Wei, Kaibin Tian, Ruixiang Zhao, Xirong Li

Main category: cs.CV

TL;DR: The paper introduces Co-Teaching (CT) to address cross-domain visual ambiguity in Unsupervised Domain Expansion, using dual-teacher knowledge distillation and mixup techniques to improve model performance on both source and target domains.

DetailsMotivation: The paper challenges the assumption that domain-specific models always handle their own domain well, revealing cross-domain visual ambiguity where samples from one domain can be visually similar to another domain. These ambiguous samples are typically minority cases that get overlooked by domain-specific models but could be better handled by models from the other domain.

Method: Proposes Co-Teaching (CT) with two components: 1) kdCT (knowledge distillation based CT) uses a dual-teacher architecture where each teacher specializes in one domain, enhancing the student network’s ability to handle cross-domain ambiguity through knowledge distillation. 2) miCT (mixup based CT) further enhances the student’s generalization ability using mixup techniques.
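
The kdCT idea can be condensed into a distillation loss where the student matches a per-sample mixture of the two domain teachers; the mixing weight w_src (how much each sample trusts the source teacher) is an assumed input:

```python
import torch
import torch.nn.functional as F

def kdct_loss(student_logits, t_src_logits, t_tgt_logits, w_src, T=2.0):
    """Dual-teacher distillation sketch: cross-domain-ambiguous samples can
    lean on the other domain's teacher via the per-sample weight w_src (B, 1).
    The weighting scheme is an assumption, not the paper's exact rule."""
    with torch.no_grad():
        p = w_src * F.softmax(t_src_logits / T, -1) \
            + (1 - w_src) * F.softmax(t_tgt_logits / T, -1)
    return F.kl_div(F.log_softmax(student_logits / T, -1), p,
                    reduction="batchmean") * T * T
```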

Result: Extensive experiments on image classification and driving-scene segmentation demonstrate the viability of CT for Unsupervised Domain Expansion, showing improved performance on both source and target domains while addressing cross-domain visual ambiguity.

Conclusion: The proposed Co-Teaching framework effectively addresses cross-domain visual ambiguity in Unsupervised Domain Expansion, enabling models to maintain source domain performance while adapting to target domains, with applications in both image classification and segmentation tasks.

Abstract: Unsupervised Domain Adaptation (UDA) essentially trades a model’s performance on a source domain for improving its performance on a target domain. To overcome this, Unsupervised Domain Expansion (UDE) has been introduced, which adapts the model to the target domain while preserving its performance in the source domain. In both UDA and UDE, a model tailored to a given domain is assumed to well handle samples from the given domain. We question the assumption by reporting the existence of cross-domain visual ambiguity: Due to the unclear boundary between the two domains, samples from one domain can be visually close to the other domain. Such sorts of samples are typically in the minority in their host domain, so they tend to be overlooked by the domain-specific model, but can be better handled by a model from the other domain. We exploit this finding by proposing Co-Teaching (CT), which is instantiated with knowledge distillation based CT (kdCT) plus mixup based CT (miCT). Specifically, kdCT leverages a dual-teacher architecture to enhance the student network’s ability to handle cross-domain ambiguity. Meanwhile, miCT further enhances the generalization ability of the student. Extensive experiments on image classification and driving-scene segmentation show the viability of CT for UDE.

[154] Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond

Jiahang Zhang, Lilang Lin, Shuai Yang, Jiaying Liu

Main category: cs.CV

TL;DR: This paper provides the first comprehensive survey of self-supervised learning for skeleton-based action representation, categorizing approaches into context-based, generative, and contrastive learning, and proposes a novel multi-granularity SSL method that achieves superior generalization across multiple downstream tasks.

DetailsMotivation: Skeleton-based action understanding faces unique challenges due to sparse spatial structures, diverse representation forms, lack of background clues, and temporal dimensions. While SSL has shown promise for skeleton data, there's no systematic review of existing approaches, and most current methods focus on single paradigms, single-level representations, and only action recognition evaluation, leaving generalization capabilities underexplored.

Method: The paper conducts a comprehensive survey of skeleton-based SSL methods, categorizing them into context-based, generative learning, and contrastive learning approaches. It then proposes a novel SSL method that integrates versatile representation learning objectives of different granularity to boost generalization capacity across multiple skeleton downstream tasks.

Result: Extensive experiments on three large-scale datasets demonstrate that the proposed multi-granularity SSL method achieves superior generalization performance on various downstream tasks including recognition, retrieval, detection, and few-shot learning, outperforming existing single-paradigm approaches.

Conclusion: This work provides the first systematic review of skeleton-based SSL, reveals limitations of current single-paradigm approaches, and introduces an effective multi-granularity SSL method that significantly improves generalization across diverse skeleton action understanding tasks, opening new directions for future research in this field.

Abstract: Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Different from the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension, presenting new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made for skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on the future possible directions. Remarkably, our investigation demonstrates that most SSL works rely on the single paradigm, learning representations of a single level, and are evaluated on the action recognition task solely, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeleton is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments under three large-scale datasets demonstrate our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.

[155] Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization

Mengshi Qi, Hongwei Ji, Wulian Yun, Xianlin Zhang, Huadong Ma

Main category: cs.CV

TL;DR: Proposes a few-shot temporal action localization method using Chain-of-Evidence multimodal reasoning that combines visual and textual information to improve localization performance.

DetailsMotivation: Existing few-shot TAL methods focus only on video-level information and neglect textual information, which could provide valuable semantic support for action localization. There's a need to reduce dependence on large annotated datasets while improving localization accuracy.

Method: 1) Novel few-shot learning framework capturing action commonalities and variations with semantic-aware text-visual alignment module. 2) Chain-of-Evidence (CoE) reasoning method that progressively guides VLM and LLM to generate CoE text descriptions capturing action variance better than visual features alone.

Result: Significantly outperforms existing methods on ActivityNet1.3, THUMOS14, and a newly collected Human-related Anomaly Localization Dataset in both single-instance and multi-instance scenarios.

Conclusion: The proposed multimodal approach combining visual and textual information through Chain-of-Evidence reasoning effectively improves few-shot temporal action localization performance by leveraging semantic information from text descriptions.

Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the action localization task. To address these issues, in this work, we propose a new few-shot temporal action localization method by Chain-of-Evidence multimodal reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level, we design a Chain-of-Evidence (CoE) reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoE text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3, THUMOS14 and our newly collected Human-related Anomaly Localization Dataset. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. Our source code and data are available at https://github.com/MICLAB-BUPT/VAL-VLM.

[156] Computerized Assessment of Motor Imitation for Distinguishing Autism in Video (CAMI-2DNet)

Kaleab A. Kinfu, Carolina Pacheco, Alice D. Sperry, Deana Crocetti, Bahar Tunçgenç, Stewart H. Mostofsky, René Vidal

Main category: cs.CV

TL;DR: CAMI-2DNet: A scalable, interpretable deep learning approach for assessing motor imitation in autism using video data without data normalization, cleaning, or human annotations.

DetailsMotivation: Motor imitation impairments in autism spectrum conditions (ASCs) could serve as a phenotype for addressing autism heterogeneity. Traditional assessment methods are subjective and labor-intensive, while existing computerized methods (CAMI-3D and CAMI-2D) still require data normalization, cleaning, and human annotations.

Method: CAMI-2DNet uses an encoder-decoder architecture to map videos to motion encodings disentangled from nuisance factors (body shape, camera views). It learns from synthetic data generated by motion retargeting of virtual characters (reshuffling motion, body shape, camera views) and real participant data. Similarity scores between motion encodings assess imitation quality.
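
Once the encoder is trained, scoring is a comparison of motion encodings; here frame-wise cosine similarity averaged over time stands in for the paper's similarity score, with a hypothetical encoder interface:

```python
import torch
import torch.nn.functional as F

def imitation_score(encoder, clip_participant, clip_actor):
    """Compare disentangled motion encodings of participant and actor.

    encoder: maps a video clip to a (T, D) motion encoding, assumed to be
    disentangled from body shape and camera view as in CAMI-2DNet.
    """
    z_p = encoder(clip_participant)        # (T, D) motion encoding
    z_a = encoder(clip_actor)              # (T, D), same temporal length
    return F.cosine_similarity(z_p, z_a, dim=-1).mean().item()
```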

Result: CAMI-2DNet shows strong correlation with human scores and outperforms CAMI-2D in discriminating ASC vs neurotypical children. It performs comparably to CAMI-3D while being more practical by operating directly on video data without data normalization or human annotations.

Conclusion: CAMI-2DNet provides a scalable, interpretable solution for motor imitation assessment in autism that eliminates the need for labor-intensive data processing and human annotations, offering greater practicality than existing methods while maintaining strong performance.

Abstract: Motor imitation impairments are commonly reported in individuals with autism spectrum conditions (ASCs), suggesting that motor imitation could be used as a phenotype for addressing autism heterogeneity. Traditional methods for assessing motor imitation are subjective, labor-intensive, and require extensive human training. Modern Computerized Assessment of Motor Imitation (CAMI) methods, such as CAMI-3D for motion capture data and CAMI-2D for video data, are less subjective. However, they rely on labor-intensive data normalization and cleaning techniques, and human annotations for algorithm training. To address these challenges, we propose CAMI-2DNet, a scalable and interpretable deep learning-based approach to motor imitation assessment in video data, which eliminates the need for data normalization, cleaning and annotation. CAMI-2DNet uses an encoder-decoder architecture to map a video to a motion encoding that is disentangled from nuisance factors such as body shape and camera views. To learn a disentangled representation, we employ synthetic data generated by motion retargeting of virtual characters through the reshuffling of motion, body shape, and camera views, as well as real participant data. To automatically assess how well an individual imitates an actor, we compute a similarity score between their motion encodings, and use it to discriminate individuals with ASCs from neurotypical (NT) individuals. Our comparative analysis demonstrates that CAMI-2DNet has a strong correlation with human scores while outperforming CAMI-2D in discriminating ASC vs NT children. Moreover, CAMI-2DNet performs comparably to CAMI-3D while offering greater practicality by operating directly on video data and without the need for ad-hoc data normalization and human annotations.

[157] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang

Main category: cs.CV

TL;DR: CP-GS is a framework that personalizes 3D Gaussian Splatting scenes from a single reference image by progressively propagating appearance to novel views, mitigating viewpoint bias through pre-trained image-to-3D generation and iterative LoRA fine-tuning.

DetailsMotivation: Existing methods for personalizing 3D scenes from a single reference image suffer from viewpoint bias due to limited perspective, leading to inconsistent results across different views and poor referential consistency with the input image.

Method: CP-GS integrates pre-trained image-to-3D generation with iterative LoRA fine-tuning to extract and extend reference appearance. It uses a view-consistent generation process guided by geometric cues to produce faithful multi-view guidance images and personalized 3DGS outputs.

Result: Extensive experiments on real-world scenes show CP-GS effectively mitigates viewpoint bias and achieves high-quality personalization that significantly outperforms existing methods.

Conclusion: CP-GS provides an effective framework for single-image 3D scene personalization that addresses viewpoint bias through progressive appearance propagation and view-consistent generation, enabling better multi-view and referential consistency.

Abstract: Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods.

[158] Visual Explanation via Similar Feature Activation for Metric Learning

Yi Liao, Ugochukwu Ejike Akpudo, Jue Zhang, Yongsheng Gao, Jun Zhou, Wenyi Zeng, Weichuan Zhang

Main category: cs.CV

TL;DR: SFAM is a visual explanation method for metric learning models, which cannot use traditional CAM methods; it scores channel importance from the similarity between image embeddings.

DetailsMotivation: Existing visual explanation methods like CAM, Grad-CAM, and Relevance-CAM only work with softmax-based CNNs that have fully connected classifiers, but cannot be applied to metric learning models which lack such classifiers.

Method: Proposes Similar Feature Activation Map (SFAM) with channel-wise contribution importance score (CIS) that measures feature importance based on similarity between image embeddings, then linearly combines importance weights with CNN feature maps.
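
The CIS idea is concrete enough to sketch. Below is a minimal PyTorch illustration of one plausible reading: take pooled embeddings from the last conv layer, read the per-channel terms of their cosine similarity as importance weights, and linearly combine those weights with the feature maps. The pooling choice, shapes, and normalization are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sfam(feat_a, feat_b):
    """Minimal SFAM-style map (assumed formulation, not the paper's exact one).

    feat_a, feat_b: (C, H, W) last-conv feature maps for the query image and
    the compared image; embeddings are taken as global-average-pooled features.
    """
    emb_a = feat_a.mean(dim=(1, 2))  # (C,) pooled embedding
    emb_b = feat_b.mean(dim=(1, 2))
    # Cosine similarity decomposes into per-channel terms:
    #   sim = sum_c emb_a[c] * emb_b[c] / (||emb_a|| * ||emb_b||)
    # so each term can serve as a channel-wise contribution importance score.
    cis = emb_a * emb_b / (emb_a.norm() * emb_b.norm())  # (C,)
    # Explanation map: linear combination of feature maps, rectified and scaled.
    heat = F.relu((cis[:, None, None] * feat_a).sum(dim=0))  # (H, W)
    return heat / (heat.max() + 1e-8)

# Toy usage: upsample the (7, 7) map to image resolution for visualization.
heatmap = sfam(torch.rand(512, 7, 7), torch.rand(512, 7, 7))
```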

Result: Quantitative and qualitative experiments demonstrate that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity metrics.

Conclusion: SFAM successfully addresses the limitation of existing CAM methods by enabling visual explanations for metric learning models without fully connected classifiers.

Abstract: Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully connected layer as the classifier for decision-making. However, these methods cannot be directly applied to metric learning models, as such models lack a fully connected layer functioning as a classifier. To address this limitation, we propose a novel visual explanation method termed Similar Feature Activation Map (SFAM). This method introduces the channel-wise contribution importance score (CIS) to measure feature importance, derived from the similarity measurement between two image embeddings. The explanation map is constructed by linearly combining the proposed importance weights with the feature map from a CNN model. Quantitative and qualitative experiments show that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity as the similarity metric.

[159] Shared & Domain Self-Adaptive Experts with Frequency-Aware Discrimination for Continual Test-Time Adaptation

JianChao Zhao, Chenhao Ding, Songlin Dong, Jiangyang Li, Qiang Wang, Yuhang He, Yihong Gong

Main category: cs.CV

TL;DR: A frequency-aware shared and self-adaptive expert framework for Continual Test-Time Adaptation that balances adaptation to evolving domains while retaining domain knowledge, with a new CRS benchmark for realistic evaluation.

DetailsMotivation: Existing CTTA methods struggle to balance adaptation to new domains and preventing forgetting of previously learned domains, leading to decreased efficiency and stability in real-world scenarios where domains reappear periodically.

Method: Proposes a dual-branch expert architecture with general feature extraction and dynamic domain-specific modeling, plus an online Frequency-aware Domain Discriminator (FDD) that uses low-frequency image signals for domain shift detection to guide expert resource allocation.
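
To make the FDD component concrete, here is a small sketch of how a low-frequency signature could drive online domain-shift detection. The signature definition, momentum update, threshold, and cutoff are illustrative assumptions rather than the paper's design.

```python
import torch

class FrequencyDomainDiscriminator:
    """Sketch of an online frequency-aware domain-shift detector: compare a
    low-frequency signature of the incoming batch against a running estimate
    of the current domain (all hyperparameters are illustrative)."""
    def __init__(self, momentum=0.99, threshold=0.5, cutoff=8):
        self.m, self.t, self.cutoff = momentum, threshold, cutoff
        self.running = None  # running low-frequency signature

    def signature(self, images):  # images: (B, C, H, W)
        spec = torch.fft.fftshift(torch.fft.fft2(images).abs(), dim=(-2, -1))
        H, W = spec.shape[-2:]
        band = spec[..., H // 2 - self.cutoff:H // 2 + self.cutoff,
                         W // 2 - self.cutoff:W // 2 + self.cutoff]
        return band.mean(dim=(0, 2, 3))  # (C,) low-frequency signature

    def shift_detected(self, images):
        sig = self.signature(images)
        if self.running is None:
            self.running = sig
            return False
        drift = (sig - self.running).norm() / (self.running.norm() + 1e-8)
        self.running = self.m * self.running + (1 - self.m) * sig
        return drift.item() > self.t  # True -> route to a different expert
```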

Result: Outperforms existing approaches on both classification and segmentation CTTA tasks under standard and new Continual Repeated Shifts (CRS) benchmark settings, with ablations confirming effectiveness and robustness.

Conclusion: The frequency-aware shared and self-adaptive expert framework effectively addresses the adaptation-forgetting trade-off in CTTA, enabling stable and realistic adaptation to evolving domains while retaining reusable knowledge.

Abstract: This paper focuses on the Continual Test-Time Adaptation (CTTA) task, aiming to enable an agent to continuously adapt to evolving target domains while retaining previously acquired domain knowledge for effective reuse when those domains reappear. Existing shared-parameter paradigms struggle to balance adaptation and forgetting, leading to decreased efficiency and stability. To address this, we propose a frequency-aware shared and self-adaptive expert framework, consisting of two key components: (i) a dual-branch expert architecture that extracts general features and dynamically models domain-specific representations, effectively reducing cross-domain interference and repetitive learning cost; and (ii) an online Frequency-aware Domain Discriminator (FDD), which leverages the robustness of low-frequency image signals for online domain shift detection, guiding dynamic allocation of expert resources for more stable and realistic adaptation. Additionally, we introduce a Continual Repeated Shifts (CRS) benchmark to simulate periodic domain changes for more realistic evaluation. Experimental results show that our method consistently outperforms existing approaches on both classification and segmentation CTTA tasks under standard and CRS settings, with ablations and visualizations confirming its effectiveness and robustness. Our code is available at https://github.com/ZJC25127/Domain-Self-Adaptive-CTTA.git.

[160] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang

Main category: cs.CV

TL;DR: SIFThinker is a spatially-aware multimodal framework that uses depth-enhanced bounding boxes and natural language to iteratively correct attention and focus on relevant image regions, improving spatial understanding and fine-grained perception in MLLMs.

DetailsMotivation: Current MLLMs struggle with complex visual tasks like spatial understanding and fine-grained perception. Existing methods fail to leverage attention correction with spatial cues to iteratively refine focus on prompt-relevant regions.

Method: SIFThinker uses a “think-with-images” framework with depth-enhanced bounding boxes interleaved with natural language. It employs a reverse-expansion-forward-inference strategy to generate image-text chains of thought for supervision, creating the SIF-50K dataset. Also proposes GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline.

Result: Extensive experiments show SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception while maintaining strong general capabilities.

Conclusion: SIFThinker effectively addresses MLLM limitations in complex visual tasks by enabling dynamic attention correction and region focusing through spatial awareness, demonstrating the value of integrating depth information and iterative refinement in visual reasoning.

Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.

[161] S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, Igor Gilitschenski

Main category: cs.CV

TL;DR: S²Edit is a diffusion-based method for personalized image editing with precise semantic and spatial control, addressing identity preservation and localized editing issues in face editing tasks.

DetailsMotivation: Existing diffusion methods for image editing often lose identity information and high-frequency details, or alter irrelevant regions due to concept entanglement, especially in fine-grained tasks like face editing.

Method: Fine-tunes a pre-trained text-to-image diffusion model to embed identity into a learnable text token with orthogonality constraints for attribute disentanglement, and uses object masks to guide cross-attention maps for spatial localization.
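
Two pieces of this method lend themselves to short sketches: the orthogonality constraint between the identity token and the edited attributes, and mask-guided cross-attention. Both snippets below are hedged approximations; the shapes and the exact penalty form are assumptions.

```python
import torch
import torch.nn.functional as F

def identity_orthogonality_loss(id_token, attr_tokens):
    """Push the learnable identity embedding away from the text embeddings
    of the attributes being edited (assumed penalty form).
    id_token: (D,), attr_tokens: (K, D)."""
    cos = F.cosine_similarity(id_token.unsqueeze(0), attr_tokens, dim=-1)  # (K,)
    return (cos ** 2).mean()

def masked_cross_attention(attn_probs, object_mask):
    """Restrict the identity token's cross-attention to the object region.
    attn_probs: (heads, H*W) attention over image patches for the identity
    token; object_mask: (H*W,) binary mask (shapes are illustrative)."""
    masked = attn_probs * object_mask
    return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)
```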

Result: Extensive experiments show S²Edit outperforms state-of-the-art methods both quantitatively and qualitatively, enabling faithful identity preservation and localized editing.

Conclusion: S²Edit provides an effective solution for personalized image editing with precise semantic and spatial control, demonstrated through applications like makeup transfer.

Abstract: Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity with semantically disentangled and spatially focused identity token learned. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.

[162] MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images

Chengyu Wang, Isabella Bennett, Henry Scott, Liang Zhang, Mei Chen, Hao Li, Rui Zhao

Main category: cs.CV

TL;DR: MatDecompSDF is a framework that jointly recovers 3D shapes and decomposes physically-based material properties from multi-view images using neural SDF for geometry, neural fields for materials, and MLP for lighting, with differentiable rendering and physical priors.

DetailsMotivation: The core challenge of inverse rendering is the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Existing methods struggle to achieve robust decomposition of physically-based material properties while maintaining geometric accuracy.

Method: Joint optimization of three neural components: neural SDF for geometry, spatially-varying neural field for PBR material parameters (albedo, roughness, metallic), and MLP-based model for environmental lighting. Uses physically-based differentiable rendering layer with carefully designed physical priors and geometric regularizations including material smoothness loss and Eikonal loss.
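
The Eikonal loss is a standard SDF regularizer, and the material smoothness prior admits a simple perturbation-based form. The sketches below show one way to write both in PyTorch; the smoothness formulation in particular is our guess at the paper's intent, not its reported loss.

```python
import torch

def eikonal_loss(sdf_net, points):
    """Standard Eikonal regularizer: drives ||grad f(x)|| toward 1 so the
    network stays a valid signed distance function. points: (N, 3)."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_net(points)  # (N, 1)
    grad, = torch.autograd.grad(sdf.sum(), points, create_graph=True)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

def material_smoothness_loss(mat_net, points, eps=1e-2):
    """One plausible material smoothness prior: penalize changes in the
    predicted PBR parameters under small spatial perturbations."""
    jitter = points + eps * torch.randn_like(points)
    return (mat_net(points) - mat_net(jitter)).abs().mean()
```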

Result: Extensive experiments on synthetic and real-world datasets (DTU) demonstrate superior performance over state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Produces editable and relightable assets compatible with standard graphics pipelines.

Conclusion: MatDecompSDF effectively addresses the ill-posed inverse rendering problem, achieving robust decomposition of geometry and materials while producing practical, editable assets for digital content creation with validated utility in graphics pipelines.

Abstract: We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.

[163] Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

Léa Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang, Santiago Munoz

Main category: cs.CV

TL;DR: A novel framework fuses Vision Foundation Models with Large Language Models for advanced video reasoning tasks like causal analysis and future prediction, using a Q-Former inspired fusion module and two-stage training.

DetailsMotivation: Current video understanding models are limited to basic recognition tasks and lack commonsense world knowledge needed for high-level cognitive tasks like causal reasoning and future prediction.

Method: Proposes a framework that fuses a Vision Foundation Model (VFM) for visual perception with an LLM as a knowledge-driven reasoning core. Uses a Q-Former inspired fusion module to distill spatiotemporal and object-centric features into language-aligned representations. Employs two-stage training: large-scale video-text alignment pre-training followed by targeted instruction fine-tuning on curated reasoning datasets.
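
The fusion module is described as Q-Former-inspired: a fixed set of learned queries cross-attends to visual features and emits a short, language-aligned token sequence. A minimal sketch follows; the dimensions, head count, and single-layer structure are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class QFormerFusion(nn.Module):
    """Minimal Q-Former-style fusion sketch: learned queries distill video
    features into a few tokens projected into the LLM's embedding space."""
    def __init__(self, num_queries=32, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):  # vis_feats: (B, T, vis_dim)
        q = self.queries.expand(vis_feats.size(0), -1, -1)  # (B, Q, vis_dim)
        out, _ = self.attn(q, vis_feats, vis_feats)          # cross-attention
        return self.proj(out)  # (B, Q, llm_dim): tokens fed to the LLM
```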

Result: Achieves state-of-the-art performance on multiple challenging benchmarks, demonstrates remarkable zero-shot generalization to unseen reasoning tasks, and ablation studies validate the importance of each architectural component.

Conclusion: The work advances machine perception from simple recognition towards genuine cognitive understanding, enabling more intelligent AI systems for robotics, human-computer interaction, and other applications.

Abstract: Current video understanding models excel at recognizing “what” is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.

[164] AlignFreeNet: Is Cross-Modal Pre-Alignment Necessary? An End-to-End Alignment-Free Lightweight Network for Visible-Infrared Object Detection

Dingkun Zhu, Haote Zhang, Lipeng Gu, Wuzhou Quan, Fu Lee Wang, Honghui Fan, Jiali Tang, Haoran Xie, Xiaoping Zhang, Mingqiang Wei

Main category: cs.CV

TL;DR: AlignFreeNet: An alignment-free fusion network for visible-infrared object detection that avoids explicit alignment modules, using variation-guided compensation and frequency-guided fusion to handle severe cross-modal misalignments.

DetailsMotivation: Existing visible-infrared object detection methods use explicit alignment modules (pixel- or feature-level) that struggle with severe or mixed misalignments, introducing noise and limiting detection performance.

Method: Proposes AlignFreeNet with two core modules: 1) Variation-guided cross-modal compensation (VCC) adaptively feeds compensated information from cross-modal discrepancies back into each modality, and 2) Frequency-guided cross-modal fusion (FCF) suppresses task-irrelevant redundancy via frequency-domain gating.
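
Frequency-domain gating as used in FCF can be sketched compactly: transform both modalities' features with an FFT, predict a gate from their magnitude spectra, and mix them in the spectral domain. The layer below is a hedged approximation of that idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrequencyGatedFusion(nn.Module):
    """Sketch of frequency-guided fusion: a learned gate over the Fourier
    spectra of visible/infrared features can suppress task-irrelevant
    frequency bands (gate design is an assumption)."""
    def __init__(self, channels):
        super().__init__()
        # The gate consumes the concatenated magnitude spectra of both modalities.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_vis, feat_ir):  # both (B, C, H, W)
        spec_v = torch.fft.rfft2(feat_vis)
        spec_i = torch.fft.rfft2(feat_ir)
        mag = torch.cat([spec_v.abs(), spec_i.abs()], dim=1)
        g = self.gate(mag)                     # per-frequency, per-channel gate
        fused = g * spec_v + (1 - g) * spec_i  # gated spectral mix
        return torch.fft.irfft2(fused, s=feat_vis.shape[-2:])
```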

Result: Extensive evaluations on DVTOD, M3FD, and DroneVehicle datasets show state-of-the-art performance under severe mixed misalignment conditions, demonstrating robustness and generalization.

Conclusion: AlignFreeNet’s alignment-free fusion paradigm effectively mitigates cross-modal misalignments without introducing noise, outperforming existing alignment-based methods in challenging visible-infrared object detection scenarios.

Abstract: Cross-modal misalignments, such as spatial offsets, resolution discrepancies, and semantic deficiencies, frequently occur in visible-infrared object detection (VI-OD). To mitigate this, existing methods are typically adapted into an alignment-based fusion paradigm, in which an explicit pixel- or feature-level alignment module is inserted before cross-modal fusion. However, pixel-level alignment struggles to cope with severe or mixed misalignments, whereas feature-level alignment often introduces undesirable noise into fused representations under such conditions, ultimately limiting detection performance. In this paper, we propose a novel alignment-free network (AlignFreeNet) for VI-OD. Differing from prior methods, AlignFreeNet abandons any explicit alignment and instead adopts an alignment-free fusion paradigm. Specifically, AlignFreeNet comprises two core modules: variation-guided cross-modal compensation (VCC) and frequency-guided cross-modal fusion (FCF). VCC adaptively feeds the compensated information derived from cross-modal discrepancies back into each modality, enhancing visible and infrared representations without the noise caused by explicit alignment. FCF achieves robust cross-modal fusion by suppressing task-irrelevant redundancy via frequency-domain gating, effectively mitigating noise introduced in the process. Moreover, VCC and FCF jointly exploit low- and high-frequency cues to preserve foreground contours in fused representations, effectively mitigating cross-modal blending caused by severe mixed misalignments. Extensive evaluations on DVTOD, M3FD, and DroneVehicle demonstrate that our AlignFreeNet achieves state-of-the-art performance under severe mixed misalignment conditions, highlighting its robustness and generalization.

[165] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani

Main category: cs.CV

TL;DR: ChatENV is an interactive vision-language model that combines satellite imagery with environmental sensor data for temporal and scenario-based reasoning about environmental changes.

DetailsMotivation: Current vision language models overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning for environmental monitoring applications.

Method: Created a 177k-image dataset with 152k temporal pairs across 62 land-use classes in 197 countries with sensor metadata; annotated using GPT4o and Gemini 2.0 for diversity; fine-tuned Qwen-2.5-VL using LoRA adapters for chat capabilities.
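
Step (iii), LoRA fine-tuning of Qwen-2.5-VL, follows the standard PEFT recipe. A minimal sketch, assuming a transformers release with Qwen2.5-VL support; the rank, alpha, and target modules below are our guesses, not the paper's reported settings.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```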

Result: Achieves strong performance in temporal and “what-if” reasoning (BERTF1 0.902), rivals or outperforms state-of-the-art temporal models, and supports interactive scenario-based analysis.

Conclusion: ChatENV positions as a powerful tool for grounded, sensor-aware environmental monitoring by integrating satellite imagery with real-world sensor data for interactive reasoning.

Abstract: Understanding environmental changes from remote sensing imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and “what-if” reasoning (e.g., BERTF1 0.902) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.

[166] OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks

Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Zanting Ye, Minjie Ju, PeterYM Woo, Yixuan Yuan

Main category: cs.CV

TL;DR: OmniBrainBench is a comprehensive multimodal VQA benchmark for brain imaging analysis, covering 15 imaging modalities with 9,527 VQA pairs, revealing significant gaps between MLLMs and physicians.

DetailsMotivation: Current brain imaging VQA benchmarks are limited in modality coverage and pathological granularity, hindering comprehensive assessment of MLLMs across the full clinical continuum.

Method: Created OmniBrainBench with 15 brain imaging modalities from 30 medical sources, 9,527 validated VQA pairs, 31,706 images, simulating clinical workflows with 15 multi-stage clinical tasks validated by radiologists.

Result: Evaluation of 24 SOTA models shows proprietary MLLMs like GPT-5 (63.37%) outperform others but lag far behind physicians (91.35%); medical MLLMs show wide variance, open-source models trail overall but excel in specific tasks, and all fail at complex preoperative reasoning.

Conclusion: OmniBrainBench establishes a new standard for assessing MLLMs in brain imaging analysis, highlighting significant visual-to-clinical gaps and the need for improved clinical reasoning capabilities.

Abstract: Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs like GPT-5 (63.37%) outperform others yet lag far behind physicians (91.35%), while medical MLLMs show wide variance in closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all models fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard for assessing MLLMs in brain imaging analysis, highlighting the gap between models and physicians. We publicly release our benchmark at link.

[167] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation

Chenfan Qu, Yiwu Zhong, Huiguo He, Bin Li, Lianwen Jin

Main category: cs.CV

TL;DR: Novel approach for image manipulation localization using web data to address annotation scarcity, featuring auto-annotation framework, quality filtering, large dataset creation, and new model achieving SOTA performance.

DetailsMotivation: Image manipulation poses security risks, but accurate localization is hindered by severe scarcity of high-quality annotated data, which is labor-intensive to create manually.

Method: Proposes CAAAv2 auto-annotation framework with category-aware, prior-feature-denoising paradigm; QES metric for filtering low-quality annotations; constructs MIMLv2 dataset (246K images); introduces Object Jitter for artifact generation; develops Web-IML model for web-scale supervision.

Result: Creates dataset 120x larger than existing handcrafted datasets; Web-IML achieves 31% performance gain and surpasses previous SOTA SparseViT by 21.6 average IoU points on real-world forgery benchmarks.

Conclusion: The approach effectively mitigates data scarcity problem, significantly improves manipulation localization performance, and demonstrates the power of leveraging web data with quality-controlled auto-annotation for computer vision tasks.

Abstract: Images manipulated by image editing tools can mislead viewers and pose significant risks to public security. However, accurately localizing manipulated image regions remains challenging due to the severe scarcity of high-quality annotated data, which is laborious to create. To address this, we propose a novel approach that mitigates data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce CAAAv2, a novel auto-annotation framework that operates on a category-aware, prior-feature-denoising paradigm that notably reduces task complexity. To further ensure annotation reliability, we propose QES, a novel metric that filters out low-quality annotations. Combining CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120 times larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop Web-IML, a new model designed to effectively leverage web-scale supervision for the task of image manipulation localization. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, our Web-IML achieves a striking performance gain of 31% and surpasses the previous state-of-the-art SparseViT by 21.6 average IoU points. The dataset and code will be released at https://github.com/qcf-568/MIML.

[168] Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu

Main category: cs.CV

TL;DR: V-PAE enables single-step video generation from large diffusion models via adversarial distillation, achieving 100x speedup while maintaining quality.

DetailsMotivation: Current video acceleration methods adapted from image techniques lack single-step distillation capability for large-scale video models and task generalization for conditional downstream tasks like image-to-video generation.

Method: Two-phase framework: (1) Stability priming to align real/generated video distributions, (2) Unified adversarial equilibrium using generator parameters for discriminator backbone to achieve co-evolutionary equilibrium in Gaussian noise space. Preserves video-image subject consistency for conditional tasks.

Result: Outperforms existing acceleration methods by average 5.8% in overall quality score (semantic alignment, temporal coherence, frame quality). Reduces diffusion latency of large-scale models (e.g., Wan2.1-I2V-14B) by 100x while preserving competitive performance.

Conclusion: V-PAE successfully bridges the gap for single-step distillation of large-scale video models, enabling efficient high-quality video generation with strong performance on conditional tasks.

Abstract: Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack a single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process to align the distributions of real and generated videos. It improves the stability of single-step adversarial distillation in the following process. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For the conditional tasks, we primarily preserve video-image subject consistency, which is caused by semantic degradation and conditional frame collapse during the distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, including semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of the large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times, while preserving competitive performance.

[169] Dynamic LRP-Based Pruning for CNNs in Data-Scarce Transfer Learning: Suppressing Cascading Accuracy Degradation

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

Main category: cs.CV

TL;DR: Proposed LRP-based dynamic pruning method to compress pre-trained CNNs while preserving accuracy in small-data environments, addressing cascading accuracy degradation in existing LRP pruning approaches.

DetailsMotivation: When using pre-trained CNNs as fixed feature extractors for small-data tasks, many filters remain unused, creating redundancy and inefficiency. Existing LRP-based pruning methods cause cascading accuracy degradation, so a better approach is needed.

Method: Proposed a dynamic pruning method using Layer-wise Relevance Propagation (LRP) that suppresses cascading accuracy degradation. LRP quantifies filter contributions to inference results, enabling pruning of low-relevance filters while preserving task-specific performance.
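
The pruning step itself is simple once per-filter LRP relevance scores are available; the "dynamic" part, on our reading, is tightening the keep ratio gradually and re-scoring relevance between steps rather than pruning once. A hedged sketch with user-supplied hooks (`score_fn`, `apply_mask` are an assumed interface):

```python
import torch

def prune_low_relevance_filters(relevance, keep_ratio):
    """Binary mask keeping the highest-relevance filters.
    relevance: (num_filters,) LRP scores aggregated per conv filter."""
    k = max(1, int(keep_ratio * relevance.numel()))
    mask = torch.zeros_like(relevance)
    mask[relevance.topk(k).indices] = 1.0
    return mask

def dynamic_prune(score_fn, apply_mask, steps=5, final_keep=0.5):
    """Tighten the keep ratio in small steps, re-scoring LRP relevance after
    each step instead of pruning once -- the one-shot setting in which
    cascading accuracy degradation tends to arise."""
    for s in range(1, steps + 1):
        keep = 1.0 - (1.0 - final_keep) * s / steps
        apply_mask(prune_low_relevance_filters(score_fn(), keep))
```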

Result: The proposed method effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods in small-data environments.

Conclusion: The LRP-based dynamic pruning method successfully compresses pre-trained models while maintaining task-specific performance, offering an effective solution for model efficiency in data-scarce scenarios.

Abstract: Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we propose an LRP-based dynamic pruning method that suppresses this cascading accuracy degradation and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that the proposed method effectively mitigates the cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods.

[170] A-TDOM: Active TDOM via On-the-Fly 3DGS

Yiwei Xu, Xiang Wang, Yifei Yu, Wentian Gan, Luca Morelli, Giulio Perda, Xin Wang, Zongqian Zhan, Fabio Remondino

Main category: cs.CV

TL;DR: A-TDOM is a near real-time True Digital Orthophoto Map generation method using incremental 3D Gaussian Splatting optimization to overcome traditional offline pipeline latency and quality issues.

DetailsMotivation: Traditional TDOM generation relies on complex offline photogrammetric pipelines with substantial latency, making it unsuitable for time-critical applications. Quality also suffers from inaccurate camera poses, imperfect DSM, and incorrect occlusion detection.

Method: Uses On-the-Fly 3DGS optimization where each incoming image’s pose and sparse point cloud are computed via On-the-Fly SfM. New regions are incrementally reconstructed using Delaunay triangulated Gaussian sampling and integration, optimized via adaptive training iterations and learning rates. Orthogonal splatting is integrated into the rendering pipeline.

Result: A-TDOM can actively produce updated TDOM outputs immediately after each 3DGS update, enabling near real-time TDOM generation with code available on GitHub.

Conclusion: The proposed A-TDOM method addresses latency and quality issues in traditional TDOM generation by leveraging incremental 3D Gaussian Splatting optimization for near real-time performance.

Abstract: True Digital Orthophoto Map (TDOM), a 2D objective representation of the Earth’s surface, is an essential geospatial product widely used in urban management, city planning, land surveying, and related applications. However, traditional TDOM generation typically relies on a complex offline photogrammetric pipeline, leading to substantial latency and making it unsuitable for time-critical or real-time scenarios. Moreover, the quality of TDOM may deteriorate due to inaccurate camera poses, imperfect Digital Surface Model (DSM), and incorrect occlusions detection. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method built upon On-the-Fly 3DGS (3D Gaussian Splatting) optimization. As each incoming image arrives, its pose and sparse point cloud are computed via On-the-Fly SfM. Newly observed regions are then incrementally reconstructed as additional 3D Gaussians are inserted using a Delaunay triangulated Gaussian sampling and integration and are further optimized via adaptive training iterations and learning rate, especially in previously unseen or coarsely modeled areas. With orthogonal splatting integrated into the rendering pipeline, A-TDOM can actively produce updated TDOM outputs immediately after each 3DGS update. Code is now available at https://github.com/xywjohn/A-TDOM.

[171] SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging

Haijin Zeng, Xuan Lu, Yurong Zhang, Qiangqiang Shen, Guoqing Chao, Li Jiang, Yongyong Chen

Main category: cs.CV

TL;DR: SlowFast-SCI is a dual-speed deep unfolding framework for spectral compressive imaging that combines slow pre-training with fast test-time adaptation to handle new optical configurations efficiently.

DetailsMotivation: Existing deep-unfolding methods for spectral compressive imaging rely on heavy pre-training but lack rapid adaptation to new optical configurations, causing poor performance on out-of-distribution cameras and bespoke spectral setups, while also being computationally expensive.

Method: A dual-speed framework with slow learning (pre-training a priors-based backbone and distilling it into a compact fast-unfolding model) and fast learning (lightweight adaptation modules trained self-supervised at test time via dual-domain loss without retraining the backbone).
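
The fast-learning stage amounts to a test-time adaptation loop: freeze the distilled backbone and update only the adapter modules against a self-supervised loss. Below is a sketch in which the dual-domain loss is approximated as measurement-domain consistency plus an image-domain total-variation prior; that decomposition, `forward_op`, and the hyperparameters are all assumptions rather than the paper's formulation.

```python
import torch

def test_time_adapt(model, adapters, y, forward_op, steps=10, lr=1e-4):
    """Fast-learning sketch for one sample. y: compressive measurement;
    forward_op: a differentiable SCI sensing model (assumed available)."""
    for p in model.parameters():
        p.requires_grad_(False)            # freeze the distilled backbone
    adapter_params = [p for a in adapters for p in a.parameters()]
    for p in adapter_params:
        p.requires_grad_(True)             # adapt only the lightweight modules
    opt = torch.optim.Adam(adapter_params, lr=lr)
    for _ in range(steps):
        x = model(y)                                   # reconstructed spectral cube
        meas_loss = (forward_op(x) - y).pow(2).mean()  # measurement-domain term
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        loss = meas_loss + 1e-3 * tv                   # image-domain term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(y).detach()
```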

Result: Achieves over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and 4x faster adaptation speed.

Conclusion: SlowFast-SCI successfully bridges the gap between offline robustness and on-the-fly adaptation for spectral compressive imaging, offering a modular framework that can integrate with any deep-unfolding network for self-adaptive, field-deployable imaging.

Abstract: Humans learn in two complementary ways: a slow, cumulative process that builds broad, general knowledge, and a fast, on-the-fly process that captures specific experiences. Existing deep-unfolding methods for spectral compressive imaging (SCI) mirror only the slow component-relying on heavy pre-training with many unfolding stages-yet they lack the rapid adaptation needed to handle new optical configurations. As a result, they falter on out-of-distribution cameras, especially in bespoke spectral setups unseen during training. This depth also incurs heavy computation and slow inference. To bridge this gap, we introduce SlowFast-SCI, a dual-speed framework seamlessly integrated into any deep unfolding network beyond SCI systems. During slow learning, we pre-train or reuse a priors-based backbone and distill it via imaging guidance into a compact fast-unfolding model. In the fast learning stage, lightweight adaptation modules are embedded within each block and trained self-supervised at test time via a dual-domain loss-without retraining the backbone. To the best of our knowledge, SlowFast-SCI is the first test-time adaptation-driven deep unfolding framework for efficient, self-adaptive spectral reconstruction. Its dual-stage design unites offline robustness with on-the-fly per-sample calibration-yielding over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and a 4x faster adaptation speed. In addition, its modularity integrates with any deep-unfolding network, paving the way for self-adaptive, field-deployable imaging and expanded computational imaging modalities. The models, datasets, and code are available at https://github.com/XuanLu11/SlowFast-SCI.

[172] Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding

S M A Sharif, Abdur Rehman, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi

Main category: cs.CV

TL;DR: A new all-in-one image restoration method that learns latent degradation priors from inputs without explicit task cues, enabling adaptive feature selection, spatial localization, and degradation-aware restoration across diverse and mixed degradations.

DetailsMotivation: Existing all-in-one restoration methods rely on external text prompts or hand-crafted architectural priors, which impose brittle assumptions that limit generalization to unseen or mixed degradations in real-world images.

Method: Reframes AIR as learned latent prior inference, automatically inferring degradation-aware representations from inputs. Uses a structured reasoning paradigm with adaptive feature selection, spatial localization, and degradation semantics, implemented via a lightweight decoding module for spatially-adaptive restoration.

Result: Outperforms SOTA approaches across six common degradation tasks, five compound settings, and previously unseen degradations, achieving average PSNR improvement of 1.68 dB while being three times more efficient.

Conclusion: The proposed latent prior inference approach effectively addresses limitations of existing AIR methods, providing superior generalization to diverse and mixed degradations with improved efficiency and performance.

Abstract: Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.

[173] DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling

Timur Mamedov, Anton Konushin, Vadim Konushin

Main category: cs.CV

TL;DR: DynaMix is a novel generalizable person re-identification method that combines labeled multi-camera data with large-scale pseudo-labeled single-camera data using three dynamic components for efficient large-scale training.

DetailsMotivation: Existing generalizable person Re-ID methods rely heavily on limited labeled multi-camera data, which restricts their ability to scale and generalize effectively across unseen cameras and environments.

Method: DynaMix uses three core components: (1) Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly, (2) Efficient Centroids Module that maintains robust identity representations in large identity spaces, and (3) Data Sampling Module that composes mixed mini-batches to balance learning complexity and intra-batch diversity.
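
The Efficient Centroids Module can be pictured as a bank of momentum-updated identity centroids paired with a centroid-contrastive loss. The sketch below shows that mechanism; the update rule, normalization, and temperature are illustrative assumptions, not DynaMix's exact design.

```python
import torch
import torch.nn.functional as F

class EfficientCentroids:
    """Momentum-updated identity centroids for a very large identity space."""
    def __init__(self, num_ids, dim, momentum=0.9):
        self.c = torch.zeros(num_ids, dim)  # one centroid per identity
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, ids):  # feats: (B, D) L2-normalized, ids: (B,) long
        for f, i in zip(feats, ids):
            self.c[i] = self.m * self.c[i] + (1 - self.m) * f
        self.c[ids] = F.normalize(self.c[ids], dim=-1)

    def loss(self, feats, ids, tau=0.05):
        """Pull each sample toward its identity centroid, away from all others."""
        logits = feats @ self.c.t() / tau  # (B, num_ids)
        return F.cross_entropy(logits, ids)
```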

Result: Extensive experiments show DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID, enabling effective training on millions of images and hundreds of thousands of identities.

Conclusion: DynaMix successfully addresses the scalability challenge in generalizable person Re-ID by dynamically adapting to training data structure and noise, combining limited labeled data with large-scale pseudo-labeled data for superior generalization performance.

Abstract: Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.

Xiafeng Man, Zhipeng Wei, Jingjing Chen

Main category: cs.CV

TL;DR: DPM is a novel framework that detects copyright infringement in diffusion models using differential privacy principles, without needing original training data or prompts.

DetailsMotivation: Large vision models like Stable Diffusion can memorize and reproduce copyrighted content without authorization, raising legal/ethical concerns. Existing detection methods lack robustness and theoretical foundations.

Method: Formalizes copyright infringement detection using Differential Privacy principles. Introduces conditional sensitivity metric to quantify output deviation from specific training data. Proposes DPM framework that fine-tunes models in two opposing directions (learning/unlearning) and computes confidence scores over orthogonal prompt distributions using statistical metrics.

Result: DPM reliably detects infringement content without requiring access to original training dataset or text prompts. Also created Copyright Infringement Detection Dataset (CIDD) for standardized benchmarking.

Conclusion: DPM offers an interpretable and practical solution for safeguarding intellectual property in generative AI, addressing the gap between theoretical foundations and practical detection needs.

Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model’s output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. In addition, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringement content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

[175] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering

Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai

Main category: cs.CV

TL;DR: MoEGCL introduces fine-grained ego-graph fusion at sample level using Mixture-of-Experts network, outperforming coarse view-level fusion methods in multi-view clustering.

DetailsMotivation: Existing GNN-based multi-view clustering methods suffer from coarse-grained graph fusion, where separate graph structures for each view are fused at view level, which is a rough strategy that limits performance.

Method: Two main modules: 1) Mixture of Ego-Graphs Fusion (MoEGF) constructs ego graphs and uses Mixture-of-Experts network for fine-grained fusion at sample level; 2) Ego Graph Contrastive Learning (EGCL) aligns fused representation with view-specific representation, enhancing similarity of samples from same cluster.
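
The key contrast with view-level fusion is that the gating network produces per-sample weights over the view-specific ego-graph representations. A minimal sketch of such sample-level mixture-of-experts fusion follows; the dimensions and gate design are assumptions.

```python
import torch
import torch.nn as nn

class EgoGraphMoEFusion(nn.Module):
    """Sketch of sample-level ego-graph fusion: a gating network assigns each
    sample its own weights over V view-specific representations, instead of
    one global weight per view."""
    def __init__(self, num_views, dim):
        super().__init__()
        self.gate = nn.Linear(num_views * dim, num_views)

    def forward(self, view_feats):  # view_feats: (B, V, D) per-view embeddings
        B, V, D = view_feats.shape
        w = self.gate(view_feats.reshape(B, V * D)).softmax(dim=-1)  # (B, V)
        return (w.unsqueeze(-1) * view_feats).sum(dim=1)             # (B, D)
```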

Result: Extensive experiments show MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks.

Conclusion: MoEGCL addresses the coarse-grained fusion limitation through fine-grained sample-level ego-graph fusion and contrastive learning, significantly improving multi-view clustering performance.

Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.

[176] RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System

Runwei Guan, Rongsheng Hu, Shangshu Chen, Ningyuan Xiao, Xue Xia, Jiayang Liu, Beibei Chen, Ziren Tang, Ningwei Ouyang, Shaofeng Liang, Yuxuan Fan, Wanjie Sun, Yutao Yue

Main category: cs.CV

TL;DR: RoadSceneVQA: A large-scale VQA dataset for roadside perception with 34,736 QA pairs, plus CAF fusion module and AD-CoT reasoning method for state-of-the-art traffic scene understanding.

DetailsMotivation: Current roadside perception systems only do instance-level recognition but lack natural language interaction and contextual reasoning about traffic behaviors. Need to bridge this gap for more intelligent traffic scene understanding.

Method: 1) Create RoadSceneVQA dataset with diverse QA pairs covering weather/illumination/traffic variations. 2) Propose CogniAnchor Fusion (CAF) for vision-language fusion inspired by human scene anchoring. 3) Develop Assisted Decoupled Chain-of-Thought (AD-CoT) for enhanced reasoning via CoT prompting and multi-task learning. 4) Build baseline model RoadMind integrating these components.

Result: RoadMind achieves state-of-the-art performance on RoadSceneVQA and CODA-LM benchmarks, improving both reasoning accuracy and computational efficiency for structural traffic perception and reasoning tasks.

Conclusion: The proposed RoadSceneVQA dataset and RoadMind pipeline enable advanced natural language interaction and contextual reasoning in roadside perception, moving beyond simple instance recognition to comprehensive traffic scene understanding.

Abstract: Current roadside perception systems mainly focus on instance-level perception, which falls short of enabling interaction via natural language and reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a vision-language fusion module inspired by human-like scene anchoring mechanisms. Moreover, we propose the Assisted Decoupled Chain-of-Thought (AD-CoT) to enhance reasoning via CoT prompting and multi-task learning. Based on the above, we propose the baseline model RoadMind. Experiments on the RoadSceneVQA and CODA-LM benchmarks show that the pipeline consistently improves both reasoning accuracy and computational efficiency, allowing the MLLM to achieve state-of-the-art performance in structural traffic perception and reasoning tasks.

[177] Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning

Haiyang Zheng, Nan Pu, Wenjing Li, Teng Long, Nicu Sebe, Zhun Zhong

Main category: cs.CV

TL;DR: Proposes CAL framework for Open-World DeepFake Attribution with confidence-aware asymmetric learning and dynamic prototype pruning to handle known and unknown forgery types without prior knowledge of novel type count.

DetailsMotivation: Existing OW-DFA methods suffer from confidence skew causing unreliable pseudo-labels for novel forgeries, and unrealistic assumption that number of unknown forgery types is known beforehand.

Method: Confidence-Aware Asymmetric Learning (CAL) with two components: Confidence-Aware Consistency Regularization (CCR) to mitigate pseudo-label bias by scaling losses based on normalized confidence, and Asymmetric Confidence Reinforcement (ACR) to separately calibrate confidence for known/novel classes. Plus Dynamic Prototype Pruning (DPP) to automatically estimate novel forgery types.
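
CCR's loss scaling can be sketched directly: derive pseudo-labels and confidences from one view, normalize the confidences, and schedule the per-sample weights so that training emphasis moves from high- to low-confidence samples. The schedule below is our assumption, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def ccr_loss(logits_weak, logits_strong, epoch, max_epochs):
    """Confidence-aware consistency sketch (assumed form): pseudo-labels from
    a weakly augmented view supervise a strongly augmented view, with each
    sample's loss scaled by its normalized confidence."""
    probs = logits_weak.softmax(dim=-1).detach()
    conf, pseudo = probs.max(dim=-1)                    # per-sample confidence
    conf_norm = conf / (conf.max() + 1e-8)              # normalize to [0, 1]
    t = epoch / max_epochs
    weight = (1 - t) * conf_norm + t * (1 - conf_norm)  # high -> low focus
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (weight * loss).mean()
```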

Result: CAL consistently outperforms previous methods on standard OW-DFA benchmark and extended benchmark with advanced manipulations, achieving state-of-the-art performance for both known and novel forgery attribution.

Conclusion: The proposed CAL framework effectively addresses confidence skew and unrealistic assumptions in OW-DFA, providing robust attribution for both known and unknown synthetic facial forgeries in real-world scenarios.

Abstract: The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known a priori. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model’s OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.

[178] Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu

Main category: cs.CV

TL;DR: Motus is a unified latent action world model that integrates understanding, video generation, and action capabilities using a Mixture-of-Transformer architecture and optical flow-based latent actions, achieving significant performance improvements in robotic tasks.

DetailsMotivation: Current embodied AI systems use fragmented models for understanding, world modeling, and control, preventing unified multimodal generative capabilities and hindering learning from large-scale heterogeneous data.

Method: Proposes Motus with Mixture-of-Transformer architecture integrating three experts (understanding, video generation, action), uses UniDiffuser-style scheduler for flexible mode switching, leverages optical flow for latent action learning, and employs three-phase training with six-layer data pyramid.

Result: Achieves +15% improvement over X-VLA and +45% over Pi0.5 in simulation, and +11~48% improvement in real-world scenarios, demonstrating superior performance in robotic tasks.

Conclusion: Unified modeling of all functionalities and priors significantly benefits downstream robotic tasks, showing that integrated multimodal generative capabilities outperform fragmented approaches.

Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level “delta action” and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

[179] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

Main category: cs.CV

TL;DR: DAVE is a specialized vision encoder for VLMs that addresses the lack of structural/spatial information in existing encoders for document understanding and web agent tasks, using self-supervised pretraining on unlabeled data followed by supervised training with novel merging and ensemble techniques.

DetailsMotivation: Current vision-language models have a fundamental weakness: their vision encoders lack robust structural and spatial information needed for document understanding and web agent tasks, limiting their effectiveness in these domains.

Method: Two-stage training: 1) Self-supervised pretraining on unlabeled images, 2) Supervised autoregressive pretraining with limited high-quality data. Uses model-merging scheme to combine encoders trained with different text decoders for broad compatibility, and ensemble training to fuse features from generalist encoders with domain-specific representations.
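The summary names a model-merging scheme without specifying it; a common baseline is weight-space averaging of matched parameters across checkpoints, sketched here under that assumption (paths and names are hypothetical):

```python
import torch

def merge_encoders(checkpoint_paths, weights=None):
    """Hypothetical weight-space merge of encoders trained with different
    text decoders: a weighted average of matched parameters."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }
```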

Result: Extensive experiments on document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of DAVE, establishing it as a strong vision encoder for document and web applications.

Conclusion: DAVE successfully bridges the gap in vision encoders for VLMs by providing robust structural and spatial information essential for document understanding and web agent tasks, while leveraging abundant unlabeled data to avoid costly annotations.

Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

[180] Multi-Part Object Representations via Graph Structures and Co-Part Discovery

Alex Foo, Wynne Hsu, Mong Li Lee

Main category: cs.CV

TL;DR: A novel object-centric representation method using explicit graph representations for parts that improves multi-part object discovery and recognition in occluded/out-of-distribution settings.

DetailsMotivation: Current implicit object representation approaches fail to recognize learned objects in occluded or out-of-distribution contexts because they assume object part-whole relations are implicitly encoded through indirect training objectives.

Method: Proposes a novel method leveraging explicit graph representations for parts and presents a co-part object discovery algorithm. Also introduces three benchmarks to evaluate robustness in occluded and out-of-distribution settings.

Result: Experimental results on simulated, realistic, and real-world images show marked improvements in quality of discovered objects compared to SOTA methods, and accurate recognition of multi-part objects in occluded/out-of-distribution contexts.

Conclusion: The discovered object-centric representations can more accurately predict key object properties in downstream tasks, highlighting the method’s potential to advance object-centric representation field.

Abstract: Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency, and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fails to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages explicit graph representations for parts and by presenting a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.

[181] Total Normal Curvature Regularization and its Minimization for Surface and Image Smoothing

Tianle Lu, Ke Chen, Yuping Duan

Main category: cs.CV

TL;DR: Novel curvature regularization method using total normal curvature from multiple directions for sharp edges and isotropic smoothing, solved via PDE operator splitting.

DetailsMotivation: Need for curvature regularization methods that can produce solutions with sharp edges and precise isotropic properties while avoiding complex parameter tuning.

Method: Formulate total normal curvature regularization from multiple directions, reformulate as steady-state PDE system, use operator splitting for time discretization with closed-form or efficient subproblem solutions.
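As a generic illustration of the splitting step (not the paper's exact scheme), a two-operator splitting of a time-dependent system u_t = A(u) + B(u) advances one operator per fractional step:

```latex
u^{n+1/2} = u^{n} + \Delta t \, A\big(u^{n+1/2}\big), \qquad
u^{n+1} = u^{n+1/2} + \Delta t \, B\big(u^{n+1}\big)
```

Each fractional subproblem is solved in closed form or by an efficient algorithm, and iterating in n drives u^n toward the steady state of the PDE system.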

Result: Method demonstrates robustness to parameter choices, circumvents complex tuning, and is validated for surface and image smoothing problems.

Conclusion: The proposed total normal curvature regularization with PDE operator splitting provides an efficient, effective, and robust approach for curvature-based smoothing with sharp edge preservation.

Abstract: We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.

[182] Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis

Xiaoming Zhang, Chunli Li, Jiacheng Hao, Yuan Gao, Danyang Tu, Jianyi Qiao, Xiaoli Yin, Le Lu, Ling Zhang, Ke Yan, Yang Hou, Yu Shi

Main category: cs.CV

TL;DR: MOON++ is a multimodal framework using non-contrast CT scans to assess esophageal varices severity by analyzing esophagus, liver, and spleen relationships, achieving superior performance over single-organ methods.

DetailsMotivation: Esophageal varices affect 60% of cirrhosis patients with high bleeding risk, traditionally requiring invasive endoscopy. Non-contrast CT offers a potential non-invasive alternative but is underutilized clinically.

Method: MOON++ synthesizes imaging characteristics of esophagus, liver, and spleen through multimodal learning, incorporating clinical knowledge priors about organ volumetric relationships with liver disease severity.

Result: Achieved AUC of 0.894 vs 0.803 for severe grade classification (G3 vs <G3) and 0.921 vs 0.793 for moderate-to-severe differentiation (>=G2 vs <G2), validated through reader studies with radiologists.

Conclusion: MOON++ represents the first comprehensive multi-organ NCCT analysis framework for EV assessment, potentially offering a promising non-invasive diagnostic alternative to invasive endoscopy.

Abstract: Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1,631 patients; those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single-organ methods, achieving an AUC of 0.894 versus 0.803 for severe-grade EV classification (G3 versus <G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (>=G2 versus <G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative.

[183] D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang

Main category: cs.CV

TL;DR: D2Pruner is a token pruning framework for MLLMs that combines debiased importance scoring with structural pruning to handle both general understanding and fine-grained localization tasks efficiently.

DetailsMotivation: Current token pruning methods for MLLMs work for general understanding but fail catastrophically on fine-grained localization tasks due to positional bias in importance-based methods and structural blindness in diversity-based methods.

Method: D2Pruner first selects critical tokens as pivots using debiased attention scores, then performs Maximal Independent Set selection on remaining tokens modeled on a hybrid graph considering spatial proximity and semantic similarity to maximize importance and diversity.
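A minimal NumPy sketch of the greedy maximal-independent-set selection described above; the hybrid-graph construction and thresholds are assumptions:

```python
import numpy as np

def mis_select(scores, adj):
    """Greedy MIS over tokens: repeatedly keep the most important available
    token and suppress its neighbors on the hybrid graph.

    scores: (N,) debiased importance per token
    adj:    (N, N) boolean adjacency (spatial proximity or semantic
            similarity above a threshold)
    """
    alive = np.ones(len(scores), dtype=bool)
    keep = []
    for i in np.argsort(-scores):      # visit tokens by descending importance
        if alive[i]:
            keep.append(int(i))
            alive[adj[i]] = False      # remove graph neighbors
            alive[i] = False
    return keep
```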

Result: For LLaVA-1.5-7B on general tasks: 74.2% FLOPs reduction while retaining 99.2% performance. For InternVL-2.5-8B on localization benchmarks: maintains 85.7% performance at 90% token reduction, with up to 63.53% improvement over existing methods.

Conclusion: D2Pruner effectively addresses limitations of existing token pruning methods by combining debiased importance with structural pruning, achieving exceptional efficiency and fidelity for both general understanding and fine-grained localization tasks in MLLMs.

Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user’s prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods.

[184] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L. J Qiu, Pretesh Patel, Ashesh B. Jani, Hui Mao, Zhen Tian, Xiaofeng Yang

Main category: cs.CV

TL;DR: Novel MRI super-resolution framework combining multi-head selective state-space models with lightweight channel MLP achieves state-of-the-art performance with exceptional efficiency (0.9M parameters, 57 GFLOPs).

DetailsMotivation: High-resolution MRI is clinically important but limited by long acquisition times. Existing deep learning SR methods face trade-offs between fidelity and computational efficiency, hindering clinical integration.

Method: Proposed framework uses multi-head selective state-space models (MHSSM) with lightweight channel MLP, 2D patch extraction with hybrid scanning, and MambaFormer blocks integrating MHSSM, depthwise convolutions, and gated channel mixing.

Result: Superior performance on 7T brain (SSIM=0.951, PSNR=26.90 dB) and 1.5T prostate (SSIM=0.770, PSNR=27.15 dB) data, significantly outperforming all baselines. Achieved 99.8% parameter reduction and 97.5% computation reduction versus Res-SRDiff.

Conclusion: The framework provides an efficient, accurate MRI SR solution with enhanced anatomical detail and low computational demand, showing strong potential for clinical translation.

Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.

[185] SemanticGen: Video Generation in Semantic Space

Jianhong Bai, Xiaoshi Wu, Xintao Wang, Xiao Fu, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai

Main category: cs.CV

TL;DR: SemanticGen: A two-stage video generation method that first generates compact semantic features for global planning, then adds high-frequency details, enabling faster convergence and more efficient long video generation compared to VAE latent space approaches.

DetailsMotivation: Current video generative models that operate in VAE latent space suffer from slow convergence and computational inefficiency, especially for long videos. The authors aim to address these limitations by leveraging the inherent redundancy in videos.

Method: Two-stage generation: 1) A diffusion model generates compact semantic video features for global layout planning, 2) Another diffusion model generates VAE latents conditioned on these semantic features to produce final output with high-frequency details.

Result: SemanticGen achieves faster convergence compared to VAE latent space generation, is computationally efficient for long video generation, and produces high-quality videos that outperform state-of-the-art approaches and strong baselines.

Conclusion: Generating videos in semantic space rather than directly in VAE latent space is more effective for global planning and computationally efficient, making it a promising approach for high-quality video generation, especially for longer sequences.

Abstract: State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.

[186] Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation

Reeshad Khan, John Gauch

Main category: cs.CV

TL;DR: End-to-end RAW-to-task pipeline co-designing optics, sensors, and neural networks for autonomous driving perception, achieving better segmentation with compact models.

DetailsMotivation: Traditional autonomous driving pipelines separate camera design from perception, using fixed optics and handcrafted ISPs optimized for human viewing rather than machine perception. This discards information during processing and forces models to adapt to sensor artifacts.

Method: Task-driven co-design framework unifying optics, sensor modeling, and lightweight semantic segmentation networks into single end-to-end RAW-to-task pipeline. Integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives.
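The Poisson-Gaussian sensor model in the pipeline is a standard formulation; a short sketch with hypothetical gain and read-noise values (not the paper's calibration):

```python
import torch

def poisson_gaussian(raw, gain=0.01, read_std=0.002):
    """Signal-dependent shot noise plus signal-independent read noise."""
    shot = torch.poisson(raw.clamp(min=0) / gain) * gain  # photon shot noise
    read = read_std * torch.randn_like(raw)               # sensor read noise
    return (shot + read).clamp(0.0, 1.0)
```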

Result: Consistent mIoU improvements over fixed pipelines on KITTI-360, with optics modeling and CFA learning providing largest gains, especially for thin or low-light-sensitive classes. Achieved with compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth.

Conclusion: Full-stack co-optimization of optics, sensors, and networks establishes principled path toward efficient, reliable, and deployable perception in autonomous systems.

Abstract: Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human-viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens [19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.

[187] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Yuhan Wang, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan

Main category: cs.CV

TL;DR: UltraShape 1.0 is a scalable 3D diffusion framework for high-fidelity geometry generation using a two-stage pipeline: coarse structure synthesis followed by detailed refinement, with novel data processing and spatial-geometric decoupling.

DetailsMotivation: To address the challenge of generating high-quality 3D geometry with limited training resources, while overcoming issues with existing 3D datasets that contain low-quality samples, holes, and thin structures.

Method: Two-stage generation pipeline: 1) Coarse global structure synthesis, 2) Detailed refinement using voxel-based refinement at fixed spatial locations with RoPE encoding. Includes comprehensive data processing with watertight processing and quality filtering.

Result: Achieves competitive performance with existing open-source methods in both data processing quality and geometry generation, trained exclusively on publicly available 3D datasets despite limited resources.

Conclusion: UltraShape 1.0 demonstrates effective 3D geometry generation through a scalable diffusion framework with novel data processing and spatial-geometric decoupling, with code and models to be released for research.

Abstract: In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.

[188] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

Main category: cs.CV

TL;DR: DreaMontage is a framework for generating seamless, expressive long-duration one-shot videos from arbitrary user inputs, overcoming limitations of existing video generation methods through adaptive tuning, visual expression fine-tuning, and memory-efficient inference.

DetailsMotivation: One-shot filmmaking is aesthetically sophisticated but prohibitively expensive and constrained in reality. Existing video generation models rely on naive clip concatenation that fails to maintain visual smoothness and temporal coherence, creating a need for better virtual alternatives.

Method: Three-pronged approach: 1) Lightweight intermediate-conditioning mechanism in DiT architecture with Adaptive Tuning strategy for arbitrary-frame control; 2) High-quality dataset curation with Visual Expression SFT stage and Tailored DPO scheme for motion rationality and transition smoothness; 3) Segment-wise Auto-Regressive (SAR) inference strategy for memory-efficient long sequence generation.
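A sketch of how segment-wise auto-regressive inference might proceed; the interface, segment length, and tail-frame conditioning are assumptions rather than the paper's implementation:

```python
def sar_rollout(model, keyframes, seg_len=49, tail=4):
    """Generate a long one-shot video segment by segment, conditioning each
    segment on the previous segment's tail frames and the next keyframe.
    `model` is a hypothetical generator exposing a generate() call."""
    video, context = [], None
    for keyframe in keyframes:
        seg = model.generate(context=context, target_frame=keyframe,
                             length=seg_len)
        video.extend(seg)
        context = seg[-tail:]   # bounded memory: only the tail is carried over
    return video
```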

Result: Extensive experiments demonstrate visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, enabling transformation of fragmented visual materials into cohesive cinematic experiences.

Conclusion: DreaMontage provides a comprehensive solution for generating high-quality one-shot videos from arbitrary inputs, addressing key challenges in visual fidelity, temporal coherence, and computational efficiency to empower users in creating cinematic experiences.

Abstract: The “one-shot” technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

[189] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras

Main category: cs.CV

TL;DR: TICON is a transformer-based tile representation contextualizer that enriches tile embeddings with slide-level context, improving performance across computational pathology tasks and enabling a slide-level foundation model with far fewer WSIs.

DetailsMotivation: Standard tile encoder pipelines extract embeddings without slide-level context, which is essential for both local and global tasks in computational pathology. Different tile encoders excel at different tasks, creating a need for a unified model that can contextualize embeddings from any tile-level foundation model.

Method: TICON uses a single shared transformer encoder pretrained with a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. It also includes an aggregator to form slide-level representations.
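A compact sketch of a masked-modeling objective over tile embeddings, in the spirit of the pretraining described above; the masking scheme, loss, and `contextualizer` module are assumptions:

```python
import torch

def masked_tile_loss(tile_embs, contextualizer, mask_ratio=0.3):
    """Hide a fraction of tile embeddings and reconstruct them from
    slide-level context; tile_embs is (B, N, D)."""
    B, N, _ = tile_embs.shape
    mask = torch.rand(B, N, device=tile_embs.device) < mask_ratio
    visible = tile_embs.masked_fill(mask.unsqueeze(-1), 0.0)  # drop masked tiles
    pred = contextualizer(visible)        # (B, N, D) contextualized embeddings
    return ((pred - tile_embs) ** 2)[mask].mean()
```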

Result: TICON-contextualized embeddings significantly improve performance across multiple tasks, establishing new SOTA results on tile-level benchmarks (HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (Patho-Bench). The slide-level foundation model trained with only 11K WSIs outperforms SOTA models trained with up to 350K WSIs.

Conclusion: TICON successfully addresses the need for contextualized tile representations in computational pathology, providing a unified approach that improves performance across diverse tasks while enabling efficient slide-level foundation modeling with significantly fewer training samples.

Abstract: The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for “any” application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from “any” tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

[190] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua

Main category: cs.CV

TL;DR: HiStream is an efficient autoregressive framework for high-resolution video generation that reduces computational redundancy through spatial, temporal, and timestep compression, achieving up to 107.5x speedup with minimal quality loss.

DetailsMotivation: High-resolution video generation is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible due to the massive computational requirements.

Method: HiStream uses three compression strategies: 1) Spatial compression - denoising at low resolution first then refining at high resolution with cached features; 2) Temporal compression - chunk-by-chunk processing with fixed-size anchor cache for stable inference; 3) Timestep compression - applying fewer denoising steps to subsequent cache-conditioned chunks.
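An illustrative rollout combining the temporal and timestep compressions; the interface, anchor policy, and step counts are assumptions, not HiStream's implementation:

```python
def histream_rollout(model, noise_chunks, anchor_size=2,
                     full_steps=50, fast_steps=20):
    """Chunk-by-chunk denoising with a fixed-size anchor cache and fewer
    denoising steps on later, cache-conditioned chunks.
    `model` is a hypothetical denoiser exposing denoise()/encode()."""
    cache, video = [], []
    for i, noise in enumerate(noise_chunks):
        steps = full_steps if i == 0 else fast_steps   # timestep compression
        chunk = model.denoise(noise, cache=cache, num_steps=steps)
        video.append(chunk)
        if len(cache) < anchor_size:                   # fixed-size anchor cache
            cache.append(model.encode(chunk))
    return video
```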

Result: On 1080p benchmarks, HiStream achieves state-of-the-art visual quality with 76.2x faster denoising than Wan2.1 baseline and negligible quality loss. HiStream+ (with all three optimizations) achieves 107.5x acceleration, offering compelling speed-quality trade-off.

Conclusion: HiStream makes high-resolution video generation practical and scalable by dramatically reducing computational requirements while maintaining visual quality, addressing the fundamental bottleneck in diffusion-based video generation.

Abstract: High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

cs.AI

[191] From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration

Shuide Wen, Yu Sun, Beier Ku, Zhi Gao, Lijun Ma, Yang Yang, Can Jiao

Main category: cs.AI

TL;DR: MLLM-based multi-agent system achieves expert-level performance in interpreting HTP projective tests, offering standardized assessment with high ecological validity.

DetailsMotivation: The HTP test lacks standardized scoring, suffers from subjective interpretation, and needs quantitative coding systems, creating reliability issues in clinical practice.

Method: Developed a multi-agent framework using multimodal large language models that integrates social-psychological perspectives and destigmatizing narratives to interpret HTP drawings.

Result: MLLM interpretations achieved ~0.75 semantic similarity with human experts (rising to 0.85 in structured datasets), effectively corrected hallucinations, and produced coherent psychological reports.

Conclusion: Multimodal large models can serve as standardized projective assessment tools, with the multi-agent framework offering a new paradigm for digital mental health services.

Abstract: Background: The House-Tree-Person (HTP) drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. However, it has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and a lack of a unified quantitative coding system. Results: Quantitative experiments showed that the mean semantic similarity between Multimodal Large Language Model (MLLM) interpretations and human expert interpretations was approximately 0.75 (standard deviation about 0.05). In structurally oriented expert data sets, this similarity rose to 0.85, indicating expert-level baseline comprehension. Qualitative analyses demonstrated that the multi-agent system, by integrating social-psychological perspectives and destigmatizing narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence. Conclusions: The findings confirm the potential of multimodal large models as standardized tools for projective assessment. The proposed multi-agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental-health services. Keywords: House-Tree-Person test; multimodal large language model; multi-agent collaboration; cosine similarity; computational psychology; artificial intelligence
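Per the paper's keywords, the reported semantic similarity is presumably the standard cosine similarity between embeddings of model and expert interpretations; for reference:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```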

[192] A Study of Solving Life-and-Death Problems in Go Using Relevance-Zone Based Solvers

Chung-Chin Shih, Ti-Rong Wu, Ting Han Wei, Yu-Shan Hsu, Hung Guei, I-Chen Wu

Main category: cs.AI

TL;DR: This paper analyzes how current state-of-the-art computer Go solvers using Relevance-Zone Based Search (RZS) and relevance-zone pattern tables perform on Life-and-Death problems from Cho Chikun’s book, finding both successes and limitations compared to human expert solutions.

DetailsMotivation: To evaluate the effectiveness and limitations of modern computer Go solvers (using RZS and pattern tables) on classical Life-and-Death problems, comparing their solutions against expert human analysis from a renowned Go textbook.

Method: Used relevance-zone based solvers with RZS and relevance-zone pattern tables to analyze seven L&D problems from Cho Chikun’s “Life and Death Dictionary,” examining the solvers’ identified relevance-zones, discovered patterns, and comparing solutions against the book’s answers.

Result: Solvers successfully identified critical relevance-zones for each problem and discovered patterns including rare ones. However, they found different answers for two problems compared to the book, and exhibited two key issues: misjudging values of rare patterns and prioritizing direct living over territory maximization (unlike human players).

Conclusion: While relevance-zone based solvers show promise in analyzing L&D problems, they have systematic limitations in pattern evaluation and strategic priorities compared to human experts. The paper suggests approaches to address these issues and makes code/data available for further research.

Abstract: This paper analyzes the behavior of solving Life-and-Death (L&D) problems in the game of Go using current state-of-the-art computer Go solvers with two techniques: the Relevance-Zone Based Search (RZS) and the relevance-zone pattern table. We examined the solutions derived by relevance-zone based solvers on seven L&D problems from the renowned book “Life and Death Dictionary” written by Cho Chikun, a Go grandmaster, and found several interesting results. First, for each problem, the solvers identify a relevance-zone that highlights the critical areas for solving. Second, the solvers discover a series of patterns, including some that are rare. Finally, the solvers even find different answers compared to the given solutions for two problems. We also identified two issues with the solver: (a) it misjudges values of rare patterns, and (b) it tends to prioritize living directly rather than maximizing territory, which differs from the behavior of human Go players. We suggest possible approaches to address these issues in future work. Our code and data are available at https://rlg.iis.sinica.edu.tw/papers/study-LD-RZ.

[193] Three-way conflict analysis based on alliance and conflict functions

Junfang Luo, Mengjun Hu, Guangming Lang, Xin Yang, Keyun Qin

Main category: cs.AI

TL;DR: The paper proposes separating alliance and conflict functions in three-way conflict analysis to clarify semantics, addressing issues with standard aggregation methods that obscure meaningful differences in agent attitudes.

DetailsMotivation: Current three-way conflict analysis uses single functions that combine opposite aspects (alliance/conflict), making aggregations ambiguous. For example, averaging +1 (alliance) and -1 (conflict) yields 0, which is indistinguishable from neutrality, despite representing very different attitudes.

Method: Separates the two opposite aspects in auxiliary functions into distinct alliance and conflict functions. Uses these separate functions to trisect agents, issues, and agent pairs, then applies this framework to explore alliance sets and strategies in conflict analysis.

Result: Develops a clearer semantic framework for three-way conflict analysis that properly distinguishes between alliance, conflict, and neutrality. Provides applications for solving crucial questions in conflict analysis and demonstrates the approach with a real-world example.

Conclusion: Separating alliance and conflict functions provides more meaningful and interpretable results in three-way conflict analysis, addressing limitations of existing aggregation methods and enabling better analysis of agent relationships and strategies.

Abstract: Trisecting agents, issues, and agent pairs are essential topics of three-way conflict analysis. They have been commonly studied based on either a rating or an auxiliary function. A rating function defines the positive, negative, or neutral ratings of agents on issues. An auxiliary function defines the alliance, conflict, and neutrality relations between agents. These functions measure two opposite aspects in a single function, leading to challenges in interpreting their aggregations over a group of issues or agents. For example, when studying agent relations regarding a set of issues, a standard aggregation takes the average of an auxiliary function concerning single issues. Therefore, a pair of alliance +1 and conflict -1 relations will produce the same result as a pair of neutrality 0 relations, although the attitudes represented by the two pairs are very different. To clarify semantics, we separate the two opposite aspects in an auxiliary function into a pair of alliance and conflict functions. Accordingly, we trisect the agents, issues, and agent pairs and investigate their applications in solving a few crucial questions in conflict analysis. Particularly, we explore the concepts of alliance sets and strategies. A real-world application is given to illustrate the proposed models.
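The abstract's motivating example is easy to make concrete; a sketch of the separated aggregation (the paper's exact definitions may differ):

```python
def aggregate(values):
    """Aggregate auxiliary-function values in {+1, 0, -1} over issues into
    separate alliance and conflict degrees (illustrative)."""
    n = len(values)
    alliance = sum(v for v in values if v > 0) / n
    conflict = -sum(v for v in values if v < 0) / n
    return alliance, conflict

print(aggregate([+1, -1]))  # (0.5, 0.5): strong but opposed attitudes
print(aggregate([0, 0]))    # (0.0, 0.0): genuine neutrality
# a plain average maps both cases to 0 and loses the distinction
```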

[194] Feasible strategies in three-way conflict analysis with three-valued ratings

Jing Liu, Mengjun Hu, Guangming Lang

Main category: cs.AI

TL;DR: This paper proposes new conflict resolution models that identify feasible strategies using weighted consistency and non-consistency measures, outperforming existing approaches by systematically finding optimal solutions.

DetailsMotivation: Existing three-way conflict analysis focuses on understanding conflicts but lacks practical resolution methods. There's insufficient attention to formulating feasible strategies, which are essential for actual conflict resolution and mitigation.

Method: 1) Compute overall rating of agent cliques using positive/negative similarity degrees; 2) Propose weighted consistency and non-consistency measures considering agent and issue weights; 3) Develop algorithms to identify feasible strategies, L-order feasible strategies, and optimal ones; 4) Apply to NBA labor negotiations and Gansu Province development case studies with sensitivity and comparative analysis.
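One way to realize a weighted consistency measure of a clique, shown as an illustrative sketch; the paper's exact definition may differ, and this version uses a weighted mean absolute deviation of three-valued ratings:

```python
import numpy as np

def weighted_consistency(ratings, agent_w, issue_w):
    """ratings: (agents, issues) values in {+1, 0, -1};
    agent_w, issue_w: nonnegative weights."""
    R = np.asarray(ratings, dtype=float)
    aw = np.asarray(agent_w, dtype=float)[:, None]
    mean = (aw * R).sum(axis=0) / aw.sum()                    # weighted stance
    spread = (aw * np.abs(R - mean)).sum(axis=0) / aw.sum()   # disagreement
    per_issue = 1.0 - spread                                  # in [0, 1]
    return float(np.dot(issue_w, per_issue) / np.sum(issue_w))
```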

Result: The proposed models demonstrate practicality, effectiveness, and superiority over existing approaches. They outperform conventional methods by unifying weighted agent-issue evaluation with consistency/non-consistency measures to systematically identify both feasible strategies and optimal solutions.

Conclusion: The paper successfully addresses the gap in conflict resolution by providing systematic methods to identify feasible and optimal strategies through weighted consistency/non-consistency measures, offering practical tools for real-world conflict analysis and resolution.

Abstract: Most existing work on three-way conflict analysis has focused on trisecting agent pairs, agents, or issues, which contributes to understanding the nature of conflicts but falls short in addressing their resolution. Specifically, the formulation of feasible strategies, as an essential component of conflict resolution and mitigation, has received insufficient scholarly attention. Therefore, this paper aims to investigate feasible strategies from two perspectives of consistency and non-consistency. Particularly, we begin with computing the overall rating of a clique of agents based on positive and negative similarity degrees. Afterwards, considering the weights of both agents and issues, we propose weighted consistency and non-consistency measures, which are respectively used to identify the feasible strategies for a clique of agents. Algorithms are developed to identify feasible strategies, $L$-order feasible strategies, and the corresponding optimal ones. Finally, to demonstrate the practicality, effectiveness, and superiority of the proposed models, we apply them to two commonly used case studies on NBA labor negotiations and development plans for Gansu Province and conduct a sensitivity analysis on parameters and a comparative analysis with existing state-of-the-art conflict analysis approaches. The comparison results demonstrate that our conflict resolution models outperform the conventional approaches by unifying weighted agent-issue evaluation with consistency and non-consistency measures to enable the systematic identification of not only feasible strategies but also optimal solutions.

[195] Three-way decision with incomplete information based on similarity and satisfiability

Junfang Luo, Mengjun Hu, Keyun Qin

Main category: cs.AI

TL;DR: This paper generalizes three-way decision from complete to incomplete information by extending both computational (equivalence relations) and conceptual (logic satisfiability) formulations with new similarity and satisfiability degree measures.

DetailsMotivation: Three-way decision with rough sets is well-established for complete information, but real-world applications often involve incomplete information. The paper aims to bridge this gap by generalizing existing formulations to handle incomplete data.

Method: Two main approaches: 1) Computational formulation using similarity degree measures (generalizing equivalence relations) with alpha-similarity classes and approximability; 2) Conceptual formulation using satisfiability degree measures (generalizing logic satisfiability) with alpha-meaning sets and confidence of formulas.

Result: The paper proposes novel measures for similarity degree and satisfiability degree, introduces the concept of approximability, and develops two approaches for each formulation to handle incomplete information in three-way decision making.

Conclusion: While similarity classes are common for incomplete information analysis, the proposed approximability concept and conceptual formulation approaches offer promising new directions for three-way decision with incomplete data in practical applications.

Abstract: Three-way decision is widely applied with rough set theory to learn classification or decision rules. The approaches dealing with complete information are well established in the literature, including the two complementary computational and conceptual formulations. The computational formulation uses equivalence relations, and the conceptual formulation uses satisfiability of logic formulas. In this paper, based on a brief review of these two formulations, we generalize both into three-way decision with incomplete information, which is more practical in real-world applications. For the computational formulation, we propose a new measure of the similarity degree of objects as a generalization of equivalence relations. Based on it, we discuss two approaches to three-way decision using alpha-similarity classes and the approximability of objects, respectively. For the conceptual formulation, we propose a measure of the satisfiability degree of formulas as a quantitative generalization of satisfiability with complete information. Based on it, we study two approaches to three-way decision using alpha-meaning sets of formulas and the confidence of formulas, respectively. While using similarity classes is a common method of analyzing incomplete information in the literature, the proposed concept of approximability and the two approaches in the conceptual formulation point out promising new directions.
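One simple instance of a similarity degree under incomplete information, agreement over mutually known attributes, shown as a hedged sketch (not the paper's exact measure):

```python
def similarity_degree(x, y, missing=None):
    """Fraction of attributes, known in both objects, on which they agree."""
    known = [(a, b) for a, b in zip(x, y) if a != missing and b != missing]
    if not known:
        return 0.0
    return sum(a == b for a, b in known) / len(known)

# the missing attribute is ignored rather than counted as a mismatch
print(similarity_degree(["red", None, "round"], ["red", "big", "round"]))  # 1.0
```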

[196] LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis

Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Huazhe Tan, Weibin Yao, Jianshu Li

Main category: cs.AI

TL;DR: LogicLens is a unified visual-textual co-reasoning framework for text-centric forgery analysis that integrates detection, grounding, and explanation into a joint task using cross-cues-aware reasoning.

DetailsMotivation: Current methods for text-centric forgery analysis are limited to coarse-grained visual analysis, lack sophisticated reasoning capabilities, and treat detection, grounding, and explanation as separate tasks, missing opportunities for holistic performance enhancement.

Method: Introduces LogicLens with Cross-Cues-aware Chain of Thought (CCT) mechanism for iterative visual-textual cross-validation, weighted multi-task reward function for GRPO optimization, PR² pipeline for annotation generation, and RealText dataset with 5,397 images.
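A minimal sketch of a weighted multi-task reward feeding GRPO-style group-normalized advantages; the weights and reward components are assumptions, not the paper's values:

```python
import numpy as np

def weighted_reward(detect_r, ground_r, explain_r, w=(0.4, 0.3, 0.3)):
    """Combine per-task rewards (detection, grounding, explanation)."""
    return w[0] * detect_r + w[1] * ground_r + w[2] * explain_r

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize rewards within a sampled group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```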

Result: Superior performance across benchmarks: 41.4% improvement over specialized framework and 23.4% over GPT-4o on T-IC13 in zero-shot evaluation; significant lead over other MLLM-based methods on T-SROIE dataset in multiple metrics.

Conclusion: LogicLens effectively addresses limitations of existing text-centric forgery analysis methods through unified visual-textual co-reasoning, demonstrating state-of-the-art performance and providing valuable resources (dataset, model, code) for the community.

Abstract: Sophisticated text-centric forgeries, fueled by rapid AIGC advancements, pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis and lack the capacity for sophisticated reasoning. Moreover, they typically treat detection, grounding, and explanation as discrete sub-tasks, overlooking their intrinsic relationships for holistic performance enhancement. To address these challenges, we introduce LogicLens, a unified framework for Visual-Textual Co-reasoning that reformulates these objectives into a joint task. The deep reasoning of LogicLens is powered by our novel Cross-Cues-aware Chain of Thought (CCT) mechanism, which iteratively cross-validates visual cues against textual logic. To ensure robust alignment across all tasks, we further propose a weighted multi-task reward function for GRPO-based optimization. Complementing this framework, we first designed the PR$^2$ (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi-agent system that generates high-quality, cognitively-aligned annotations. Then, we constructed RealText, a diverse dataset comprising 5,397 images with fine-grained annotations, including textual explanations, pixel-level segmentation, and authenticity labels for model training. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks. In a zero-shot evaluation on T-IC13, it surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. Moreover, on the challenging dense-text T-SROIE dataset, it establishes a significant lead over other MLLM-based methods in mF1, CSS, and the macro-average F1. Our dataset, model, and code will be made publicly available.

[197] Democratizing Drug Discovery with an Orchestrated, Knowledge-Driven Multi-Agent Team for User-Guided Therapeutic Design

Takahide Suzuki, Kazuki Nakanishi, Takashi Fujiwara, Hideyuki Shimizu

Main category: cs.AI

TL;DR: OrchestRA is an autonomous multi-agent AI platform that integrates biology, chemistry, and pharmacology to transform drug discovery from stochastic search to programmable evidence-based engineering.

DetailsMotivation: Therapeutic discovery faces challenges from domain fragmentation and the gap between computational design and physiological validation. Current generative AI models are passive assistants rather than autonomous executors.

Method: OrchestRA uses a human-in-the-loop multi-agent platform with three specialized agents: Biologist Agent (deep reasoning over >10M association knowledge graph), Chemist Agent (autonomous structural pocket detection for de novo design/drug repositioning), and Pharmacologist Agent (PBPK simulations). Governed by an Orchestrator, agents actively execute simulations and reason results for iterative optimization.

Result: The platform establishes a dynamic feedback loop where pharmacokinetic and toxicity profiles directly trigger structural reoptimization, creating an autonomous discovery engine that unifies biology, chemistry, and pharmacology.

Conclusion: OrchestRA democratizes therapeutic design by seamlessly integrating autonomous execution with human guidance, transforming drug discovery into a programmable evidence-based engineering discipline.

Abstract: Therapeutic discovery remains a formidable challenge, impeded by the fragmentation of specialized domains and the execution gap between computational design and physiological validation. Although generative AI offers promise, current models often function as passive assistants rather than as autonomous executors. Here, we introduce OrchestRA, a human-in-the-loop multi-agent platform that unifies biology, chemistry, and pharmacology into an autonomous discovery engine. Unlike static code generators, our agents actively execute simulations and reason the results to drive iterative optimization. Governed by an Orchestrator, a Biologist Agent leverages deep reasoning over a massive knowledge graph (>10 million associations) to pinpoint high-confidence targets; a Chemist Agent autonomously detects structural pockets for de novo design or drug repositioning; and a Pharmacologist Agent evaluates candidates via rigorous physiologically based pharmacokinetic (PBPK) simulations. This architecture establishes a dynamic feedback loop where pharmacokinetic and toxicity profiles directly trigger structural reoptimization. By seamlessly integrating autonomous execution with human guidance, OrchestRA democratizes therapeutic design, transforming drug discovery from a stochastic search to a programmable evidence-based engineering discipline.

[198] Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model

Yanhao Li, Lu Ma, Jiaran Zhang, Lexiang Tang, Wentao Zhang, Guibo Luo

Main category: cs.AI

TL;DR: Leash is a reinforcement learning framework that uses adaptive length penalties to reduce LLM reasoning length by 60% while maintaining performance across diverse tasks.

DetailsMotivation: Fixed length penalties are hard to tune and fail to adapt to LLMs' evolving reasoning abilities, leading to suboptimal trade-offs between accuracy and conciseness.

Method: Formulates length control as constrained optimization, uses Lagrangian primal-dual method to dynamically adjust penalty coefficient based on whether generations exceed or fall short of target length.
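
A minimal sketch of the primal-dual mechanism, assuming a linear overshoot penalty and a projected (non-negative) dual ascent step; the paper's exact shaping may differ:

```python
# Minimal sketch of the adaptive length penalty via a Lagrangian dual update.
# Target length, learning rate, and reward shaping are assumptions.

def shaped_reward(task_reward: float, gen_len: int, lam: float,
                  target_len: int) -> float:
    """Penalize length overshoot, scaled by the dual variable lambda."""
    return task_reward - lam * max(0, gen_len - target_len) / target_len

def dual_update(lam: float, avg_len: float, target_len: int,
                eta: float = 0.01) -> float:
    """Intensify the penalty when generations run long, relax when short."""
    return max(0.0, lam + eta * (avg_len - target_len) / target_len)

lam, target = 0.1, 512
for step, avg_len in enumerate([900, 760, 610, 540, 480]):  # mock rollouts
    lam = dual_update(lam, avg_len, target)
    print(f"step {step}: avg_len={avg_len}, lambda={lam:.4f}")
```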

Result: Reduces average reasoning length by 60% across diverse tasks (in-distribution math reasoning and out-of-distribution coding/instruction following) while maintaining competitive performance on Deepseek-R1-Distill-Qwen-1.5B and Qwen3-4B-Thinking-2507.

Conclusion: Presents a practical and effective paradigm for developing controllable and efficient LLMs that balance reasoning capabilities with computational budgets.

Abstract: Existing approaches typically rely on fixed length penalties, but such penalties are hard to tune and fail to adapt to the evolving reasoning abilities of LLMs, leading to suboptimal trade-offs between accuracy and conciseness. To address this challenge, we propose Leash (adaptive LEngth penAlty and reward SHaping), a reinforcement learning framework for efficient reasoning in LLMs. We formulate length control as a constrained optimization problem and employ a Lagrangian primal-dual method to dynamically adjust the penalty coefficient. When generations exceed the target length, the penalty is intensified; when they are shorter, it is relaxed. This adaptive mechanism guides models toward producing concise reasoning without sacrificing task performance. Experiments on Deepseek-R1-Distill-Qwen-1.5B and Qwen3-4B-Thinking-2507 show that Leash reduces the average reasoning length by 60% across diverse tasks - including in-distribution mathematical reasoning and out-of-distribution domains such as coding and instruction following - while maintaining competitive performance. Our work thus presents a practical and effective paradigm for developing controllable and efficient LLMs that balance reasoning capabilities with computational budgets.

[199] NEMO-4-PAYPAL: Leveraging NVIDIA’s Nemo Framework for empowering PayPal’s Commerce Agent

Ali Sahami, Sudhanshu Garg, Andrew Wang, Chaitanya Kulkarni, Farhad Farahani, Sean Yun-Shiuan Chuang, Jian Wan, Srinivasan Manoharan, Uma Kona, Nitin Sharma, Linsey Pang, Prakhar Mehrotra, Jessica Clark, Mark Moyou

Main category: cs.AI

TL;DR: PayPal developed Commerce Agent using NVIDIA’s NeMo Framework to optimize multi-agent commerce systems, showing significant latency and cost improvements while maintaining quality.

DetailsMotivation: To revolutionize agentic commerce on PayPal platform by optimizing multi-agent systems, specifically addressing performance issues in retrieval components that account for over 50% of total agent response time.

Method: Used NVIDIA’s NeMo Framework for LLM fine-tuning, replaced base model with fine-tuned Nemotron SLM, conducted systematic hyperparameter sweeps across learning rates, optimizers (Adam/AdamW), cosine annealing schedules, and LoRA ranks using llama3.1-nemotron-nano-8B-v1 architecture.
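
The sweep itself can be expressed framework-agnostically; the grid values below are illustrative placeholders, not the paper's actual configurations (the real runs go through NeMo, whose job-launch API is not shown here):

```python
# Sketch of the hyperparameter grid described above. Values are illustrative.
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4]
optimizers = ["adam", "adamw"]
lora_ranks = [8, 16, 32]

for lr, opt, rank in product(learning_rates, optimizers, lora_ranks):
    config = {
        "base_model": "llama3.1-nemotron-nano-8B-v1",
        "lr": lr,
        "optimizer": opt,
        "scheduler": "cosine_annealing",
        "lora_rank": rank,
    }
    print(config)  # in practice: launch one NeMo fine-tuning job per config
```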

Result: Fine-tuned Nemotron SLM effectively resolved key performance issues in retrieval component while maintaining or enhancing overall system performance, achieving significant improvements in latency and cost.

Conclusion: The approach demonstrates successful application of NVIDIA’s NeMo Framework to commerce-specific agent optimization, providing a scalable framework for multi-agent system optimization in production e-commerce environments.

Abstract: We present the development and optimization of PayPal’s Commerce Agent, powered by NEMO-4-PAYPAL, a multi-agent system designed to revolutionize agentic commerce on the PayPal platform. Through our strategic partnership with NVIDIA, we leveraged the NeMo Framework for LLM model fine-tuning to enhance agent performance. Specifically, we optimized the Search and Discovery agent by replacing our base model with a fine-tuned Nemotron small language model (SLM). We conducted comprehensive experiments using the llama3.1-nemotron-nano-8B-v1 architecture, training LoRA-based models through systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. Our contributions include: (1) the first application of NVIDIA’s NeMo Framework to commerce-specific agent optimization, (2) LLM powered fine-tuning strategy for retrieval-focused commerce tasks, (3) demonstration of significant improvements in latency and cost while maintaining agent quality, and (4) a scalable framework for multi-agent system optimization in production e-commerce environments. Our results demonstrate that the fine-tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component, which represents over 50% of total agent response time, while maintaining or enhancing overall system performance.

[200] A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning

Zelin Zang, Wenyi Gu, Siqi Ma, Dan Yang, Yue Shen, Zhu Zhang, Guohui Fan, Wing-Kuen Ling, Fuji Yang

Main category: cs.AI

TL;DR: A diagnostic framework combining vision-language alignment with logic-regularized reasoning to improve reliability and interpretability of multimodal medical AI systems.

DetailsMotivation: Current multimodal models (LLMs/VLMs) in medicine often produce hallucinations or inconsistent reasoning chains, limiting clinical trust despite integrating clinical text and medical imaging.

Method: Built upon LLaVA, the framework includes: input encoder for text/images, projection module for cross-modal alignment, reasoning controller that decomposes diagnostic tasks into steps, and logic tree generator that assembles stepwise premises into verifiable conclusions.
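
A toy sketch of the logic-tree idea, assuming a conclusion node that verifies only if all its premises verify; the data structure is an assumption, not the paper's implementation:

```python
# Hypothetical sketch of assembling stepwise premises into a verifiable
# conclusion, in the spirit of the logic tree generator described above.
from dataclasses import dataclass, field

@dataclass
class Node:
    claim: str
    premises: list = field(default_factory=list)

    def verify(self) -> bool:
        # A conclusion holds only if all supporting premises hold;
        # leaves stand in for evidence checks and are accepted here.
        return all(p.verify() for p in self.premises) if self.premises else True

finding = Node("Chest X-ray shows right lower lobe consolidation")
symptom = Node("Patient reports fever and productive cough")
diagnosis = Node("Likely bacterial pneumonia", premises=[finding, symptom])
print(diagnosis.claim, "->", diagnosis.verify())
```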

Result: Evaluations on MedXpertQA and other benchmarks show improved diagnostic accuracy and more interpretable reasoning traces on multimodal tasks, while remaining competitive on text-only settings.

Conclusion: The approach represents a promising step toward trustworthy multimodal medical AI by addressing reliability and interpretability issues in clinical reasoning.

Abstract: With the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive on text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.

[201] AMS-IO-Bench and AMS-IO-Agent: Benchmarking and Structured Reasoning for Analog and Mixed-Signal Integrated Circuit Input/Output Design

Zhishuai Zhang, Xintian Li, Shilong Liu, Aodong Zhang, Lu Jie, Nan Sun

Main category: cs.AI

TL;DR: AMS-IO-Agent: An LLM-based agent that automates structure-aware I/O subsystem generation for analog/mixed-signal ICs, connecting natural language design intent to industrial deliverables with over 70% DRC+LVS pass rate.

DetailsMotivation: To bridge the gap between natural language design intent and industrial-level AMS IC design deliverables, automating the time-consuming I/O subsystem generation process that currently takes hours of manual work.

Method: A framework integrating: (1) structured domain knowledge base with reusable constraints and design conventions, (2) design intent structuring that converts ambiguous user intent into verifiable logic steps using JSON and Python as intermediate formats.
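
A hypothetical example of the JSON intermediate representation; all field names are invented for illustration:

```python
# Hypothetical example of structuring an ambiguous I/O-ring request into a
# verifiable JSON intermediate, as described above; field names are assumed.
import json

user_intent = "wirebond I/O ring, 28nm, 4 power pads, 12 signal pads, ESD on all signals"

structured = {
    "package": "wirebond",
    "process_node_nm": 28,
    "pads": {"power": 4, "signal": 12},
    "constraints": ["esd_protection_on_signal_pads"],
    "checks": ["DRC", "LVS"],   # verification steps the agent must pass
}
print("intent:", user_intent)
print(json.dumps(structured, indent=2))
```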

Result: Achieves over 70% DRC+LVS pass rate on AMS-IO-Bench benchmark, reduces design turnaround time from hours to minutes, outperforms baseline LLM. Agent-generated I/O ring was fabricated and validated in 28nm CMOS tape-out.

Conclusion: First reported human-agent collaborative AMS IC design where LLM-based agent completes nontrivial subtask with outputs directly used in silicon, demonstrating practical effectiveness in real AMS IC design flows.

Abstract: In this paper, we propose AMS-IO-Agent, a domain-specialized LLM-based agent for structure-aware input/output (I/O) subsystem generation in analog and mixed-signal (AMS) integrated circuits (ICs). The central contribution of this work is a framework that connects natural language design intent with industrial-level AMS IC design deliverables. AMS-IO-Agent integrates two key capabilities: (1) a structured domain knowledge base that captures reusable constraints and design conventions; (2) design intent structuring, which converts ambiguous user intent into verifiable logic steps using JSON and Python as intermediate formats. We further introduce AMS-IO-Bench, a benchmark for wirebond-packaged AMS I/O ring automation. On this benchmark, AMS-IO-Agent achieves over 70% DRC+LVS pass rate and reduces design turnaround time from hours to minutes, outperforming the baseline LLM. Furthermore, an agent-generated I/O ring was fabricated and validated in a 28 nm CMOS tape-out, demonstrating the practical effectiveness of the approach in real AMS IC design flows. To our knowledge, this is the first reported human-agent collaborative AMS IC design in which an LLM-based agent completes a nontrivial subtask with outputs directly used in silicon.

[202] Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing

Hong Xie, Haoran Gu, Yanying Huang, Tao Tan, Defu Lian

Main category: cs.AI

TL;DR: The paper proposes a multiple-play stochastic bandit variant for resource allocation in LLM/edge intelligence applications, with prioritized capacity sharing and matching regret bounds.

DetailsMotivation: Resource allocation problems in emerging applications like LLM serving and edge intelligence require handling prioritized access to stochastic capacities, which existing bandit models don't address.

Method: Designs MSB-PRS-OffOpt algorithm for optimal policy computation with O(MK³) complexity, then builds approximate UCB algorithm using it as subroutine for online learning.
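
The prioritized capacity-sharing rule itself is simple to illustrate; a sketch with assumed unit demands per play:

```python
# Sketch of the prioritized capacity-sharing rule: when plays compete for an
# arm, capacity goes to larger priority weights first. Values are illustrative.

def allocate(capacity: int, play_weights: list[float]) -> dict[int, int]:
    """Allocate unit capacities to plays in descending priority order."""
    order = sorted(range(len(play_weights)),
                   key=lambda i: play_weights[i], reverse=True)
    alloc = {i: 0 for i in range(len(play_weights))}
    for i in order:
        if capacity == 0:
            break
        alloc[i] = 1          # each play requests one unit on this arm
        capacity -= 1
    return alloc

# Three plays with priorities 0.9 > 0.5 > 0.2 compete for 2 units of capacity.
print(allocate(2, [0.5, 0.9, 0.2]))  # {0: 1, 1: 1, 2: 0}
```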

Result: Proves an instance-independent regret lower bound of Ω(α₁σ√(KMT)) and an instance-dependent bound of Ω(α₁σ²(M/Δ) ln T), with matching upper bounds up to factors of √(K ln KT) and α₁K², respectively.

Conclusion: The framework successfully addresses prioritized resource sharing in bandits with theoretical guarantees, enabling efficient resource allocation for modern applications.

Abstract: This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising from LLM applications, edge intelligence, etc. The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacities, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight. When multiple plays compete for the arm capacity, the arm capacity is allocated in a larger priority weight first manner. Instance independent and instance dependent regret lower bounds of $\Omega(\alpha_1 \sigma \sqrt{KMT})$ and $\Omega(\alpha_1 \sigma^2 \frac{M}{\Delta} \ln T)$ are proved, where $\alpha_1$ is the largest priority weight and $\sigma$ characterizes the reward tail. When model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} to locate the optimal play allocation policy with a computational complexity of $O(MK^3)$. Utilizing \texttt{MSB-PRS-OffOpt} as a subroutine, an approximate upper confidence bound (UCB) based algorithm is designed, which has instance independent and instance dependent regret upper bounds matching the corresponding lower bound up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$ respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.

[203] Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning

Eranga Bandara, Tharaka Hewa, Ross Gore, Sachin Shetty, Ravi Mukkamala, Peter Foytik, Abdul Rahman, Safdar H. Bouk, Xueping Liang, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Main category: cs.AI

TL;DR: The paper proposes a Responsible and Explainable AI Agent Architecture for production-grade agentic workflows using multi-model consensus and reasoning-layer governance to address challenges in explainability, accountability, robustness, and governance.

DetailsMotivation: Agentic AI systems enable powerful autonomous capabilities but introduce critical challenges in explainability, accountability, robustness, and governance, especially when agent outputs influence downstream actions. Existing implementations focus on functionality and scalability but lack mechanisms for understanding decision rationale or enforcing responsibility across agent interactions.

Method: A consortium of heterogeneous LLM and VLM agents independently generates candidate outputs from a shared input context, exposing uncertainty and alternative interpretations. A dedicated reasoning agent performs structured consolidation across outputs, enforcing safety/policy constraints, mitigating hallucinations/bias, and producing auditable evidence-backed decisions. Explainability is achieved through cross-model comparison and preserved intermediate outputs, while responsibility is enforced via centralized reasoning-layer control and agent-level constraints.
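
A minimal sketch of the consolidation step, using majority voting as a stand-in for the reasoning agent's structured consolidation; the scoring and policy check are assumptions:

```python
# Minimal sketch of reasoning-layer consolidation over heterogeneous agent
# outputs: surface disagreement, then keep an auditable record.
from collections import Counter

candidates = {
    "llm_a": "approve",
    "llm_b": "approve",
    "vlm_c": "reject",    # dissenting interpretation is preserved, not hidden
}

votes = Counter(candidates.values())
decision, support = votes.most_common(1)[0]
audit_record = {
    "decision": decision,
    "agreement": support / len(candidates),    # exposes uncertainty
    "dissent": [a for a, v in candidates.items() if v != decision],
    "policy_check": decision in {"approve", "reject"},  # assumed constraint
}
print(audit_record)
```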

Result: The architecture was evaluated across multiple real-world agentic AI workflows, demonstrating that consensus-driven reasoning improves robustness, transparency, and operational trust across diverse application domains.

Conclusion: The work provides practical guidance for designing agentic AI systems that are both autonomous/scalable and responsible/explainable by construction, addressing critical production deployment challenges through multi-model consensus and reasoning-layer governance.

Abstract: Agentic AI represents a major shift in how autonomous systems reason, plan, and execute multi-step tasks through the coordination of Large Language Models (LLMs), Vision Language Models (VLMs), tools, and external services. While these systems enable powerful new capabilities, increasing autonomy introduces critical challenges related to explainability, accountability, robustness, and governance, especially when agent outputs influence downstream actions or decisions. Existing agentic AI implementations often emphasize functionality and scalability, yet provide limited mechanisms for understanding decision rationale or enforcing responsibility across agent interactions. This paper presents a Responsible(RAI) and Explainable(XAI) AI Agent Architecture for production-grade agentic workflows based on multi-model consensus and reasoning-layer governance. In the proposed design, a consortium of heterogeneous LLM and VLM agents independently generates candidate outputs from a shared input context, explicitly exposing uncertainty, disagreement, and alternative interpretations. A dedicated reasoning agent then performs structured consolidation across these outputs, enforcing safety and policy constraints, mitigating hallucinations and bias, and producing auditable, evidence-backed decisions. Explainability is achieved through explicit cross-model comparison and preserved intermediate outputs, while responsibility is enforced through centralized reasoning-layer control and agent-level constraints. We evaluate the architecture across multiple real-world agentic AI workflows, demonstrating that consensus-driven reasoning improves robustness, transparency, and operational trust across diverse application domains. This work provides practical guidance for designing agentic AI systems that are autonomous and scalable, yet responsible and explainable by construction.

[204] Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets

Matyas Bohacek, Ignacio Vilanova Echavarri

Main category: cs.AI

TL;DR: The paper introduces a Compliance Rating Scheme (CRS) framework and open-source Python library to evaluate dataset compliance with transparency, accountability, and security principles, addressing ethical gaps in GAI dataset creation.

DetailsMotivation: GAI datasets are often built using unrestricted, opaque data collection practices, with ethical/legal considerations neglected. Information about dataset origin, legitimacy, and safety gets lost as datasets are shared and reproduced online.

Method: Introduces Compliance Rating Scheme (CRS) framework for evaluating dataset compliance with transparency, accountability, and security principles. Releases open-source Python library built around data provenance technology for seamless integration into existing pipelines.
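
The released library's actual API is not shown here; the following is a purely hypothetical illustration of rating a dataset against transparency, accountability, and security criteria:

```python
# Hypothetical illustration of a compliance rating over transparency,
# accountability, and security criteria; the real library's API and rubric
# may differ entirely.

criteria = {
    "license_documented": True,      # transparency
    "provenance_recorded": True,     # accountability
    "pii_screened": False,           # security
}

score = sum(criteria.values()) / len(criteria)
rating = "A" if score == 1.0 else "B" if score >= 0.66 else "C"
print(f"compliance score={score:.2f}, rating={rating}")
```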

Result: Developed a reactive and proactive system that both evaluates existing datasets’ CRS and informs responsible scraping/construction of new datasets through an open-source implementation.

Conclusion: The CRS framework and accompanying library address critical gaps in GAI dataset ethics by providing tools for compliance evaluation and promoting responsible dataset creation practices.

Abstract: Generative Artificial Intelligence (GAI) has experienced exponential growth in recent years, partly facilitated by the abundance of large-scale open-source datasets. These datasets are often built using unrestricted and opaque data collection practices. While most literature focuses on the development and applications of GAI models, the ethical and legal considerations surrounding the creation of these datasets are often neglected. In addition, as datasets are shared, edited, and further reproduced online, information about their origin, legitimacy, and safety often gets lost. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework designed to evaluate dataset compliance with critical transparency, accountability, and security principles. We also release an open-source Python library built around data provenance technology to implement this framework, allowing for seamless integration into existing dataset-processing and AI training pipelines. The library is simultaneously reactive and proactive, as in addition to evaluating the CRS of existing datasets, it equally informs responsible scraping and construction of new datasets.

[205] Accelerating Scientific Discovery with Autonomous Goal-evolving Agents

Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G. Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, Teresa Head-Gordon, Carla P. Gomes, Huan Sun, Chenru Duan, Philippe Schwaller, Wengong Jin

Main category: cs.AI

TL;DR: SAGA introduces an autonomous scientific agent that automatically designs objective functions rather than relying on fixed human-specified objectives, using a bi-level architecture with LLM agents to propose and optimize objectives for scientific discovery.

DetailsMotivation: Current scientific discovery agents rely on imperfect proxy objective functions specified by humans, which limits their effectiveness for grand scientific challenges. There's an unmet need for automating objective function design to improve scientific discovery.

Method: SAGA uses a bi-level architecture: an outer loop with LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under current objectives.
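
A toy sketch of the bi-level loop, with random search standing in for the inner optimizer and a scripted objective shift standing in for the LLM's objective revision:

```python
# Sketch of the bi-level loop: an outer step rewrites the scoring function,
# an inner step optimizes candidates under it. Both objectives are toy
# stand-ins for the LLM-proposed scoring functions.
import random

def inner_optimize(score_fn, n=200):
    """Inner loop: crude random search under the current objective."""
    pool = [random.uniform(0, 10) for _ in range(n)]
    return max(pool, key=score_fn)

objective = lambda x: -(x - 3.0) ** 2            # initial proxy objective
for outer_step in range(3):
    best = inner_optimize(objective)
    # Outer loop: an LLM agent would analyze `best` and propose a revised
    # objective; here we shift the target as a stand-in for that revision.
    objective = (lambda t: (lambda x: -(x - t) ** 2))(3.0 + outer_step)
    print(f"outer step {outer_step}: best candidate so far = {best:.3f}")
```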

Result: The framework demonstrates substantial improvements in scientific discovery effectiveness across diverse applications including antibiotic design, inorganic materials design, functional DNA sequence design, and chemical process design.

Conclusion: Automating objective formulation through SAGA’s bi-level architecture enables systematic exploration of objective spaces and trade-offs, significantly enhancing the capabilities of scientific discovery agents beyond fixed objective optimization.

Abstract: There has been unprecedented interest in developing agents that expand the boundary of scientific discovery, primarily by optimizing quantitative objective functions specified by scientists. However, for grand challenges in science, these objectives are only imperfect proxies. We argue that automating objective function design is a central, yet unmet requirement for scientific discovery agents. In this work, we introduce the Scientific Autonomous Goal-evolving Agent (SAGA) to address this challenge. SAGA employs a bi-level architecture in which an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives. This bi-level design enables systematic exploration of the space of objectives and their trade-offs, rather than treating them as fixed inputs. We demonstrate the framework through a broad spectrum of applications, including antibiotic design, inorganic materials design, functional DNA sequence design, and chemical process design, showing that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.

[206] SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?

Kenny Workman, Zhen Yang, Harihara Muralidharan, Hannah Le

Main category: cs.AI

TL;DR: SpatialBench: A benchmark of 146 verifiable problems from real spatial transcriptomics workflows to evaluate AI agents’ ability to extract biological insights from messy spatial datasets.

DetailsMotivation: Spatial transcriptomics assays are becoming more complex, creating computational bottlenecks. While AI agents have improved at software engineering and general data analysis, it's unclear if they can extract meaningful biological insights from messy, real-world spatial datasets.

Method: Created SpatialBench benchmark with 146 verifiable problems derived from practical spatial analysis workflows across five spatial technologies and seven task categories. Each problem provides experimental data snapshots and deterministic graders to evaluate recovery of key biological results.
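
A hypothetical example of what a deterministic grader might look like; the field name and tolerance are assumptions, not SpatialBench's actual schema:

```python
# Hypothetical deterministic grader: the agent's answer is checked against a
# key biological result within tolerance.

def grade(answer: dict, truth: dict, rel_tol: float = 0.05) -> bool:
    """Pass if the recovered cell-type proportion matches the reference."""
    key = "tumor_cell_fraction"
    if key not in answer:
        return False
    return abs(answer[key] - truth[key]) <= rel_tol * truth[key]

print(grade({"tumor_cell_fraction": 0.31}, {"tumor_cell_fraction": 0.30}))  # True
```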

Result: Base model accuracy remains low (20-38% across model families) with strong model-task and model-platform interactions. Harness design (tools, prompts, control flow, execution environment) has large empirical effect on performance.

Conclusion: SpatialBench serves as both measurement tool and diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly. Tools, prompts, and execution environment should be evaluated as first-class objects.

Abstract: Spatial transcriptomics assays are rapidly increasing in scale and complexity, making computational analysis a major bottleneck in biological discovery. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world spatial datasets. We introduce SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on frontier models shows that base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, indicating that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects. SpatialBench serves both as a measurement tool and a diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly.

[207] Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks

Zubair Shah, Noaman Khan

Main category: cs.AI

TL;DR: Pruning as equilibrium outcome of strategic game among model components, where sparsity emerges naturally when participation becomes dominated strategy.

DetailsMotivation: Most pruning methods treat sparsity as externally imposed constraint using heuristics or regularization, lacking principled foundation. This work proposes viewing pruning as equilibrium outcome of strategic interaction among model components.

Method: Model parameter groups (weights, neurons, filters) as players in continuous non-cooperative game. Each player selects participation level to balance contribution vs redundancy/competition. Sparsity emerges when participation becomes dominated strategy at equilibrium. Derive equilibrium-driven pruning algorithm that jointly updates parameters and participation variables without explicit importance scores.
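
A toy dynamic illustrating the claim: participation rises with marginal contribution and falls under a competition cost, so low-contribution (dominated) groups decay to zero. The payoff shape is an illustrative assumption:

```python
# Toy sketch of the equilibrium dynamic: each group's participation follows
# the gradient of an assumed payoff; dominated players collapse to zero.

contribution = [0.9, 0.5, 0.05]        # fixed "usefulness" per parameter group
cost = 0.2                             # competition / redundancy pressure
participation = [1.0, 1.0, 1.0]

for step in range(200):
    for i, c in enumerate(contribution):
        grad = c - cost - 0.5 * participation[i]   # d(payoff)/d(participation)
        participation[i] = min(1.0, max(0.0, participation[i] + 0.1 * grad))

print([round(p, 2) for p in participation])  # [1.0, 0.6, 0.0]: last group pruned
```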

Result: Shows dominated players collapse to zero participation under mild conditions, providing principled explanation for pruning behavior. Experiments on standard benchmarks demonstrate competitive sparsity-accuracy trade-offs with interpretable, theory-grounded approach.

Conclusion: Proposes novel game-theoretic perspective on neural network pruning where sparsity emerges naturally from equilibrium analysis, offering principled alternative to heuristic pruning methods with competitive performance.

Abstract: Neural network pruning is widely used to reduce model size and computational cost. Yet, most existing methods treat sparsity as an externally imposed constraint, enforced through heuristic importance scores or training-time regularization. In this work, we propose a fundamentally different perspective: pruning as an equilibrium outcome of strategic interaction among model components. We model parameter groups such as weights, neurons, or filters as players in a continuous non-cooperative game, where each player selects its level of participation in the network to balance contribution against redundancy and competition. Within this formulation, sparsity emerges naturally when continued participation becomes a dominated strategy at equilibrium. We analyze the resulting game and show that dominated players collapse to zero participation under mild conditions, providing a principled explanation for pruning behavior. Building on this insight, we derive a simple equilibrium-driven pruning algorithm that jointly updates network parameters and participation variables without relying on explicit importance scores. This work focuses on establishing a principled formulation and empirical validation of pruning as an equilibrium phenomenon, rather than exhaustive architectural or large-scale benchmarking. Experiments on standard benchmarks demonstrate that the proposed approach achieves competitive sparsity-accuracy trade-offs while offering an interpretable, theory-grounded alternative to existing pruning methods.

[208] Creative Agents: Empowering Agents with Imagination for Creative Tasks

Penglin Cai, Chi Zhang, Yuhui Fu, Haoqi Yuan, Zongqing Lu

Main category: cs.AI

TL;DR: Creative agents for open-ended tasks in Minecraft using imagination-enhanced controllers to convert abstract instructions into concrete goals and perform long-horizon planning.

DetailsMotivation: Existing instruction-following agents lack creativity - the ability to produce novel and diverse solutions implicit in language instructions. They fail to convert abstract instructions into concrete goals and perform long-horizon planning for complex creative tasks.

Method: Propose creative agents with imagination-enhanced controllers: imaginator generates detailed task outcome imaginations (textual via LLM or visual via diffusion model), controller executes tasks (behavior-cloning policy or foundation model generating executable code). Tested in Minecraft building creation tasks.
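
A pipeline schematic of imaginator-then-controller; every function is a stub standing in for the paper's components:

```python
# Pipeline sketch: the imagination (text or an image) conditions goal
# extraction, then a controller turns goals into executable actions.

def imaginator(instruction: str) -> str:
    """Stand-in for an LLM/diffusion model producing a detailed outcome."""
    return f"a cozy wooden cabin with a stone chimney, per: {instruction}"

def extract_goals(imagination: str) -> list[str]:
    return ["gather wood", "gather stone", "build walls", "build chimney"]

def controller(goal: str) -> str:
    """Stand-in for a behavior-cloning policy or code-generating model."""
    return f"execute({goal!r})"

for goal in extract_goals(imaginator("build me a rustic home")):
    print(controller(goal))
```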

Result: Creative agents are the first AI agents accomplishing diverse building creation in Minecraft survival mode. Proposed novel GPT-4V-based evaluation metrics for open-ended creative tasks, overcoming limitations of existing metrics.

Conclusion: Imagination-enhanced agents enable creativity in open-ended tasks by bridging abstract instructions to concrete execution. The approach shows promise for creative embodied AI, with open-source benchmark and models provided for future research.

Abstract: We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity – the ability to give novel and diverse solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete goals and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with imagination, we propose a class of solutions, where the controller is enhanced with an imaginator generating detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents create diverse buildings given free-form language instructions. We propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

[209] ForestProtector: An IoT Architecture Integrating Machine Vision and Deep Reinforcement Learning for Efficient Wildfire Monitoring

Kenneth Bonilla-Ormachea, Horacio Cuizaga, Edwin Salcedo, Sebastian Castro, Sergio Fernandez-Testa, Misael Mamani

Main category: cs.AI

TL;DR: Low-cost forest fire detection system using central gateway with computer vision and deep reinforcement learning for automated 360° monitoring with reduced false positives.

DetailsMotivation: Early fire detection is critical as fire duration exponentially increases extinguishing difficulty and cost. Existing systems are expensive, require human intervention, and can't continuously monitor large areas effectively.

Method: Central gateway device with computer vision monitors 360° field of view for smoke at long distances. Deep reinforcement learning agent dynamically controls camera orientation using real-time sensor data (smoke, temperature, humidity) from distributed IoT devices.
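
A toy sketch of the sensing-to-action interface; the real system learns this mapping with deep RL, whereas the hand-written risk score below is purely illustrative:

```python
# Toy sketch of the sensing-to-action loop: fuse distributed IoT readings
# into a target bearing for the 360-degree camera.

sensors = [                       # bearing (degrees) and readings per node
    {"bearing": 30, "smoke": 0.1, "temp_c": 22, "humidity": 0.5},
    {"bearing": 150, "smoke": 0.7, "temp_c": 31, "humidity": 0.2},
    {"bearing": 270, "smoke": 0.2, "temp_c": 24, "humidity": 0.4},
]

def pick_bearing(readings):
    """Point the camera toward the most fire-prone sector."""
    risk = lambda r: r["smoke"] + 0.01 * r["temp_c"] - 0.2 * r["humidity"]
    return max(readings, key=risk)["bearing"]

print(f"rotate camera to {pick_bearing(sensors)} degrees")  # 150
```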

Result: Proposed system enables automated wildfire monitoring across expansive areas while reducing false positives through intelligent sensor fusion and adaptive camera control.

Conclusion: The low-cost system addresses limitations of existing fire detection technologies by combining computer vision, IoT sensors, and reinforcement learning for practical, scalable forest fire monitoring.

Abstract: Early detection of forest fires is crucial to minimizing the environmental and socioeconomic damage they cause. Indeed, a fire’s duration directly correlates with the difficulty and cost of extinguishing it. For instance, a fire burning for 1 minute might require 1 liter of water to extinguish, while a 2-minute fire could demand 100 liters, and a 10-minute fire might necessitate 1,000 liters. On the other hand, existing fire detection systems based on novel technologies (e.g., remote sensing, PTZ cameras, UAVs) are often expensive and require human intervention, making continuous monitoring of large areas impractical. To address this challenge, this work proposes a low-cost forest fire detection system that utilizes a central gateway device with computer vision capabilities to monitor a 360° field of view for smoke at long distances. A deep reinforcement learning agent enhances surveillance by dynamically controlling the camera’s orientation, leveraging real-time sensor data (smoke levels, ambient temperature, and humidity) from distributed IoT devices. This approach enables automated wildfire monitoring across expansive areas while reducing false positives.

[210] CP-Agent: Agentic Constraint Programming

Stefan Szeider

Main category: cs.AI

TL;DR: CP-Agent is a Python coding agent using ReAct framework to translate natural language into constraint programming models, solving all 101 problems in CP-Bench after benchmark clarification.

DetailsMotivation: Translating natural language into formal constraint models requires domain expertise and modeling framework knowledge. The paper investigates whether agentic workflows can benefit constraint modeling by automating this translation process.

Method: Introduces CP-Agent, a Python coding agent using the ReAct framework with persistent IPython kernel. It uses domain knowledge via project prompts (<50 lines), iteratively executes code, observes solver feedback, and refines models based on execution results.
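
A schematic of the execute-observe-refine loop, with plain Python standing in for the constraint solver and a scripted revision standing in for the LLM; this is not CP-Agent's actual code:

```python
# Schematic of the execute-observe-refine loop around a persistent kernel.

def run_model_code(code: str):
    """Stand-in for executing agent-written modeling code in a kernel."""
    env = {}
    try:
        exec(code, env)                          # the agent's candidate model
        return env.get("solution"), None
    except Exception as e:
        return None, repr(e)                     # solver/runtime feedback

candidates = [
    "solution = [x for x in range(10) if x * x == y]",   # bug: y undefined
    "solution = [x for x in range(10) if x * x == 49]",  # revised after feedback
]
for code in candidates:                          # ReAct: act, observe, refine
    solution, error = run_model_code(code)
    print("observation:", error or f"solved: {solution}")
    if error is None:
        break
```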

Result: Evaluated on CP-Bench’s 101 constraint programming problems. After clarifying benchmark ambiguities and fixing ground-truth errors, CP-Agent solves all 101 problems. Ablation studies show minimal guidance outperforms detailed procedural scaffolding, and explicit task management tools have mixed effects.

Conclusion: Agentic workflows are effective for constraint modeling, with CP-Agent successfully automating natural language to constraint model translation. The approach demonstrates that minimal guidance works better than detailed scaffolding for focused modeling tasks.

Abstract: Translating natural language into formal constraint models requires expertise in the problem domain and modeling frameworks. To investigate whether constraint modeling benefits from agentic workflows, we introduce CP-Agent, a Python coding agent using the ReAct framework with a persistent IPython kernel. Domain knowledge is provided through a project prompt of under 50 lines. The agent iteratively executes code, observes the solver’s feedback, and refines models based on the execution results. We evaluate CP-Agent on CP-Bench’s 101 constraint programming problems. We clarified the benchmark to address systematic ambiguities in problem specifications and errors in ground-truth models. On the clarified benchmark, CP-Agent solves all 101 problems. Ablation studies indicate that minimal guidance outperforms detailed procedural scaffolding, and that explicit task management tools have mixed effects on focused modeling tasks.

[211] Lightweight Diffusion-based Framework for Online Imagined Speech Decoding in Aphasia

Eunyeong Ko, Soowon Kim, Ha-Na Jo

Main category: cs.AI

TL;DR: Real-time imagined speech decoding for aphasia using lightweight diffusion model achieves 65% top-1 accuracy in online feedback phase.

DetailsMotivation: Individuals with aphasia struggle with real-time verbal communication, but existing imagined speech decoding approaches are limited to offline analysis or computationally demanding models, creating a need for real-time solutions.

Method: Two-session framework: offline data acquisition followed by online feedback phase. Uses four-class Korean-language task with three imagined speech targets based on participant’s daily needs plus resting state. Introduces lightweight diffusion-based neural decoding model optimized for real-time inference through dimensionality reduction, temporal kernel optimization, group normalization with regularization, and dual early-stopping criteria.

Result: Real-time evaluation achieved 65% top-1 and 70% top-2 accuracy overall, with the Water class reaching 80% top-1 and 100% top-2 accuracy in a single individual with chronic anomic aphasia.

Conclusion: Real-time-optimized diffusion-based architectures combined with clinically grounded task design can support feasible online imagined speech decoding for communication-oriented BCI applications in aphasia.

Abstract: Individuals with aphasia experience severe difficulty in real-time verbal communication, while most imagined speech decoding approaches remain limited to offline analysis or computationally demanding models. To address this limitation, we propose a two-session experimental framework consisting of an offline data acquisition phase and a subsequent online feedback phase for real-time imagined speech decoding. The paradigm employed a four-class Korean-language task, including three imagined speech targets selected according to the participant’s daily communicative needs and a resting-state condition, and was evaluated in a single individual with chronic anomic aphasia. Within this framework, we introduce a lightweight diffusion-based neural decoding model explicitly optimized for real-time inference, achieved through architectural simplifications such as dimensionality reduction, temporal kernel optimization, group normalization with regularization, and dual early-stopping criteria. In real-time evaluation, the proposed system achieved 65 percent top-1 and 70 percent top-2 accuracy, with the Water class reaching 80 percent top-1 and 100 percent top-2 accuracy. These results demonstrate that real-time-optimized diffusion-based architectures, combined with clinically grounded task design, can support feasible online imagined speech decoding for communication-oriented BCI applications in aphasia.

[212] Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi

Main category: cs.AI

TL;DR: CRM replaces single reward models with a team of specialist evaluators for better robustness and interpretability in RLHF, using rewardBench for training and assessment.

DetailsMotivation: Conventional reward models struggle with optimizing multiple conflicting preference dimensions (factuality, helpfulness, safety) and lack transparency in scoring decisions.

Method: Decomposes preference evaluation into domain-specific agents producing partial signals, plus global evaluators (ranker-based, embedding-similarity). A centralized aggregator fuses signals at each timestep, balancing factors like step-wise correctness and multi-agent agreement.
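
A minimal sketch of the centralized aggregator, assuming a weighted sum with agreement bonus and repetition penalty; the weights are illustrative:

```python
# Sketch of the centralized aggregator: fuse specialist and global signals
# into one scalar per timestep. Weights and penalty terms are assumptions.

def aggregate(signals: dict[str, float], agreement: float,
              repetition: float) -> float:
    weights = {"factuality": 0.4, "helpfulness": 0.3, "safety": 0.3}
    specialist = sum(weights[k] * signals[k] for k in weights)
    return specialist + 0.2 * agreement - 0.1 * repetition

r = aggregate({"factuality": 0.8, "helpfulness": 0.9, "safety": 1.0},
              agreement=0.75, repetition=0.1)
print(f"training reward = {r:.3f}")  # fed to GAE / value-model regression
```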

Result: Provides a practical, modular framework for transparent reward modeling and stable optimization, compatible with standard RL pipelines using advantage-based updates (GAE) and value model regression.

Conclusion: CRM and rewardBench together offer a path to more interpretable and robust reward modeling without requiring additional human annotations beyond those used to train the evaluators.

Abstract: We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.

[213] Socratic Students: Teaching Language Models to Learn by Asking Questions

Rajeev Bhatt Ambati, Tianyi Niu, Aashu Singh, Shlok Mishra, Snigdha Chaturvedi, Shashank Srivastava

Main category: cs.AI

TL;DR: Student-led active learning approach where LLMs learn to ask targeted questions to teachers, improving performance on math and coding tasks through guided DPO training.

DetailsMotivation: LLMs are good at static knowledge retrieval but struggle in dynamic real-world scenarios (education, healthcare) where information must be actively acquired through interaction. Current research focuses on teacher-led instruction, but this paper shifts focus to student-led active questioning.

Method: Students actively query teachers to acquire missing information. Use Direct Preference Optimization (DPO) with guidance from either self or stronger students to improve question quality. Tested on math and coding benchmarks.
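
A sketch of how preference pairs over student questions might be assembled for DPO; the scoring here is a stand-in for the self- or stronger-student guidance:

```python
# Sketch of building DPO preference pairs over student questions: the
# question that elicited more useful teacher information is "chosen".

pairs = []
questions = {
    "What is the answer?": 0.1,                       # low-information query
    "Which base case does the recursion need?": 0.9,  # targeted query
}
chosen = max(questions, key=questions.get)
rejected = min(questions, key=questions.get)
pairs.append({"prompt": "Student is stuck on a recursion bug.",
              "chosen": chosen, "rejected": rejected})
print(pairs[0])
```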

Result: Student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. Guided DPO training enables smaller models to learn better questioning strategies, enhancing learning efficiency.

Conclusion: Active student-led questioning strategies significantly improve LLM performance in interactive settings, and guided training helps smaller models learn effective questioning techniques for better knowledge acquisition.

Abstract: Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.

[214] Universal Reasoning Model

Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai

Main category: cs.AI

TL;DR: Universal Transformers’ performance gains on reasoning tasks come from recurrent inductive bias and Transformer’s nonlinear components, not complex architecture. The proposed Universal Reasoning Model (URM) enhances UT with short convolution and truncated backpropagation, achieving SOTA results on ARC-AGI benchmarks.

DetailsMotivation: Universal Transformers have shown strong performance on complex reasoning tasks like ARC-AGI and Sudoku, but the specific sources of their performance gains remain unclear. The authors aim to systematically analyze UT variants to understand what truly drives their reasoning capabilities.

Method: The paper first systematically analyzes UT variants to identify key performance drivers. Based on findings that recurrent inductive bias and Transformer’s nonlinear components are crucial, they propose the Universal Reasoning Model (URM) which enhances UT with two key components: short convolution and truncated backpropagation.
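
A minimal PyTorch sketch of the two additions, assuming a depthwise short convolution inside a weight-shared block and gradients truncated to the last few recurrent iterations; sizes and step counts are assumptions:

```python
# Minimal sketch: a weight-shared (recurrent) block with a depthwise short
# convolution, applied iteratively with truncated backpropagation.
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    def __init__(self, d=64, kernel=3):
        super().__init__()
        self.short_conv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                       # x: (batch, seq, d)
        y = self.short_conv(x.transpose(1, 2)).transpose(1, 2)
        return x + self.mlp(y)

block, x = RecurrentBlock(), torch.randn(2, 16, 64)
n_iters, backprop_last = 8, 2
for t in range(n_iters):                        # same weights every iteration
    if t == n_iters - backprop_last:
        x = x.detach()                          # truncate the backward pass
    x = block(x)
print(x.shape)  # torch.Size([2, 16, 64])
```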

Result: URM achieves state-of-the-art performance on ARC-AGI benchmarks: 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. The analysis shows that improvements primarily come from recurrent inductive bias and strong nonlinear components rather than elaborate architectural designs.

Conclusion: The performance gains of Universal Transformers on reasoning tasks stem from fundamental properties (recurrent inductive bias and nonlinear components) rather than complex architectural designs. The proposed URM effectively leverages these insights with simple enhancements, achieving superior reasoning performance on challenging benchmarks.

Abstract: Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of the Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at https://github.com/UbiquantAI/URM.

[215] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

Main category: cs.AI

TL;DR: Generative Adversarial Reasoner (GAR) uses adversarial RL with a discriminator to provide dense step-level rewards for improving LLM reasoning, achieving significant gains on math benchmarks.

DetailsMotivation: LLMs with reasoning capabilities still make process errors like incorrect calculations, brittle logic, and superficially plausible but invalid steps. Current methods lack effective step-level feedback for improving reasoning quality.

Method: Introduces Generative Adversarial Reasoner (GAR), an on-policy joint training framework that co-evolves an LLM reasoner and LLM-based discriminator through adversarial RL. Uses compute-efficient review schedule to partition reasoning chains into logically complete slices, with discriminator evaluating each slice’s soundness with structured justifications.
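
A sketch of the review schedule's slicing step, assuming a greedy packing of consecutive steps into slices of comparable length; the paper's exact partitioning heuristic is not specified here:

```python
# Sketch of the compute-efficient review schedule: split a reasoning chain
# into roughly equal-length slices at step boundaries, so the discriminator
# can score each slice.

def slice_chain(steps: list[str], n_slices: int) -> list[list[str]]:
    """Greedily pack consecutive steps into n_slices of comparable length."""
    total = sum(len(s) for s in steps)
    budget, slices, cur, used = total / n_slices, [], [], 0
    for s in steps:
        cur.append(s)
        used += len(s)
        if used >= budget and len(slices) < n_slices - 1:
            slices.append(cur)
            cur, used = [], 0
    slices.append(cur)
    return slices

chain = ["Let x=3.", "Then 2x+1=7.", "So f(7)=49.", "Check: 7*7=49.", "Answer: 49."]
for i, sl in enumerate(slice_chain(chain, 3)):
    print(i, sl)   # each slice goes to the discriminator for a soundness call
```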

Result: Achieves consistent gains across mathematical benchmarks. On AIME24: improves DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). Produces dense, well-calibrated step-level rewards that improve credit assignment and sample efficiency.

Conclusion: GAR framework effectively enhances LLM reasoning by providing dense step-level rewards through adversarial training. The modular discriminator enables flexible reward shaping for various objectives including teacher distillation, preference alignment, and mathematical proof-based reasoning.

Abstract: Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice’s soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.

[216] External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

Jian Yan

Main category: cs.AI

TL;DR: The External Hippocampus framework models LLM reasoning as energy flow in semantic space using cognitive maps, enabling efficient navigation and intervention without training, solving cognitive deadlock in small models.

DetailsMotivation: To address cognitive deadlock in multi-step reasoning for small language models (≤7B parameters) without the computational costs of traditional weight-space optimization methods.

Method: Constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of information energy flow at test time. Uses temperature perturbations to restart energy flow when stagnation occurs.
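
A toy sketch of the stagnation-restart idea, assuming stagnation is detected as near-zero movement in the projected map and answered with a temperature bump; all thresholds are invented:

```python
# Toy sketch: if successive decoding steps stop moving in the projected
# semantic map (a low-entropy "vortex"), bump the sampling temperature.

def next_temperature(recent_positions, temp, eps=0.05):
    """Raise temperature when the 2-D map trajectory stalls."""
    (x0, y0), (x1, y1) = recent_positions[-2], recent_positions[-1]
    moved = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return min(1.5, temp + 0.3) if moved < eps else max(0.7, temp - 0.05)

trajectory = [(0.10, 0.20), (0.11, 0.20), (0.11, 0.21)]  # near-stationary
print(next_temperature(trajectory, temp=0.7))  # 1.0 -> perturb and restart
```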

Result: Achieves 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduces reasoning time by ≥15x, identifies reasoning stagnation as “Cognitive Vortex” and low-entropy potential wells, and shows temperature perturbations effectively restart energy flow.

Conclusion: Provides an efficient, controllable topological-aware solution for small model reasoning that requires no additional training, has autonomous growth capability, and offers predictable intervention patterns.

Abstract: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models ≤7B parameters show: map-guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by ≥15x, with key findings revealing that reasoning stagnation manifests as “Cognitive Vortex” and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological-aware solution for small model reasoning.

[217] Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI – Lessons from Civilization V

John Chen, Sihan Cheng, Can Gurkan, Ryan Lay, Moez Salahuddin

Main category: cs.AI

TL;DR: Vox Deorum: A hybrid LLM+X architecture for Civilization V that uses LLMs for macro-strategic reasoning while delegating tactical execution to subsystems, achieving competitive gameplay with distinct play styles.

DetailsMotivation: LLMs' natural language reasoning capabilities make them promising for 4X/grand strategy games to enable more natural human-AI interactions, but game complexity, latency, and cost hinder real-world deployment.

Method: Hybrid LLM+X architecture with layered design: LLMs handle macro-strategic reasoning while delegating tactical execution to subsystems (algorithmic AI or future RL AI). Tested on Civilization V with Vox Populi mod.
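
A schematic of the layered split, with hypothetical stand-ins for the mod's interfaces: the LLM sets a coarse macro strategy while a cheap tactical subsystem resolves per-unit actions:

```python
# Schematic of the hybrid LLM+X split. All names are hypothetical stand-ins.

def llm_macro_strategy(game_state):            # called rarely (costly, slow)
    return "expand_west" if game_state["open_land_west"] else "fortify"

def tactical_ai(strategy, unit):               # called every turn (cheap, fast)
    moves = {"expand_west": "move_settler_west", "fortify": "garrison_city"}
    return moves[strategy]

state = {"open_land_west": True, "turn": 42}
strategy = llm_macro_strategy(state)           # macro decision
for unit in ["settler_1", "warrior_2"]:
    print(unit, "->", tactical_ai(strategy, unit))  # tactical execution
```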

Result: Tested with 2,327 complete games comparing two open-source LLMs against Vox Populi’s enhanced AI. LLMs achieved competitive end-to-end gameplay with play styles that substantially diverged from algorithmic AI and from each other.

Conclusion: Establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.

Abstract: Large Language Models’ capacity to reason in natural language makes them uniquely promising for 4X and grand strategy games, enabling more natural human-AI gameplay interactions such as collaboration and negotiation. However, these games present unique challenges due to their complexity and long-horizon nature, while latency and cost factors may hinder LLMs’ real-world deployment. Working on a classic 4X strategy game, Sid Meier’s Civilization V with the Vox Populi mod, we introduce Vox Deorum, a hybrid LLM+X architecture. Our layered technical design empowers LLMs to handle macro-strategic reasoning, delegating tactical execution to subsystems (e.g., algorithmic AI or reinforcement learning AI in the future). We validate our approach through 2,327 complete games, comparing two open-source LLMs with a simple prompt against Vox Populi’s enhanced AI. Results show that LLMs achieve competitive end-to-end gameplay while exhibiting play styles that diverge substantially from algorithmic AI and from each other. Our work establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.

[218] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang

Main category: cs.AI

TL;DR: VLSM unifies visual and textual understanding to generate executable simulation code from layout sketches and natural language prompts, enabling cross-modal reasoning for industrial simulation systems.

DetailsMotivation: To enable generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems, addressing the need for cross-modal reasoning in industrial applications.

Method: Proposes Vision-Language Simulation Model (VLSM) that synthesizes executable FlexScript from layout sketches and natural-language prompts. Creates first large-scale dataset with 120,000+ prompt-sketch-code triplets for multimodal learning. Introduces three novel evaluation metrics: Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR).
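
The three metrics reduce to simple ratios over generated samples. A minimal sketch, assuming each generated FlexScript sample is summarized by a small record (the field names, and per-sample averaging for PMR, are assumptions; the paper's operational definitions may differ):

```python
# results: one record per generated FlexScript sample (hypothetical fields).
def evaluate(results):
    n = len(results)
    svr = sum(r["parses"] for r in results) / n            # Structural Validity Rate
    pmr = sum(r["params_matched"] / r["params_total"]      # Parameter Match Rate
              for r in results) / n
    esr = sum(r["ran_in_simulator"] for r in results) / n  # Execution Success Rate
    return {"SVR": svr, "PMR": pmr, "ESR": esr}

print(evaluate([{"parses": True, "params_matched": 7,
                 "params_total": 8, "ran_in_simulator": True}]))
```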

Result: Models achieve near-perfect structural accuracy and high execution robustness through systematic ablation studies across vision encoders, connectors, and code-pretrained language backbones.

Conclusion: This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

Abstract: We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

[219] Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model

Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing

Main category: cs.AI

TL;DR: This paper introduces T-MED, the first large-scale multimodal teacher sentiment analysis dataset, and AAM-TSA, an asymmetric attention-based model that outperforms existing methods for analyzing teacher emotions in educational settings.

DetailsMotivation: Teachers' emotional states significantly impact teaching effectiveness and student outcomes, but existing studies fail to accurately capture teacher emotions due to their performative nature and neglect of instructional context.

Method: 1) Created T-MED dataset with 14,938 instances from 250 real classrooms across 11 subjects using human-machine collaborative labeling; 2) Proposed AAM-TSA model with asymmetric attention mechanism and hierarchical gating unit for differentiated cross-modal feature fusion.
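
One plausible reading of the asymmetric design is that a primary modality (say, text) queries the others but not the reverse, with a learned gate deciding how much cross-modal evidence to admit. A hedged PyTorch sketch under that assumption (the paper's actual architecture may differ):

```python
import torch
import torch.nn as nn

class AsymmetricFusion(nn.Module):
    """Text attends over audio and video (not vice versa); a sigmoid gate
    blends the cross-modal evidence back into the text stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(3 * dim, dim)

    def forward(self, text, audio, video):           # each (B, T, D)
        a, _ = self.attn_a(text, audio, audio)       # text queries audio
        v, _ = self.attn_v(text, video, video)       # text queries video
        g = torch.sigmoid(self.gate(torch.cat([text, a, v], dim=-1)))
        return g * text + (1 - g) * (a + v) / 2      # gated fusion
```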

Result: AAM-TSA significantly outperforms existing state-of-the-art methods in accuracy and interpretability on the T-MED dataset, demonstrating superior performance in teacher sentiment analysis.

Conclusion: The paper successfully addresses limitations in teacher emotion analysis by creating a comprehensive multimodal dataset and developing an effective model that considers both performative aspects and instructional context, advancing educational sentiment analysis research.

Abstract: Teachers’ emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers’ emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.

[220] RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

Main category: cs.AI

TL;DR: RoboSafe: A hybrid reasoning runtime safeguard for embodied agents that uses executable predicate-based safety logic to reduce hazardous actions by 36.8% while maintaining task performance.

DetailsMotivation: Embodied agents powered by vision-language models are increasingly capable but vulnerable to hazardous instructions. Existing defenses (static rule filters, prompt-level control) struggle with implicit risks in dynamic, temporally dependent, and context-rich environments.

Method: Proposes RoboSafe with two complementary reasoning processes on a Hybrid Long-Short Safety Memory: 1) Backward Reflective Reasoning that revisits recent trajectories to infer temporal safety predicates and trigger replanning, and 2) Forward Predictive Reasoning that anticipates upcoming risks by generating context-aware safety predicates from long-term memory and multimodal observations.
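
Because the safety logic is executable, it can be pictured as plain boolean predicates over a (state, action) pair, checked before each action and triggering replanning on violation. The predicate names and thresholds below are invented for illustration:

```python
# Hypothetical safety predicates; real ones would be generated by the
# reasoning modules from safety memory and multimodal observations.
def no_liquid_over_electronics(state, action):
    return not (action.get("holding") == "cup" and
                action.get("target_zone") == "laptop")

def arm_within_workspace(state, action):
    x, y, z = action.get("target_xyz", (0.0, 0.0, 0.0))
    return abs(x) < 0.8 and abs(y) < 0.8 and 0.0 <= z < 1.2

def safe_to_execute(state, action, predicates):
    violations = [p.__name__ for p in predicates if not p(state, action)]
    return len(violations) == 0, violations      # replan if any violation

ok, why = safe_to_execute({}, {"holding": "cup", "target_zone": "laptop"},
                          [no_liquid_over_electronics, arm_within_workspace])
print(ok, why)   # False ['no_liquid_over_electronics']
```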

Result: Extensive experiments show RoboSafe reduces hazardous actions by 36.8% compared to leading baselines while maintaining near-original task performance. Real-world evaluations on physical robotic arms confirm practicality.

Conclusion: RoboSafe provides an adaptive, verifiable safety logic that is both interpretable and executable as code, offering effective runtime safety guardrails for embodied agents in dynamic environments.

Abstract: Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent’s multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

cs.SD

[221] Semantic Codebooks as Effective Priors for Neural Speech Compression

Liuyang Bai, Weiyi Lu, Li Guo

Main category: cs.SD

TL;DR: SemDAC: A semantic-aware neural audio codec that uses semantic codebooks from HuBERT features for efficient speech compression, achieving better quality and lower bitrates than DAC.

DetailsMotivation: Traditional speech codecs optimize for waveform fidelity, wasting bits on acoustic details that can be inferred from linguistic structure, leading to inefficient compression and poor performance on downstream recognition tasks.

Method: SemDAC uses residual vector quantization (RVQ) where the first quantizer is distilled from HuBERT features to produce semantic tokens capturing phonetic content, while subsequent quantizers model residual acoustics. A FiLM-conditioned decoder reconstructs audio conditioned on semantic tokens.
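
A hedged PyTorch sketch of the two ingredients: an RVQ stack whose first codebook is pushed toward HuBERT unit assignments, and a FiLM layer conditioning the decoder on semantic tokens. Straight-through gradients and commitment losses are omitted, and the distillation form is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemRVQ(nn.Module):
    """Two-level residual VQ: level 0 plays the role of the semantic codebook."""
    def __init__(self, dim=128, codes=1024, levels=2):
        super().__init__()
        self.books = nn.ModuleList(nn.Embedding(codes, dim) for _ in range(levels))

    def forward(self, z, hubert_units=None):          # z: (T, dim) frame features
        residual, quantized, idxs = z, torch.zeros_like(z), []
        sem_loss = z.new_zeros(())
        for i, book in enumerate(self.books):
            dists = torch.cdist(residual, book.weight)        # (T, codes)
            if i == 0 and hubert_units is not None:
                # distill level-0 assignments toward HuBERT units
                sem_loss = F.cross_entropy(-dists, hubert_units)
            idx = dists.argmin(dim=-1)
            q = book(idx)
            residual, quantized = residual - q, quantized + q
            idxs.append(idx)
        return quantized, idxs, sem_loss

class FiLM(nn.Module):
    """Decoder conditioning on semantic tokens via scale-and-shift."""
    def __init__(self, dim=128):
        super().__init__()
        self.to_gb = nn.Linear(dim, 2 * dim)

    def forward(self, h, sem):                        # h, sem: (T, dim)
        gamma, beta = self.to_gb(sem).chunk(2, dim=-1)
        return gamma * h + beta
```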

Result: SemDAC outperforms DAC across perceptual metrics and achieves lower WER when running Whisper on reconstructed speech, while operating at substantially lower bitrates (0.95 kbps vs. 2.5 kbps for DAC).

Conclusion: Semantic codebooks provide an effective inductive bias for neural speech compression, producing compact yet recognition-friendly representations that bridge the gap between compression efficiency and downstream task performance.

Abstract: Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on downstream recognition tasks. We propose SemDAC, a semantic-aware neural audio codec that leverages semantic codebooks as effective priors for speech compression. In SemDAC, the first quantizer in a residual vector quantization (RVQ) stack is distilled from HuBERT features to produce semantic tokens that capture phonetic content, while subsequent quantizers model residual acoustics. A FiLM-conditioned decoder reconstructs audio conditioned on the semantic tokens, improving efficiency in the use of acoustic codebooks. Despite its simplicity, this design proves highly effective: SemDAC outperforms DAC across perceptual metrics and achieves lower WER when running Whisper on reconstructed speech, all while operating at substantially lower bitrates (e.g., 0.95 kbps vs. 2.5 kbps for DAC). These results demonstrate that semantic codebooks provide an effective inductive bias for neural speech compression, producing compact yet recognition-friendly representations.

[222] Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning

Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Zahid Hossain, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman

Main category: cs.SD

TL;DR: First systematic benchmark for Bengali deepfake audio detection using BanglaFake dataset, showing fine-tuned models significantly outperform zero-shot approaches.

DetailsMotivation: Bengali deepfake detection is largely unexplored despite the security concerns posed by rapid growth of speech synthesis and voice conversion systems, creating a need for detection methods in this low-resource language.

Method: Evaluated zero-shot inference with pretrained models (Wav2Vec2-XLSR-53, Whisper, PANNsCNN14, WavLM, Audio Spectrogram Transformer), then fine-tuned multiple architectures (Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16, CNN-BiLSTM) for Bengali deepfake detection.

Result: Zero-shot results showed limited detection ability (best: Wav2Vec2-XLSR-53 with 53.80% accuracy). Fine-tuned models achieved strong performance gains, with ResNet18 achieving highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37%, and EER of 24.35%.

Conclusion: Fine-tuning significantly improves performance over zero-shot inference for Bengali deepfake detection, providing the first systematic benchmark and highlighting the effectiveness of fine-tuned deep learning models for this low-resource language.

Abstract: The rapid growth of speech synthesis and voice conversion systems has made deepfake audio a major security concern. Bengali deepfake detection remains largely unexplored. In this work, we study automatic detection of Bengali audio deepfakes using the BanglaFake dataset. We evaluate zero-shot inference with several pretrained models, including Wav2Vec2-XLSR-53, Whisper, PANNsCNN14, WavLM, and the Audio Spectrogram Transformer. Zero-shot results show limited detection ability: the best model, Wav2Vec2-XLSR-53, achieves 53.80% accuracy, 56.60% AUC, and 46.20% EER. We then fine-tune multiple architectures for Bengali deepfake detection, including Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16, and CNN-BiLSTM. Fine-tuned models show strong performance gains: ResNet18 achieves the highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37%, and EER of 24.35%. Experimental results confirm that fine-tuning significantly improves performance over zero-shot inference. This study provides the first systematic benchmark of Bengali deepfake audio detection and highlights the effectiveness of fine-tuned deep learning models for this low-resource language.

[223] Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone

Jiabin Guo

Main category: cs.SD

TL;DR: VSRSPA algorithm improves DOA estimation for single vector hydrophones using signal reconstruction and SPA optimization, achieving better accuracy in multi-source and low-SNR environments.

DetailsMotivation: Traditional DOA estimation methods have limitations in multi-source environments and under noise interference, especially for single vector hydrophones which need improved accuracy and resolution.

Method: Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA) reconstructs the signal model, converts covariance matrix to Toeplitz structure suitable for SPA algorithm, then optimizes using SPA for DOA estimation.
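
The Toeplitz step can be illustrated with the standard diagonal-averaging projection of a sample covariance matrix (the paper's reconstruction from the vector hydrophone's pressure and velocity channels is more involved):

```python
import numpy as np

def toeplitzify(R):
    """Average each diagonal of a sample covariance to impose Hermitian
    Toeplitz structure, as expected by SPA-type solvers."""
    n = R.shape[0]
    T = np.zeros_like(R)
    for k in range(n):
        m = np.mean(np.diagonal(R, offset=k))
        T += m * np.eye(n, k=k)
        if k > 0:
            T += np.conj(m) * np.eye(n, k=-k)
    return T

X = np.random.randn(4, 200) + 1j * np.random.randn(4, 200)  # channel snapshots
R = X @ X.conj().T / X.shape[1]                              # sample covariance
print(toeplitzify(R).round(2))
```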

Result: Simulation analysis confirms superior performance in single and dual-target scenarios, especially under various SNR conditions, with significant advantages in estimation accuracy and resolution compared to traditional methods.

Conclusion: Provides effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions in vector hydrophone signal processing.

Abstract: This article discusses the application of single vector hydrophones in the field of underwater acoustic signal processing for Direction Of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this study introduces a Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This method involves reconstructing the signal model of a single vector hydrophone, converting its covariance matrix into a Toeplitz structure suitable for the Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it using the SPA algorithm to achieve more accurate DOA estimation. Through detailed simulation analysis, this research has confirmed the performance of the proposed algorithm in single and dual-target DOA estimation scenarios, especially under various signal-to-noise ratio (SNR) conditions. The simulation results show that, compared to traditional DOA estimation methods, this algorithm has significant advantages in estimation accuracy and resolution, particularly in multi-source signals and low SNR environments. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.

[224] Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang

Main category: cs.SD

TL;DR: Video-to-Audio models often generate sounds (speech/music) without visual sources - called Insertion Hallucination. Current metrics miss this. Paper introduces HALCON to detect and fix this issue, reducing hallucinations by 50%+.

DetailsMotivation: Existing Video-to-Audio evaluation metrics focus only on semantic/temporal alignment, completely missing a critical failure mode: models generate acoustic events (speech, music) with no corresponding visual source. This Insertion Hallucination is driven by dataset biases (off-screen sounds) and remains undetected by current metrics.

Method: 1) Develop systematic evaluation framework using majority-voting ensemble of multiple audio event detectors. 2) Introduce two novel metrics: IH@vid (fraction of videos with hallucinations) and IH@dur (fraction of hallucinated duration). 3) Propose HALCON: three-stage method that generates initial audio to expose hallucinations, identifies/masks unreliable video features, then regenerates audio with corrected conditioning.
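
Under a frame-level formalization (an assumption; the paper may operate on event segments), both metrics follow directly from the majority-voted detector outputs:

```python
import numpy as np

def ih_metrics(audio_votes, visual_source, quorum=2):
    """audio_votes: per video, a (D, T) boolean array from D event detectors.
    visual_source: per video, a (T,) boolean array, True where an on-screen
    source for the sound exists."""
    vid_flags, dur_fracs = [], []
    for votes, vis in zip(audio_votes, visual_source):
        event = votes.sum(axis=0) >= quorum      # majority-voted audio event
        halluc = event & ~vis                    # sound with no visual source
        vid_flags.append(halluc.any())
        dur_fracs.append(halluc.mean())
    return float(np.mean(vid_flags)), float(np.mean(dur_fracs))  # IH@vid, IH@dur
```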

Result: State-of-the-art V2A models suffer from severe Insertion Hallucination. HALCON reduces both prevalence and duration of hallucinations by over 50% on average across mainstream benchmarks, without degrading (and sometimes improving) conventional audio quality and synchronization metrics.

Conclusion: First work to formally define, systematically measure, and effectively mitigate Insertion Hallucination in Video-to-Audio generation. Paves way for more reliable and faithful V2A models by addressing a previously overlooked but critical failure mode.

Abstract: Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we introduce HALCON to mitigate IH. HALCON follows a three-stage procedure: it first generates initial audio to expose hallucinated segments, then identifies and masks the corresponding unreliable video features, and finally regenerates the audio using the corrected conditioning. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our HALCON method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.

[225] ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu

Main category: cs.SD

TL;DR: ControlAudio is a progressive diffusion approach for text-to-audio generation with fine-grained control over timing and speech content, using multi-task learning and step-by-step conditioning.

DetailsMotivation: Existing controllable text-to-audio generation methods suffer from data scarcity issues that compromise performance at scale, limiting their ability to handle fine-grained control signals like precise timing and intelligible speech content.

Method: 1) Data construction spanning annotation and simulation to augment condition information (text, timing, phoneme). 2) Pretrain diffusion transformer on large-scale text-audio pairs, then incrementally integrate timing and phoneme features. 3) Progressive guided generation at inference that sequentially emphasizes fine-grained information.

Result: Achieves state-of-the-art performance in temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.

Conclusion: ControlAudio successfully addresses data scarcity in controllable TTA generation through multi-task learning and progressive diffusion modeling, enabling scalable generation with fine-grained control over timing and speech content.

Abstract: Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.

cs.LG

[226] Physics-Informed Neural Solvers for Periodic Quantum Eigenproblems

Haaris Mian

Main category: cs.LG

TL;DR: Physics-informed neural network framework solves Floquet-Bloch eigenvalue problems for 2D periodic potentials, focusing on honeycomb lattices with Dirac points, using mesh-free approach with composite loss functions.

DetailsMotivation: The motivation is to develop a physics-informed machine learning approach for solving quantum eigenproblems in periodic potentials, particularly honeycomb lattices (like graphene), which have distinctive band topology with Dirac points and are important for materials science.

Method: Uses neural networks to simultaneously learn complex Bloch functions and eigenvalues through a composite loss function that enforces the Schrödinger equation, Bloch periodicity, and normalization constraints without supervision. The model is trained over the Brillouin zone, and transfer learning is employed to adapt from nearly-free to strongly varying potentials.
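
A minimal PyTorch sketch of the composite loss for a 1D analogue (the thesis works in 2D honeycomb geometry; the potential, network size, and equal loss weights here are placeholders):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))  # (Re, Im) of psi
E = nn.Parameter(torch.tensor(0.5))     # learnable band energy at this k
k, a = 1.0, 1.0                         # Bloch wavevector, lattice period

def V(x):                               # placeholder periodic potential
    return torch.cos(2 * torch.pi * x / a)

def lap(f, x):                          # second derivative via autograd
    g = torch.autograd.grad(f.sum(), x, create_graph=True)[0]
    return torch.autograd.grad(g.sum(), x, create_graph=True)[0]

x = torch.rand(256, 1, requires_grad=True)          # collocation points, one cell
re, im = net(x).chunk(2, dim=-1)
pde = ((-0.5 * lap(re, x) + V(x) * re - E * re) ** 2 +
       (-0.5 * lap(im, x) + V(x) * im - E * im) ** 2).mean()

sre, sim = net(x + a).chunk(2, dim=-1)              # Bloch: psi(x+a) = e^{ika} psi(x)
c, s = torch.cos(torch.tensor(k * a)), torch.sin(torch.tensor(k * a))
bloch = ((sre - (c * re - s * im)) ** 2 + (sim - (s * re + c * im)) ** 2).mean()

norm = ((re ** 2 + im ** 2).mean() * a - 1.0) ** 2  # Monte-Carlo unit norm
loss = pde + bloch + norm                           # composite, fully unsupervised
```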

Result: The framework successfully recovers band structures and Bloch modes, with numerical validation against traditional plane-wave expansion methods. It demonstrates ability to capture changes in band structure topology through transfer learning from simple to complex potentials.

Conclusion: This work contributes to physics-informed machine learning for quantum eigenproblems, providing insights into the relationship between symmetry, band structure, and neural architectures, offering a mesh-free alternative to traditional computational methods.

Abstract: This thesis presents a physics-informed machine learning framework for solving the Floquet-Bloch eigenvalue problem associated with particles in two-dimensional periodic potentials, with a focus on honeycomb lattice geometry, due to its distinctive band topology featuring Dirac points and its relevance to materials such as graphene. By leveraging neural networks to learn complex Bloch functions and their associated eigenvalues (energies) simultaneously, we develop a mesh-free solver enforcing the governing Schrödinger equation, Bloch periodicity, and normalization constraints through a composite loss function without supervision. The model is trained over the Brillouin zone to recover band structures and Bloch modes, with numerical validation against traditional plane-wave expansion methods. We further explore transfer learning techniques to adapt the solver from nearly-free electron potentials to strongly varying potentials, demonstrating its ability to capture changes in band structure topology. This work contributes to the growing field of physics-informed machine learning for quantum eigenproblems, providing insights into the interplay between symmetry, band structure, and neural architectures.

[227] A Reinforcement Learning Approach to Synthetic Data Generation

Natalia Espinosa-Dice, Nicholas J. Jackson, Chao Yan, Aaron Lee, Bradley A. Malin

Main category: cs.LG

TL;DR: RLSyn reframes synthetic data generation as a reinforcement learning problem, using Proximal Policy Optimization with discriminator rewards for more stable and data-efficient training, outperforming GANs and diffusion models especially in small-sample settings.

DetailsMotivation: Synthetic data generation enables biomedical data sharing while preserving privacy, but current generative models require large datasets and complex training, limiting their use in small-sample settings common in biomedical studies.

Method: RLSyn models data generation as a reinforcement learning problem with a stochastic policy over patient records, optimized using Proximal Policy Optimization with discriminator-derived rewards for stable and data-efficient training.
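
The discriminator-derived reward can be as simple as a non-saturating GAN-style signal; the exact shaping in the paper is not given in this summary, so the form below is an assumption:

```python
import torch

def discriminator_reward(disc, synthetic_batch, eps=1e-6):
    """Reward is high when the discriminator believes a synthetic record
    is real (assumed form; `disc` returns one logit per record)."""
    with torch.no_grad():
        p_real = torch.sigmoid(disc(synthetic_batch))   # P(record looks real)
    return torch.log(p_real + eps)                      # per-record reward
```

These per-record rewards then stand in for environment returns when computing PPO's advantage estimates.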

Result: RLSyn performs comparably to diffusion models and outperforms GANs on MIMIC-IV, while outperforming both diffusion models and GANs on the smaller AI-READI dataset across privacy, utility, and fidelity evaluations.

Conclusion: Reinforcement learning provides a principled and effective alternative for synthetic biomedical data generation, particularly valuable in data-scarce regimes where traditional generative models struggle.

Abstract: Synthetic data generation (SDG) is a promising approach for enabling data sharing in biomedical studies while preserving patient privacy. Yet, state-of-the-art generative models often require large datasets and complex training procedures, limiting their applicability in small-sample settings. In this work, we reframe SDG as a reinforcement learning (RL) problem and introduce RLSyn, a novel framework that models the data generator as a stochastic policy over patient records and optimizes it using Proximal Policy Optimization with discriminator-derived rewards, yielding more stable and data-efficient training. We evaluate RLSyn on two biomedical datasets - AI-READI and MIMIC-IV - and benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across extensive privacy, utility, and fidelity evaluations. RLSyn performs comparably to diffusion models and outperforms GANs on MIMIC-IV, while outperforming both diffusion models and GANs on the smaller AI-READI dataset. These results demonstrate that reinforcement learning provides a principled and effective alternative for synthetic biomedical data generation, particularly in data-scarce regimes.

[228] kooplearn: A Scikit-Learn Compatible Library of Algorithms for Evolution Operator Learning

Giacomo Turri, Grégoire Pacreau, Giacomo Meanti, Timothée Devergne, Daniel Ordonez, Erfan Mirzaei, Bruno Belucci, Karim Lounici, Vladimir Kostic, Massimiliano Pontil, Pietro Novelli

Main category: cs.LG

TL;DR: kooplearn is a Python library for learning dynamical operators (Koopman/Transfer operators and infinitesimal generators) using linear, kernel, and deep learning methods, with scikit-learn compatible API.

DetailsMotivation: To provide a unified, easy-to-use library for learning dynamical operators from data, enabling spectral analysis, reduced-order modeling, and forecasting of dynamical systems.

Method: Implements linear, kernel, and deep learning estimators for discrete-time evolution operators (Koopman/Transfer) and continuous-time infinitesimal generators, with scikit-learn compliant API.
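
To illustrate what learning a discrete-time evolution operator means, independent of kooplearn's actual class names (deliberately not used here), the generic least-squares estimate in NumPy is:

```python
import numpy as np

# Given feature snapshots X (time t) and Y (time t+1), the operator
# K = Y X^+ minimizes ||Y - K X||_F; its eigendecomposition yields the
# modes and timescales used for spectral analysis and forecasting.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 500))        # observables at time t
Y = 0.9 * np.roll(X, 1, axis=0)          # observables at time t+1 (toy dynamics)
K = Y @ np.linalg.pinv(X)                # fitted evolution operator
evals, evecs = np.linalg.eig(K)          # spectral decomposition
x_next = K @ X[:, -1]                    # one-step forecast
```

kooplearn provides linear, kernel, and deep variants of this idea behind its scikit-learn-compatible interface.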

Result: A comprehensive machine learning library that facilitates analysis of dynamical systems via spectral methods, data-driven reduced-order modeling, and forecasting, with curated benchmark datasets.

Conclusion: kooplearn provides an accessible, standardized tool for learning dynamical operators, supporting research reproducibility and integration into existing ML workflows through its scikit-learn compatible interface.

Abstract: kooplearn is a machine-learning library that implements linear, kernel, and deep-learning estimators of dynamical operators and their spectral decompositions. kooplearn can model both discrete-time evolution operators (Koopman/Transfer) and continuous-time infinitesimal generators. By learning these operators, users can analyze dynamical systems via spectral methods, derive data-driven reduced-order models, and forecast future states and observables. kooplearn’s interface is compliant with the scikit-learn API, facilitating its integration into existing machine learning and data science workflows. Additionally, kooplearn includes curated benchmark datasets to support experimentation, reproducibility, and the fair comparison of learning algorithms. The software is available at https://github.com/Machine-Learning-Dynamical-Systems/kooplearn.

[229] A Survey of Freshness-Aware Wireless Networking with Reinforcement Learning

Alimu Alibotaiken, Suyang Wang, Oluwaseun T. Ajayi, Yu Cheng

Main category: cs.LG

TL;DR: Survey paper on using reinforcement learning for Age of Information optimization in B5G/6G wireless networks, providing a unified framework and taxonomy for freshness-aware learning approaches.

DetailsMotivation: Existing surveys either focus on classical AoI formulations or discuss RL in wireless networks broadly without addressing freshness as a unified learning problem. There's a gap in understanding how RL can specifically optimize data freshness in next-generation wireless systems.

Method: Organizes AoI variants into three families (native, function-based, application-oriented) and introduces a policy-centric taxonomy with four RL categories: update-control RL, medium-access RL, risk-sensitive RL, and multi-agent RL.

Result: Provides a coherent framework for understanding how learning supports freshness optimization in sampling, scheduling, trajectory planning, medium access, and distributed coordination. Synthesizes recent progress in RL-driven freshness control.

Conclusion: Establishes a unified foundation for learning-based freshness optimization in next-generation wireless networks, highlighting open challenges related to delayed decision processes, stochastic variability, and cross-layer design.

Abstract: The age of information (AoI) has become a central measure of data freshness in modern wireless systems, yet existing surveys either focus on classical AoI formulations or provide broad discussions of reinforcement learning (RL) in wireless networks without addressing freshness as a unified learning problem. Motivated by this gap, this survey examines RL specifically through the lens of AoI and generalized freshness optimization. We organize AoI and its variants into native, function-based, and application-oriented families, providing a clearer view of how freshness should be modeled in B5G and 6G systems. Building on this foundation, we introduce a policy-centric taxonomy that reflects the decisions most relevant to freshness, consisting of update-control RL, medium-access RL, risk-sensitive RL, and multi-agent RL. This structure provides a coherent framework for understanding how learning can support sampling, scheduling, trajectory planning, medium access, and distributed coordination. We further synthesize recent progress in RL-driven freshness control and highlight open challenges related to delayed decision processes, stochastic variability, and cross-layer design. The goal is to establish a unified foundation for learning-based freshness optimization in next-generation wireless networks.

[230] DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction

Khondoker Mirazul Mumenin, Robert Underwood, Dong Dai, Jinzhen Wang, Sheng Di, Zarija Lukić, Franck Cappello

Main category: cs.LG

TL;DR: DeepCQ: A deep-surrogate framework for predicting lossy compression quality that’s generalizable across compressors, metrics, and datasets, with high accuracy and efficiency.

DetailsMotivation: Error-bounded lossy compression is essential for managing large scientific data, but assessing post-compression data quality is computationally expensive due to intensive metric calculations.

Method: Two-stage design decouples expensive feature extraction from lightweight metrics prediction, uses mixture-of-experts for time-evolving data, and is generalizable across compressors, metrics, and datasets.

Result: Exceptional predictive accuracy with errors generally under 10% across most settings, significantly outperforming existing methods on four real-world scientific applications.

Conclusion: DeepCQ empowers scientific users to make informed compression decisions based on preferred data quality, significantly reducing I/O and computational overhead in scientific data analysis.

Abstract: Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the light-weight metrics prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework’s exceptional predictive accuracy, with prediction errors generally under 10% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.

[231] dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning

Shirui Chen, Jiantao Jiao, Lillian J. Ratliff, Banghua Zhu

Main category: cs.LG

TL;DR: dUltra is an RL framework that learns optimal unmasking strategies for masked diffusion language models to enable more efficient parallel token generation, improving the accuracy-efficiency trade-off over existing methods.

DetailsMotivation: Current masked diffusion language models (MDLMs) decode too few tokens per forward pass, making their sampling speeds comparable to autoregressive models with speculative decoding. Existing distillation-based accelerators have limitations: they use off-policy trajectories and are constrained by base model quality.

Method: Proposes dUltra, an on-policy reinforcement learning framework using Group Relative Policy Optimization (GRPO). It introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. Jointly optimizes the base diffusion LLM and unmasking order planner using rewards combining verifiable reward, distillation reward, and number of unmasking steps.
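
A hedged sketch of the planner head: independent per-token Bernoulli unmasking probabilities over currently masked positions, together with the sequence log-probability a GRPO-style ratio needs (shapes and wiring are assumptions):

```python
import torch
import torch.nn as nn

class UnmaskPlanner(nn.Module):
    """Predicts which masked tokens to reveal at this denoising step."""
    def __init__(self, hidden=768):
        super().__init__()
        self.head = nn.Linear(hidden, 1)

    def forward(self, hidden_states, mask):    # mask: (B, T), 1.0 where masked
        p = torch.sigmoid(self.head(hidden_states).squeeze(-1)) * mask
        unmask = torch.bernoulli(p)            # tokens revealed this step
        dist = torch.distributions.Bernoulli(probs=p.clamp(1e-6, 1 - 1e-6))
        logp = (dist.log_prob(unmask) * mask).sum(-1)   # for the GRPO ratio
        return unmask.bool(), logp
```

Fewer denoising steps (larger unmask sets) are rewarded, while the verifiable and distillation rewards keep accuracy from collapsing.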

Result: dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines across mathematical reasoning and code generation tasks, moving toward achieving “diffusion supremacy” over autoregressive models.

Conclusion: The proposed on-policy RL framework enables more efficient parallel decoding for masked diffusion language models, addressing limitations of existing methods and advancing toward practical advantages over autoregressive approaches.

Abstract: Masked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. As a result, their sampling speeds are often comparable to AR + speculative decoding schemes, limiting their advantage over mainstream autoregressive approaches. Existing distillation-based accelerators (dParallel, d3LLM) finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model’s samples. We propose dUltra, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines, moving towards achieving “diffusion supremacy” over autoregressive models.

[232] An Equivariance Toolbox for Learning Dynamics

Yongyi Yang, Liu Ziyin

Main category: cs.LG

TL;DR: The paper develops a general equivariance toolbox that extends classical Noether-type analyses to provide coupled first- and second-order constraints on neural network learning dynamics, connecting transformation structure to optimization geometry.

DetailsMotivation: Existing analyses of symmetry/equivariance in deep learning are problem-specific and focus mainly on first-order consequences like conservation laws, while second-order structure implications remain less understood. There's a need for a general framework that connects transformation structure to optimization geometry.

Method: Develops a general equivariance toolbox extending classical Noether-type analyses in three directions: 1) from gradient constraints to Hessian constraints, 2) from symmetry to general equivariance, and 3) from continuous to discrete transformations. The framework yields coupled first- and second-order constraints on learning dynamics.
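
A schematic version of the two identities for the simplest case, a continuous symmetry (the paper's setting generalizes this to equivariance and discrete transformations). Assume a one-parameter transformation T_eps with T_0 = id and L(T_eps(theta)) = L(theta) for all eps; differentiating once in eps and then once in theta gives:

```latex
\[
\nabla L(\theta)^{\top} v(\theta) = 0,
\qquad
v(\theta) := \partial_{\epsilon} T_{\epsilon}(\theta)\big|_{\epsilon=0}
\quad \text{(first order)}
\]
\[
H(\theta)\, v(\theta) + \big(\partial_{\theta} v(\theta)\big)^{\top} \nabla L(\theta) = 0
\quad \text{(second order)}
\]
```

At a critical point the second identity reduces to H(theta) v(theta) = 0, so the generator direction is a flat direction of the Hessian, matching the curvature predictions described above.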

Result: At first order, the framework unifies conservation laws and implicit-bias relations as special cases of a single identity. At second order, it provides structural predictions about curvature: identifies flat/sharp directions, shows how gradient aligns with Hessian eigenspaces, and reveals how loss landscape geometry reflects underlying transformation structure.

Conclusion: The developed equivariance toolbox provides a general framework for understanding how symmetry and equivariance structure learning dynamics in deep learning, connecting transformation properties to both first-order conservation laws and second-order optimization geometry, with applications to modern empirical observations.

Abstract: Many theoretical results in deep learning can be traced to symmetry or equivariance of neural networks under parameter transformations. However, existing analyses are typically problem-specific and focus on first-order consequences such as conservation laws, while the implications for second-order structure remain less understood. We develop a general equivariance toolbox that yields coupled first- and second-order constraints on learning dynamics. The framework extends classical Noether-type analyses in three directions: from gradient constraints to Hessian constraints, from symmetry to general equivariance, and from continuous to discrete transformations. At the first order, our framework unifies conservation laws and implicit-bias relations as special cases of a single identity. At the second order, it provides structural predictions about curvature: which directions are flat or sharp, how the gradient aligns with Hessian eigenspaces, and how the loss landscape geometry reflects the underlying transformation structure. We illustrate the framework through several applications, recovering known results while also deriving new characterizations that connect transformation structure to modern empirical observations about optimization geometry.

[233] RLLaVA: An RL-central Framework for Language and Vision Assistants

Lei Zhao, Zihao Ma, Boyu Lin, Yuhe Liu, Wenjun Wu, Lei Huang

Main category: cs.LG

TL;DR: RLLaVA is a modular RL framework for vision-language assistants that decouples RL algorithms from model architecture, enabling efficient training of 1B-7B models on common GPUs with competitive performance.

DetailsMotivation: The paper aims to create a flexible RL framework that simplifies implementation of new RL algorithms for vision-language models while maintaining efficiency and compatibility with various training/inference engines.

Method: RLLaVA formulates vision-language assistants as Markov decision processes, decoupling RL algorithmic logic from model architecture and distributed execution. It supports plug-and-play integration of various RL methods and VLMs, with resource-efficient training optimized for 1B-7B scale models.

Result: The framework enables training of 4B-scale models with full-parameter updates on a single 24GB GPU. Experiments show improved performance over base models on multi-modal and agentic tasks, competitive with specialized RL frameworks.

Conclusion: RLLaVA provides an effective, efficient, and extensible RL framework for vision-language assistants that democratizes RL research by reducing implementation complexity while maintaining performance and resource efficiency.

Abstract: We present an RL-central framework for Language and Vision Assistants (RLLaVA) built on a Markov decision process (MDP) formulation. RLLaVA decouples RL algorithmic logic from model architecture and distributed execution, allowing researchers to implement new RL algorithms with minimal code and to plug in a broad family of RL methods and vision-language models (VLMs), while remaining agnostic to specific training and inference engines. RLLaVA makes resource-efficient training of 1B-7B models feasible on common GPUs; notably, 4B-scale models can be trained end-to-end with full-parameter updates on a single 24GB GPU. Experiments on multi-modal and agentic tasks demonstrate RLLaVA’s task extensibility; models trained with it consistently improve over their base models and are competitive with other specially engineered RL frameworks. The code is available at https://github.com/TinyLoopX/RLLaVA.

[234] Statistical vs. Deep Learning Models for Estimating Substance Overdose Excess Mortality in the US

Sukanya Krishna, Marie-Laure Charpignon, Maimuna Majumder

Main category: cs.LG

TL;DR: Deep learning models (especially LSTM) outperform traditional SARIMA for estimating excess substance overdose mortality during COVID-19, providing more accurate counterfactual estimates for public health planning.

DetailsMotivation: The COVID-19 pandemic exacerbated substance overdose mortality trends, but traditional statistical methods like SARIMA may not handle structural disruptions well. There's a need for better methods to estimate excess mortality for understanding pandemic impacts and informing interventions.

Method: Systematic comparison of SARIMA against three deep learning architectures (LSTM, Seq2Seq, Transformer) using CDC data (2015-2019 for training, 2020-2023 for projection). The pipeline includes conformal prediction intervals and convergence analysis across 60+ trials per configuration.
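
The conformal component can be illustrated with standard split conformal prediction (the paper's exact variant is not specified in this summary):

```python
import numpy as np

def conformal_interval(y_cal, yhat_cal, yhat_test, alpha=0.1):
    """Split conformal: widen point forecasts by the finite-sample-corrected
    (1 - alpha) quantile of calibration residuals."""
    scores = np.abs(y_cal - yhat_cal)
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    return yhat_test - q, yhat_test + q     # lower and upper band
```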

Result: LSTM achieved superior point estimation (17.08% MAPE vs. 23.88% for SARIMA) and better-calibrated uncertainty (68.8% vs. 47.9% prediction interval coverage). Attention-based models underperformed due to overfitting to historical means rather than capturing emergent trends.

Conclusion: Carefully validated deep learning models can provide more reliable counterfactual estimates than traditional methods for public health planning, but calibration techniques are essential when deploying neural forecasting in high-stakes domains. The framework is open-source and deployable with state health departments.

Abstract: Substance overdose mortality in the United States claimed over 80,000 lives in 2023, with the COVID-19 pandemic exacerbating existing trends through healthcare disruptions and behavioral changes. Estimating excess mortality, defined as deaths beyond expected levels based on pre-pandemic patterns, is essential for understanding pandemic impacts and informing intervention strategies. However, traditional statistical methods like SARIMA assume linearity, stationarity, and fixed seasonality, which may not hold under structural disruptions. We present a systematic comparison of SARIMA against three deep learning (DL) architectures (LSTM, Seq2Seq, and Transformer) for counterfactual mortality estimation using national CDC data (2015-2019 for training/validation, 2020-2023 for projection). We contribute empirical evidence that LSTM achieves superior point estimation (17.08% MAPE vs. 23.88% for SARIMA) and better-calibrated uncertainty (68.8% vs. 47.9% prediction interval coverage) when projecting under regime change. We also demonstrate that attention-based models (Seq2Seq, Transformer) underperform due to overfitting to historical means rather than capturing emergent trends. Our reproducible pipeline incorporates conformal prediction intervals and convergence analysis across 60+ trials per configuration, and we provide an open-source framework deployable with 15 state health departments. Our findings establish that carefully validated DL models can provide more reliable counterfactual estimates than traditional methods for public health planning, while highlighting the need for calibration techniques when deploying neural forecasting in high-stakes domains.

[235] When Bayesian Tensor Completion Meets Multioutput Gaussian Processes: Functional Universality and Rank Learning

Siyuan Li, Shikai Fang, Lei Cheng, Feng Yin, Yik-Chung Wu, Peter Gerstoft, Sergios Theodoridis

Main category: cs.LG

TL;DR: Proposes RR-FBTC: a rank-revealing functional Bayesian tensor completion method that automatically determines tensor rank while handling continuous multi-dimensional signals with real-valued indices.

DetailsMotivation: Existing functional tensor decomposition methods assume tensor rank is known, but determining optimal rank is NP-hard. There's limited understanding of expressive power of functional low-rank tensor models for continuous signals.

Method: Models latent functions using multioutput Gaussian processes, enabling automatic tensor rank determination during inference. Uses variational inference with closed-form updates for efficient learning.

Result: Establishes universal approximation property for continuous multi-dimensional signals. Experiments on synthetic and real-world datasets show effectiveness and superiority over state-of-the-art approaches.

Conclusion: RR-FBTC successfully addresses the rank determination problem in functional tensor decomposition while maintaining strong expressive power for continuous signals, with practical efficiency demonstrated through experiments.

Abstract: Functional tensor decomposition can analyze multi-dimensional data with real-valued indices, paving the path for applications in machine learning and signal processing. A limitation of existing approaches is the assumption that the tensor rank, a critical parameter governing model complexity, is known. However, determining the optimal rank is a non-deterministic polynomial-time hard (NP-hard) task, and the expressive power of functional low-rank tensor models for continuous signals remains poorly understood. We propose a rank-revealing functional Bayesian tensor completion (RR-FBTC) method. Modeling the latent functions through carefully designed multioutput Gaussian processes, RR-FBTC handles tensors with real-valued indices while enabling automatic tensor rank determination during the inference process. We establish the universal approximation property of the model for continuous multi-dimensional signals, demonstrating its expressive power in a concise format. To learn this model, we employ the variational inference framework and derive an efficient algorithm with closed-form updates. Experiments on both synthetic and real-world datasets demonstrate the effectiveness and superiority of RR-FBTC over state-of-the-art approaches. The code is available at https://github.com/OceanSTARLab/RR-FBTC.

[236] MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding

Aiwei Zhang, Arvind Pillai, Andrew Campbell, Nicholas C. Jacobson

Main category: cs.LG

TL;DR: MotionTeller is a framework that converts minute-level wearable activity data into natural language summaries using LLMs, achieving high semantic and lexical accuracy.

DetailsMotivation: As wearable sensing becomes pervasive, there's a need to transform raw physiological signals (like actigraphy data) into natural language summaries for better human interpretation and use in behavioral monitoring and healthcare applications.

Method: MotionTeller integrates a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text autoregressive generation. Trained on 54,383 (actigraphy, text) pairs from NHANES recordings using cross-entropy loss with supervision only on language tokens.
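
A hedged sketch of the projection design, assuming a Hugging-Face-style frozen causal LM (dimensions, prefix length, and wiring are placeholders):

```python
import torch
import torch.nn as nn

class PrefixProjector(nn.Module):
    """Maps one actigraphy embedding into n_prefix pseudo-token embeddings."""
    def __init__(self, enc_dim=256, llm_dim=4096, n_prefix=8):
        super().__init__()
        self.proj = nn.Linear(enc_dim, n_prefix * llm_dim)
        self.n_prefix, self.llm_dim = n_prefix, llm_dim

    def forward(self, z):                              # z: (B, enc_dim)
        return self.proj(z).view(-1, self.n_prefix, self.llm_dim)

def training_step(llm, projector, z, input_ids):       # llm stays frozen
    prefix = projector(z)                              # only trainable path
    tok = llm.get_input_embeddings()(input_ids)
    embeds = torch.cat([prefix, tok], dim=1)
    ignore = torch.full(prefix.shape[:2], -100,
                        dtype=torch.long, device=input_ids.device)
    labels = torch.cat([ignore, input_ids], dim=1)     # supervise text tokens only
    return llm(inputs_embeds=embeds, labels=labels).loss
```

The -100 labels reproduce the summary's detail that cross-entropy supervision is applied only to the language tokens.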

Result: Achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7% in ROUGE-1. Training loss converges to 0.38 by epoch 15. Qualitative analysis shows capture of circadian structure and behavioral transitions, with PCA plots revealing enhanced cluster alignment in embedding space.

Conclusion: MotionTeller is a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, enabling new pathways for behavioral monitoring, clinical review, and personalized health interventions.

Abstract: As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy - minute-level movement data collected via accelerometers? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54,383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7% in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.

[237] Missing Pattern Tree based Decision Grouping and Ensemble for Deep Incomplete Multi-View Clustering

Wenyuan Yang, Jie Xu, Hongqing He, Jiangzhang Gan, Xiaofeng Zhu

Main category: cs.LG

TL;DR: TreeEIC is a novel incomplete multi-view clustering framework that addresses pair under-utilization by grouping data into decision sets based on missing patterns, performing clustering within each set, and using ensemble knowledge distillation for robust performance.

DetailsMotivation: Real-world multi-view data often has highly inconsistent missing patterns, causing pair under-utilization where available multi-view pairs cannot be fully used, limiting existing IMVC methods' effectiveness.

Method: 1) Missing-pattern tree model groups data into decision sets by missing patterns; 2) Multi-view clustering within each set; 3) Multi-view decision ensemble with uncertainty-based weighting; 4) Ensemble-to-individual knowledge distillation with cross-view consistency and inter-cluster discrimination losses.
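
The grouping step itself is simple: key each sample by its view-availability mask, as sketched below (the data layout is an assumption):

```python
from collections import defaultdict

def group_by_missing_pattern(samples):
    """samples: list over instances, each a tuple of per-view data or None.
    Returns decision sets keyed by the availability bitmask."""
    sets = defaultdict(list)
    for i, views in enumerate(samples):
        pattern = tuple(v is not None for v in views)
        sets[pattern].append(i)
    return sets

# e.g. {(True, False, True): [0, 3], (True, True, True): [1, 2]}: clustering
# then runs within each decision set, so every available pair is used.
```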

Result: Extensive experiments on multiple benchmark datasets show TreeEIC achieves state-of-the-art IMVC performance and exhibits superior robustness under highly inconsistent missing patterns.

Conclusion: TreeEIC effectively addresses pair under-utilization in incomplete multi-view clustering by fully exploiting available multi-view pairs through missing-pattern grouping, ensemble aggregation, and knowledge distillation, demonstrating strong performance and robustness.

Abstract: Real-world multi-view data usually exhibits highly inconsistent missing patterns which challenges the effectiveness of incomplete multi-view clustering (IMVC). Although existing IMVC methods have made progress from both imputation-based and imputation-free routes, they have overlooked the pair under-utilization issue, i.e., inconsistent missing patterns make the incomplete but available multi-view pairs unable to be fully utilized, thereby limiting the model performance. To address this, we propose a novel missing-pattern tree based IMVC framework entitled TreeEIC. Specifically, to achieve full exploitation of available multi-view pairs, TreeEIC first defines the missing-pattern tree model to group data into multiple decision sets according to different missing patterns, and then performs multi-view clustering within each set. Furthermore, a multi-view decision ensemble module is proposed to aggregate clustering results from all decision sets, which infers uncertainty-based weights to suppress unreliable clustering decisions and produce robust decisions. Finally, an ensemble-to-individual knowledge distillation module transfers the ensemble knowledge to view-specific clustering models, which enables ensemble and individual modules to promote each other by optimizing cross-view consistency and inter-cluster discrimination losses. Extensive experiments on multiple benchmark datasets demonstrate that our TreeEIC achieves state-of-the-art IMVC performance and exhibits superior robustness under highly inconsistent missing patterns.

[238] Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training

Lei Liu, Hao Zhu, Yue Shen, Zhixuan Chu, Jian Wang, Jinjie Gu, Kui Ren

Main category: cs.LG

TL;DR: Proposes a perplexity-aware data scaling law for continual pre-training that uses model perplexity on domain data to select high-utility subsets, improving data efficiency and performance.

DetailsMotivation: Standard scaling laws show diminishing returns when simply increasing data for continual pre-training, leading to suboptimal data utilization and inefficient training. The paper aims to address this by developing a more sophisticated approach to data selection.

Method: Proposes a novel perplexity-aware data scaling law that establishes a predictive relationship between the perplexity landscape of domain-specific data and test loss. Uses perplexity from pre-trained models on domain data as a proxy for estimating knowledge gaps, quantifying the informational perplexity landscape of candidate training samples. Fits this scaling law across diverse perplexity regimes to enable adaptive selection of high-utility data subsets.

Result: Extensive experiments show the method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks compared to standard approaches.

Conclusion: The proposed perplexity-aware data scaling law effectively addresses the data efficiency problem in continual pre-training by enabling intelligent data selection based on knowledge gaps, leading to better performance with less data.

Abstract: Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the perplexity derived from the pre-trained model on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments demonstrate that our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.
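
A minimal sketch of perplexity-based data scoring with HuggingFace transformers follows; the band-selection rule here is a placeholder, whereas the paper fits a scaling law across perplexity regimes to choose the subset.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexities(texts, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    out = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt").input_ids
            loss = model(ids, labels=ids).loss      # mean per-token NLL
            out.append(torch.exp(loss).item())
    return out

def select_band(texts, ppls, lo_q=0.3, hi_q=0.8):
    # placeholder rule: keep a mid-perplexity band of candidate samples
    lo, hi = np.quantile(ppls, [lo_q, hi_q])
    return [t for t, p in zip(texts, ppls) if lo <= p <= hi]
```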

[239] Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data

Hongqing He, Jie Xu, Wenyuan Yang, Yonghua Zhu, Guoqiu Wen, Xiaofeng Zhu

Main category: cs.LG

TL;DR: A unified contrastive learning framework for multi-view clustering that addresses rare-paired and mis-paired issues in incomplete/noisy multi-view data through global-graph guided and local-graph weighted contrastive learning.

DetailsMotivation: Real-world multi-view data often suffers from incompleteness and noise, leading to rare-paired samples (insufficient complementary information) and mis-paired samples (wrong optimization direction), which challenge the effectiveness of contrastive learning for multi-view clustering.

Method: Proposes a unified CL-based MVC framework with two key components: 1) Global-graph guided contrastive learning that constructs a global-view affinity graph from all view samples to form new sample pairs for exploring complementary information, and 2) Local-graph weighted contrastive learning that uses local neighbors to generate pair-wise weights to adaptively strengthen or weaken contrastive learning. The method is imputation-free and integrates both approaches into a unified global-local graph-guided framework.

Result: Extensive experiments on both incomplete and noisy settings of multi-view data demonstrate superior performance compared with state-of-the-art approaches.

Conclusion: The proposed unified global-local graph-guided contrastive learning framework effectively addresses rare-paired and mis-paired issues in incomplete and noisy multi-view data, achieving better clustering performance without requiring data imputation.

Abstract: Contrastive learning (CL) plays an important role in exploring complementary information for multi-view clustering (MVC) and has attracted increasing attention. Nevertheless, real-world multi-view data suffer from incompleteness or noise, resulting in rare-paired or mis-paired samples that significantly challenge the effectiveness of CL-based MVC. That is, the rare-paired issue prevents MVC from extracting sufficient multi-view complementary information, and the mis-paired issue causes contrastive learning to optimize the model in the wrong direction. To address these issues, we propose a unified CL-based MVC framework for enhancing clustering effectiveness on incomplete and noisy multi-view data. First, to overcome the rare-paired issue, we design a global-graph guided contrastive learning, where all view samples construct a global-view affinity graph to form new sample pairs for fully exploring complementary information. Second, to mitigate the mis-paired issue, we propose a local-graph weighted contrastive learning, which leverages local neighbors to generate pair-wise weights that adaptively strengthen or weaken the pair-wise contrastive learning. Our method is imputation-free and can be integrated into a unified global-local graph-guided contrastive learning framework. Extensive experiments on both incomplete and noisy settings of multi-view data demonstrate that our method achieves superior performance compared with state-of-the-art approaches.
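
One simple way to realize neighbor-derived pair weights is the Jaccard overlap of cross-view k-NN sets, sketched below; this is an illustrative weighting, not necessarily the paper's exact formula.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sets(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop self-neighbor
    return [set(row) for row in idx]

def pairwise_weights(view_a, view_b, k=5):
    """Weight in [0, 1] for each cross-view pair (a_i, b_i): pairs whose local
    neighborhoods agree are strengthened; suspected mis-pairs are down-weighted."""
    na, nb = knn_sets(view_a, k), knn_sets(view_b, k)
    return np.array([len(na[i] & nb[i]) / len(na[i] | nb[i]) for i in range(len(na))])

w = pairwise_weights(np.random.randn(50, 8), np.random.randn(50, 8))
```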

[240] First Provable Guarantees for Practical Private FL: Beyond Restrictive Assumptions

Egor Shulgin, Grigory Malinovsky, Sarit Khirirat, Peter Richtárik

Main category: cs.LG

TL;DR: Fed-α-NormEC is the first differentially private FL framework with provable convergence and DP guarantees under standard assumptions, supporting practical features like multiple local updates and partial client participation.

DetailsMotivation: Current differentially private federated learning methods rely on unrealistic assumptions (bounded gradients or heterogeneity) and neglect practical FL features like multiple local updates and partial client participation, hindering real-world application.

Method: Fed-α-NormEC integrates local updates (full and incremental gradient steps), separate server and client stepsizes, and crucially supports partial client participation. It provides provable convergence and differential privacy guarantees under standard assumptions.

Result: The framework offers theoretical guarantees for both convergence and differential privacy while supporting practical FL features. Experiments on private deep learning tasks corroborate these theoretical results.

Conclusion: Fed-α-NormEC is the first differentially private FL framework that provides provable guarantees under standard assumptions while fully supporting practical deployment features like partial client participation, which is essential for real-world applications and privacy amplification.

Abstract: Federated Learning (FL) enables collaborative training on decentralized data. Differential privacy (DP) is crucial for FL, but current private methods often rely on unrealistic assumptions (e.g., bounded gradients or heterogeneity), hindering practical application. Existing works that relax these assumptions typically neglect practical FL features, including multiple local updates and partial client participation. We introduce Fed-$α$-NormEC, the first differentially private FL framework providing provable convergence and DP guarantees under standard assumptions while fully supporting these practical features. Fed-$α$-NormEC integrates local updates (full and incremental gradient steps), separate server and client stepsizes, and, crucially, partial client participation, which is essential for real-world deployment and vital for privacy amplification. Our theoretical guarantees are corroborated by experiments on private deep learning tasks.
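
The sketch below shows a generic differentially private FL round with partial participation and unit-normalized client updates; it is a heavy simplification of Fed-$α$-NormEC, whose actual update rule, stepsizes, and noise calibration are specified in the paper.

```python
import numpy as np

def dp_fl_round(w, local_update_fn, n_clients, sample_rate, lr_server, sigma, rng):
    """One server round: sample clients, unit-normalize their updates, add Gaussian noise."""
    m = max(1, rng.binomial(n_clients, sample_rate))          # partial participation
    chosen = rng.choice(n_clients, size=m, replace=False)
    agg = np.zeros_like(w)
    for c in chosen:
        delta = local_update_fn(c, w)                          # may be several local steps
        agg += delta / max(np.linalg.norm(delta), 1e-12)       # bounded-sensitivity update
    noise = sigma * rng.normal(size=w.shape)                   # Gaussian mechanism
    return w - lr_server * (agg + noise) / m

rng = np.random.default_rng(0)
w = np.zeros(5)
toy_update = lambda c, w: w - rng.normal(size=w.shape)         # toy local update
w = dp_fl_round(w, toy_update, n_clients=100, sample_rate=0.1,
                lr_server=0.5, sigma=0.8, rng=rng)
```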

[241] Generative Actor Critic

Aoyang Qin, Deqian Kong, Wei Wang, Ying Nian Wu, Song-Chun Zhu, Sirui Xie

Main category: cs.LG

TL;DR: GAC is a novel RL framework that decouples policy evaluation as learning a generative model of trajectories and returns, and policy improvement as inference on this model, enabling better offline-to-online refinement.

DetailsMotivation: Conventional RL algorithms struggle with refining offline pretrained models using online experiences, especially when focused on expected returns. There's a need for a framework that can better leverage offline pretraining for online improvement.

Method: GAC reframes policy evaluation as learning a generative model p(τ, y) of joint trajectory-return distributions, and policy improvement as versatile inference on this model. It uses a latent variable model with continuous latent plan vectors, with separate inference strategies for exploitation (optimizing latent plans to maximize returns) and exploration (sampling latent plans conditioned on dynamically adjusted target returns).

Result: Experiments on Gym-MuJoCo and Maze2D benchmarks show GAC achieves strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods, even without step-wise rewards.

Conclusion: GAC provides an effective framework for decoupling sequential decision-making through generative modeling and inference, enabling better utilization of offline pretraining for online refinement in RL.

Abstract: Conventional Reinforcement Learning (RL) algorithms, typically focused on estimating or maximizing expected returns, face challenges when refining offline pretrained models with online experiences. This paper introduces Generative Actor Critic (GAC), a novel framework that decouples sequential decision-making by reframing \textit{policy evaluation} as learning a generative model of the joint distribution over trajectories and returns, $p(τ, y)$, and \textit{policy improvement} as performing versatile inference on this learned model. To operationalize GAC, we introduce a specific instantiation based on a latent variable model that features continuous latent plan vectors. We develop novel inference strategies for both \textit{exploitation}, by optimizing latent plans to maximize expected returns, and \textit{exploration}, by sampling latent plans conditioned on dynamically adjusted target returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC’s strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods, even in the absence of step-wise rewards.
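
A minimal sketch of the exploitation-style inference, assuming a learned return predictor over latent plans; the `return_model` here is a toy stand-in for the paper's joint generative model $p(τ, y)$.

```python
import torch

def optimize_latent_plan(return_model, z_dim=16, steps=100, lr=0.05):
    """Gradient-ascend a latent plan z on the predicted return."""
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-return_model(z)).backward()   # maximize predicted return
        opt.step()
    return z.detach()

# toy return predictor standing in for the learned generative model
net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
z_star = optimize_latent_plan(lambda z: net(z).squeeze())
```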

[242] AVP-Fusion: Adaptive Multi-Modal Fusion and Contrastive Learning for Two-Stage Antiviral Peptide Identification

Xinru Wen, Weizhong Lin, Xuan Xiao

Main category: cs.LG

TL;DR: AVP-Fusion is a two-stage deep learning framework for antiviral peptide identification that uses adaptive feature fusion and contrastive learning to achieve state-of-the-art performance and enable precise viral family/subtype prediction.

DetailsMotivation: Current computational methods for antiviral peptide (AVP) identification struggle with capturing intricate sequence dependencies and handling ambiguous, hard-to-classify samples, which is critical for accelerating novel antiviral drug development.

Method: Two-stage framework: 1) Integrates adaptive feature fusion with panoramic feature space (10 descriptors), Adaptive Gating Mechanism to dynamically regulate CNN-extracted local motifs and BiLSTM-captured global dependencies, and contrastive learning with OHEM and BLOSUM62-based data augmentation. 2) Uses transfer learning for precise subclass prediction across viral families.

Result: Achieves accuracy of 0.9531 and MCC of 0.9064 on benchmark Set 1 dataset, significantly outperforming state-of-the-art methods. Enables precise subclass prediction for six viral families and eight specific viruses even with limited samples.

Conclusion: AVP-Fusion serves as a robust and interpretable tool for high-throughput antiviral drug screening, addressing key challenges in AVP identification through adaptive feature fusion and contrastive learning.

Abstract: Accurate identification of antiviral peptides (AVPs) is critical for accelerating novel drug development. However, current computational methods struggle to capture intricate sequence dependencies and effectively handle ambiguous, hard-to-classify samples. To address these challenges, we propose AVP-Fusion, a novel two-stage deep learning framework integrating adaptive feature fusion and contrastive learning. Unlike traditional static feature concatenation, we construct a panoramic feature space using 10 distinct descriptors and introduce an Adaptive Gating Mechanism. This mechanism dynamically regulates the weights of local motifs extracted by CNNs and global dependencies captured by BiLSTMs based on sequence context. Furthermore, to address data distribution challenges, we employ a contrastive learning strategy driven by Online Hard Example Mining (OHEM) and BLOSUM62-based data augmentation, which significantly sharpens the model’s decision boundaries. Experimental results on the benchmark Set 1 dataset demonstrate that AVP-Fusion achieves an accuracy of 0.9531 and an MCC of 0.9064, significantly outperforming state-of-the-art methods. In the second stage, leveraging transfer learning, the model enables precise subclass prediction for six viral families and eight specific viruses, even under limited sample sizes. In summary, AVP-Fusion serves as a robust and interpretable tool for high-throughput antiviral drug screening.
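
A minimal sketch of such a gating mechanism follows, with illustrative layer sizes: a sigmoid gate computed from the concatenated features convexly mixes the CNN local-motif branch with the BiLSTM global branch.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Sigmoid gate over concatenated features; convexly mixes the two branches."""
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, h_cnn, h_lstm):
        g = self.gate(torch.cat([h_cnn, h_lstm], dim=-1))   # context-dependent weights
        return g * h_cnn + (1.0 - g) * h_lstm

fused = AdaptiveGate(128)(torch.randn(4, 128), torch.randn(4, 128))
```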

[243] Discovering Sparse Recovery Algorithms Using Neural Architecture Search

Patrick Yubeaton, Sarthak Gupta, M. Salman Asif, Chinmay Hegde

Main category: cs.LG

TL;DR: Meta-learning framework using Neural Architecture Search (NAS) to automatically discover signal processing algorithms, demonstrated by rediscovering ISTA and FISTA from a large search space.

DetailsMotivation: Designing novel algorithms for inverse problems in signal processing is difficult, heuristic-driven, and time-consuming, motivating the need for automated algorithm discovery.

Method: Developed a meta-learning framework using Neural Architecture Search (NAS) tools to search through a space of over 50,000 variables to rediscover key elements of ISTA and FISTA algorithms.

Result: Successfully rediscovered several key elements of ISTA and FISTA algorithms, and demonstrated the framework’s applicability to various data distributions and other algorithms beyond ISTA/FISTA.

Conclusion: Meta-learning and NAS tools can automate algorithm discovery in signal processing, potentially reducing the time and heuristic effort required for developing novel algorithms.

Abstract: The design of novel algorithms for solving inverse problems in signal processing is an incredibly difficult, heuristic-driven, and time-consuming task. In this short paper, we explore the idea of automated algorithm discovery in the signal processing context through meta-learning tools such as Neural Architecture Search (NAS). Specifically, we examine the Iterative Shrinkage Thresholding Algorithm (ISTA) and its accelerated Fast ISTA (FISTA) variant as candidates for algorithm rediscovery. We develop a meta-learning framework which is capable of rediscovering (several key elements of) the two aforementioned algorithms when given a search space of over 50,000 variables. We then show how our framework can be applied to various data distributions and algorithms besides ISTA/FISTA.
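
For reference, the standard ISTA iteration that the framework rediscovers, i.e., a gradient step on the least-squares term followed by soft-thresholding (a textbook algorithm, not code from the paper):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=200):
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the data-fit gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

A = np.random.randn(30, 100)
x0 = np.zeros(100); x0[:5] = 1.0               # sparse ground truth
x_hat = ista(A, A @ x0, lam=0.1)
```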

[244] AnchorGK: Anchor-based Incremental and Stratified Graph Learning Framework for Inductive Spatio-Temporal Kriging

Xiaobin Ren, Kaiqi Zhao, Katerina Taškova, Patricia Riddle

Main category: cs.LG

TL;DR: AnchorGK is a novel graph learning framework for spatio-temporal kriging that uses anchor-based stratification to handle sparse sensor deployments and heterogeneous feature availability, outperforming state-of-the-art methods.

DetailsMotivation: Existing spatio-temporal kriging methods under-exploit two practical characteristics: sparse spatial distribution of sensor locations and heterogeneous availability of auxiliary features across locations. This leads to suboptimal performance in real-world sensor network deployments.

Method: AnchorGK introduces anchor locations to stratify data based on feature availability, forming strata around anchors. It uses a dual-view graph learning layer that jointly aggregates feature-relevant and location-relevant information, with an incremental representation mechanism to handle feature incompleteness while preserving informative signals.

Result: Extensive experiments on multiple benchmark datasets demonstrate that AnchorGK consistently outperforms state-of-the-art baselines for spatio-temporal kriging.

Conclusion: AnchorGK effectively addresses the challenges of sparse spatial distributions and heterogeneous feature availability in sensor networks through principled stratification and dual-view graph learning, providing accurate inference in inductive settings.

Abstract: Spatio-temporal kriging is a fundamental problem in sensor networks, driven by the sparsity of deployed sensors and the resulting missing observations. Although recent approaches model spatial and temporal correlations, they often under-exploit two practical characteristics of real deployments: the sparse spatial distribution of locations and the heterogeneous availability of auxiliary features across locations. To address these challenges, we propose AnchorGK, an Anchor-based Incremental and Stratified Graph Learning framework for inductive spatio-temporal kriging. AnchorGK introduces anchor locations to stratify the data in a principled manner. Anchors are constructed according to feature availability, and strata are then formed around these anchors. This stratification serves two complementary roles. First, it explicitly represents and continuously updates correlations between unobserved regions and surrounding observed locations within a graph learning framework. Second, it enables the systematic use of all available features across strata via an incremental representation mechanism, mitigating feature incompleteness without discarding informative signals. Building on the stratified structure, we design a dual-view graph learning layer that jointly aggregates feature-relevant and location-relevant information, learning stratum-specific representations that support accurate inference under inductive settings. Extensive experiments on multiple benchmark datasets demonstrate that AnchorGK consistently outperforms state-of-the-art baselines for spatio-temporal kriging.

[245] RefineBridge: Generative Bridge Models Improve Financial Forecasting by Foundation Models

Anthony Bolton, Wuyang Zhou, Zehua Chen, Giorgos Iacovides, Danilo Mandic

Main category: cs.LG

TL;DR: RefineBridge: A Schrödinger Bridge-based refinement module that improves transformer-based time series foundation models for financial forecasting by learning context-conditioned stochastic transport maps.

DetailsMotivation: Financial time series forecasting is challenging for transformer-based time series foundation models (TSFMs) due to non-stationarity, heavy-tailed distributions, and high-frequency noise. Existing adaptation methods like LoRA underperform because they preserve the original network architecture and training objectives rather than complementing the foundation model.

Method: Proposes RefineBridge, a refinement module built on a tractable Schrödinger Bridge generative framework. It takes TSFM forecasts as generative priors and observed ground truths as targets, learning context-conditioned stochastic transport maps to iteratively improve predictions from even low-quality priors.

Result: Simulations on multiple financial benchmarks show that RefineBridge consistently improves the performance of state-of-the-art TSFMs across different prediction horizons.

Conclusion: RefineBridge effectively enhances TSFMs for financial forecasting by providing a complementary refinement mechanism that addresses the limitations of existing adaptation methods, demonstrating consistent performance improvements across various financial benchmarks.

Abstract: Financial time series forecasting is particularly challenging for transformer-based time series foundation models (TSFMs) due to non-stationarity, heavy-tailed distributions, and high-frequency noise present in data. Low-rank adaptation (LoRA) has become a popular parameter-efficient method for adapting pre-trained TSFMs to downstream data domains. However, it still underperforms on financial data, as it preserves the network architecture and training objective of TSFMs rather than complementing the foundation model. To further enhance TSFMs, we propose a novel refinement module, RefineBridge, built upon a tractable Schrödinger Bridge (SB) generative framework. Given the forecasts of a TSFM as the generative prior and the observed ground truths as targets, RefineBridge learns context-conditioned stochastic transport maps to improve TSFM predictions, iteratively approaching the ground-truth target from even a low-quality prior. Simulations on multiple financial benchmarks demonstrate that RefineBridge consistently improves the performance of state-of-the-art TSFMs across different prediction horizons.
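
A schematic bridge-style refiner is sketched below, assuming a network trained to map interpolants between the TSFM prior forecast and the ground truth back to the target; this is illustrative only and omits the paper's tractable SB construction.

```python
import torch
import torch.nn as nn

class Refiner(nn.Module):
    def __init__(self, horizon, d_ctx):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(horizon + d_ctx + 1, 128),
                                 nn.ReLU(), nn.Linear(128, horizon))

    def forward(self, x_t, ctx, t):
        # predict the ground-truth target from a point on the prior-to-target path
        return self.net(torch.cat([x_t, ctx, t], dim=-1))

def training_step(refiner, prior, target, ctx):
    t = torch.rand(prior.shape[0], 1)
    x_t = (1 - t) * prior + t * target          # interpolant between prior and target
    return ((refiner(x_t, ctx, t) - target) ** 2).mean()

ref = Refiner(horizon=12, d_ctx=8)
loss = training_step(ref, torch.randn(16, 12), torch.randn(16, 12), torch.randn(16, 8))
```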

[246] Videos are Sample-Efficient Supervisions: Behavior Cloning from Videos via Latent Representations

Xin Liu, Haoran Li, Dongbin Zhao

Main category: cs.LG

TL;DR: BCV-LR enables sample-efficient imitation learning from videos by extracting latent actions through self-supervised learning and iterative policy improvement.

DetailsMotivation: Humans can learn efficiently from videos with few trials, but replicating this for autonomous agents is challenging due to visual complexity, lack of action/reward signals, and limited interactions.

Method: BCV-LR extracts action-related latent features from videos via self-supervised tasks, predicts latent actions between frames using dynamics-based unsupervised objectives, then fine-tunes and aligns these to real action space online for policy behavior cloning.

Result: Achieves expert-level performance on some tasks with few interactions, surpassing state-of-the-art ILV baselines and RL methods in sample efficiency across 24/28 tasks.

Conclusion: First demonstration that videos alone can support extremely sample-efficient visual policy learning without expert supervision.

Abstract: Humans can efficiently extract knowledge and learn skills from videos with only a few trials and errors. However, replicating this learning process in autonomous agents poses a major challenge, due to the complexity of visual input, the absence of action or reward signals, and the limited number of interaction steps. In this paper, we propose a novel, unsupervised, and sample-efficient framework to achieve imitation learning from videos (ILV), named Behavior Cloning from Videos via Latent Representations (BCV-LR). BCV-LR extracts action-related latent features from high-dimensional video inputs through self-supervised tasks, and then leverages a dynamics-based unsupervised objective to predict latent actions between consecutive frames. The pre-trained latent actions are fine-tuned and efficiently aligned to the real action space online (with collected interactions) for policy behavior cloning. The cloned policy in turn enriches the agent experience for further latent action finetuning, resulting in an iterative policy improvement that is highly sample-efficient. We conduct extensive experiments on a set of challenging visual tasks, including both discrete control and continuous control. BCV-LR enables effective (even expert-level on some tasks) policy performance with only a few interactions, surpassing state-of-the-art ILV baselines and reinforcement learning methods (provided with environmental rewards) in terms of sample efficiency across 24/28 tasks. To the best of our knowledge, this work demonstrates for the first time that videos can support extremely sample-efficient visual policy learning, without the need to access any other expert supervision.
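
A minimal sketch of the latent-action objective, with hypothetical names and sizes: an inverse model infers a latent action from consecutive latent states, and a forward model must reconstruct the next state from it.

```python
import torch
import torch.nn as nn

d_z, d_a = 64, 8                                 # latent state / latent action sizes
inverse_model = nn.Sequential(nn.Linear(2 * d_z, 128), nn.ReLU(), nn.Linear(128, d_a))
forward_model = nn.Sequential(nn.Linear(d_z + d_a, 128), nn.ReLU(), nn.Linear(128, d_z))

def latent_action_loss(z_t, z_t1):
    a_hat = inverse_model(torch.cat([z_t, z_t1], dim=-1))     # inferred latent action
    z_pred = forward_model(torch.cat([z_t, a_hat], dim=-1))   # dynamics consistency
    return ((z_pred - z_t1) ** 2).mean()

loss = latent_action_loss(torch.randn(32, d_z), torch.randn(32, d_z))
```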

[247] Robustness and Scalability Of Machine Learning for Imbalanced Clinical Data in Emergency and Critical Care

Yusuf Brima, Marcellin Atemkeng

Main category: cs.LG

TL;DR: Systematic evaluation shows tree-based methods (especially XGBoost) outperform deep learning models on imbalanced clinical tabular data, offering better robustness, scalability, and computational efficiency for emergency care applications.

DetailsMotivation: Emergency and intensive care environments need accurate and computationally efficient predictive models, but clinical data often suffers from severe class imbalance that undermines model reliability for rare but crucial outcomes. Robustness and scalability are essential for real-world clinical usage.

Method: Systematically evaluated robustness and scalability of classical ML models on imbalanced tabular data from MIMIC-IV-ED and eICU datasets. Compared tree-based methods, TabNet deep learning model, and a custom lightweight TabResNet (replacing TabNet’s attention mechanisms with residual architecture). Used Bayesian hyperparameter search and assessed performance, robustness to increasing imbalance, and computational scalability across seven clinically vital predictive tasks.

Result: Tree-based methods, particularly XGBoost, consistently achieved the most stable performance across imbalance levels and scaled efficiently with sample size. Deep tabular models degraded more sharply under imbalance and incurred higher computational costs. TabResNet provided a lighter alternative to TabNet but did not surpass ensemble benchmarks.

Conclusion: In emergency and critical care, robustness to imbalance and computational scalability outweigh architectural complexity. Tree-based ensemble methods currently offer the most practical and clinically feasible choice, providing practitioners with a framework for selecting models suited to high-stakes, time-sensitive environments.

Abstract: Emergency and intensive care environments require predictive models that are both accurate and computationally efficient, yet clinical data in these settings are often severely imbalanced. Such skewness undermines model reliability, particularly for rare but clinically crucial outcomes, making robustness and scalability essential for real-world usage. In this paper, we systematically evaluate the robustness and scalability of classical machine learning models on imbalanced tabular data from MIMIC-IV-ED and eICU. Class imbalance was quantified using complementary metrics, and we compared the performance of tree-based methods, the state-of-the-art TabNet deep learning model, and a custom lightweight residual network. TabResNet was designed as a computationally efficient alternative to TabNet, replacing its complex attention mechanisms with a streamlined residual architecture to maintain representational capacity for real-time clinical use. All models were optimized via a Bayesian hyperparameter search and assessed on predictive performance, robustness to increasing imbalance, and computational scalability. Our results, on seven clinically vital predictive tasks, show that tree-based methods, particularly XGBoost, consistently achieved the most stable performance across imbalance levels and scaled efficiently with sample size. Deep tabular models degraded more sharply under imbalance and incurred higher computational costs, while TabResNet provided a lighter alternative to TabNet but did not surpass ensemble benchmarks. These findings indicate that in emergency and critical care, robustness to imbalance and computational scalability could outweigh architectural complexity. Tree-based ensemble methods currently offer the most practical and clinically feasible choice, equipping practitioners with a framework for selecting models suited to high-stakes, time-sensitive environments.
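
The sketch below shows the kind of imbalance-aware XGBoost baseline evaluated here; the configuration is hypothetical, since the paper tunes hyperparameters via Bayesian search.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.random.randn(5000, 20)
y = (np.random.rand(5000) < 0.05).astype(int)            # ~5% positive class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
pos_weight = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)   # reweight the rare class

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    scale_pos_weight=pos_weight, eval_metric="aucpr")
clf.fit(X_tr, y_tr)
```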

[248] A Data-Driven Multi-Objective Approach for Predicting Mechanical Performance, Flowability, and Porosity in Ultra-High-Performance Concrete (UHPC)

Jagaran Chakma, Zhiguang Zhou, Jyoti Chakma, Cao YuSen

Main category: cs.LG

TL;DR: A machine learning framework using XGBoost with data preprocessing and feature selection to predict UHPC properties, achieving high accuracy and reducing experimental needs.

DetailsMotivation: To reduce extensive experimental testing in UHPC mix design by developing an accurate predictive model for mechanical performance, flow ability, and porosity.

Method: Two-stage process: 1) Test 21 ML algorithms, select XGBoost as best performer with hyperparameter tuning via Random Search and K-Fold CV; 2) Clean data by removing multicollinear features, identifying outliers with Isolation Forest, selecting important features using SHAP analysis, then retrain XGBoost.

Result: XGBoost achieved high prediction accuracy across all UHPC properties (mechanical performance, flow ability, porosity). Developed GUI for material designers.

Conclusion: The proposed framework significantly improves prediction accuracy and minimizes the need for extensive experimental testing in UHPC mix design.

Abstract: This study presents a data-driven, multi-objective approach to predict the mechanical performance, flow ability, and porosity of Ultra-High-Performance Concrete (UHPC). Out of 21 machine learning algorithms tested, five high-performing models are selected, with XGBoost showing the best accuracy after hyperparameter tuning using Random Search and K-Fold Cross-Validation. The framework follows a two-stage process: the initial XGBoost model is built using raw data, and once selected as the final model, the dataset is cleaned by (1) removing multicollinear features, (2) identifying outliers with Isolation Forest, and (3) selecting important features using SHAP analysis. The refined dataset is then used to retrain XGBoost as Model 2, which achieves high prediction accuracy across all outputs. A graphical user interface (GUI) is also developed to support material designers. Overall, the proposed framework significantly improves the prediction accuracy and minimizes the need for extensive experimental testing in UHPC mix design.
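
A compact sketch of the stage-two cleaning loop described above, with illustrative thresholds: drop Isolation Forest outliers, rank features by mean |SHAP| value, and retrain.

```python
import numpy as np
import shap
from sklearn.ensemble import IsolationForest
from xgboost import XGBRegressor

def refine_and_retrain(X, y, keep_top_k=8):
    inlier = IsolationForest(random_state=0).fit_predict(X) == 1   # (2) drop outliers
    X, y = X[inlier], y[inlier]
    model = XGBRegressor(n_estimators=300).fit(X, y)
    shap_vals = shap.TreeExplainer(model).shap_values(X)           # (3) rank features
    top = np.argsort(np.abs(shap_vals).mean(axis=0))[::-1][:keep_top_k]
    return XGBRegressor(n_estimators=300).fit(X[:, top], y), top

X = np.random.randn(500, 12)
y = 2.0 * X[:, 0] - X[:, 3] + 0.1 * np.random.randn(500)
model2, selected = refine_and_retrain(X, y)
```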

[249] MAD-NG: Meta-Auto-Decoder Neural Galerkin Method for Solving Parametric Partial Differential Equations

Qiuqi Li, Yiting Liu, Jin Zhao, Wencan Zhu

Main category: cs.LG

TL;DR: A novel framework combining Neural Galerkin Method with Meta-Auto-Decoder paradigm for efficient and accurate solution of parametric PDEs, featuring space-time decoupling, meta-learning adaptation, and randomized sparse updates.

DetailsMotivation: Traditional neural network PDE solvers like PINNs and Deep Galerkin Methods struggle with generalization and long-time prediction efficiency due to their dependence on full space-time approximations, especially for parametric PDEs with uncertain or varying parameters.

Method: Enhances Neural Galerkin Method with Meta-Auto-Decoder paradigm, using space-time decoupling for stable time integration, meta-learning for rapid adaptation to unseen parameters, and randomized sparse updates to reduce computational costs.

Result: Achieves physically consistent, long-horizon predictions for complex parameterized evolution equations with significantly lower computational overhead, performing comparatively well in accuracy, robustness, and adaptability on benchmark problems.

Conclusion: The proposed framework successfully addresses limitations of traditional neural PDE solvers by combining Neural Galerkin Method with meta-learning techniques, enabling efficient and accurate solutions for parametric PDEs with improved generalization capabilities.

Abstract: Parametric partial differential equations (PDEs) are fundamental for modeling a wide range of physical and engineering systems influenced by uncertain or varying parameters. Traditional neural network-based solvers, such as Physics-Informed Neural Networks (PINNs) and Deep Galerkin Methods, often face challenges in generalization and long-time prediction efficiency due to their dependence on full space-time approximations. To address these issues, we propose a novel and scalable framework that significantly enhances the Neural Galerkin Method (NGM) by incorporating the Meta-Auto-Decoder (MAD) paradigm. Our approach leverages space-time decoupling to enable more stable and efficient time integration, while meta-learning-driven adaptation allows rapid generalization to unseen parameter configurations with minimal retraining. Furthermore, randomized sparse updates effectively reduce computational costs without compromising accuracy. Together, these advancements enable our method to achieve physically consistent, long-horizon predictions for complex parameterized evolution equations with significantly lower computational overhead. Numerical experiments on benchmark problems demonstrate that our method performs competitively in terms of accuracy, robustness, and adaptability.
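
For orientation, one step of a standard Neural Galerkin time integrator, not the paper's MAD-enhanced variant: the parameter velocity is obtained from a least-squares system at sampled collocation points.

```python
import torch

def ngm_step(u, theta, rhs, xs, dt):
    """One step: solve the least-squares system J(theta) dtheta = dt * f at xs."""
    J = torch.autograd.functional.jacobian(lambda th: u(xs, th), theta)  # (n_pts, n_params)
    f = rhs(xs, theta)                                                   # (n_pts,)
    dtheta = torch.linalg.lstsq(J, (dt * f).unsqueeze(-1)).solution.squeeze(-1)
    return theta + dtheta

# toy advection problem u_t = -u_x with u(x; theta) = a sin x + b cos x
xs = torch.linspace(0.0, 3.14, 32)
u = lambda x, th: th[0] * torch.sin(x) + th[1] * torch.cos(x)
rhs = lambda x, th: -(th[0] * torch.cos(x) - th[1] * torch.sin(x))
theta = ngm_step(u, torch.tensor([1.0, 0.0]), rhs, xs, dt=0.01)
```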

[250] Mechanical Strength Prediction of Steel-Polypropylene Fiber-based High-Performance Concrete Using Hybrid Machine Learning Algorithms

Jagaran Chakma, Zhiguang Zhou, Badhan Chakma

Main category: cs.LG

TL;DR: Machine learning models (ET-XGB, RF-LGBM, Transformer-XGB) accurately predict mechanical properties of steel-polypropylene fiber-reinforced HPC, with ET-XGB showing highest overall accuracy and SHAP identifying key influential factors.

DetailsMotivation: To develop accurate and interpretable machine learning models for predicting mechanical properties of steel-polypropylene fiber-reinforced high-performance concrete, enabling better mix design optimization and structural performance evaluation.

Method: Three ensemble ML models (ET-XGB, RF-LGBM, Transformer-XGB) were trained on extensive experimental data using k-fold cross-validation, hyperparameter optimization, SHAP analysis, and uncertainty analysis to predict compressive, flexural, and tensile strengths.

Result: ET-XGB achieved highest overall accuracy (R²: 0.994 CS, 0.944 FS, 0.978 TS) with lowest uncertainty for CS and TS. RF-LGBM provided most stable FS predictions (R² 0.977) with lowest FS uncertainty. SHAP identified fiber aspect ratios, silica fume, and steel fiber content as most influential positive predictors.

Conclusion: Machine learning models can provide accurate, interpretable, and generalizable predictions for HPC mechanical properties, offering valuable tools for concrete mix optimization and structural performance evaluation in engineering applications.

Abstract: This research develops and evaluates machine learning models to predict the mechanical properties of steel-polypropylene fiber-reinforced high-performance concrete (HPC). Three model families were investigated: Extra Trees with XGBoost (ET-XGB), Random Forest with LightGBM (RF-LGBM), and Transformer with XGBoost (Transformer-XGB). The target properties included compressive strength (CS), flexural strength (FS), and tensile strength (TS), based on an extensive dataset compiled from published experimental studies. Model training involved k-fold cross-validation, hyperparameter optimization, Shapley additive explanations (SHAP), and uncertainty analysis to ensure both robustness and interpretability. Among the tested approaches, the ET-XGB model achieved the highest overall accuracy, with testing R^2 values of 0.994 for CS, 0.944 for FS, and 0.978 for TS, and exhibited the lowest uncertainty for CS and TS (approximately 13-16% and 30.4%, respectively). The RF-LGBM model provided the most stable and reliable predictions for FS (R^2 of 0.977), yielding the lowest uncertainty for FS (approximately 5-33%). The Transformer-XGB model demonstrated strong predictive capability (R^2 of 0.978 for TS and 0.967 for FS) but consistently showed the highest uncertainty, indicating reduced generalization reliability. SHAP analysis further indicated that fiber aspect ratios (AR1 and AR2), silica fume (Sfu), and steel fiber content (SF) were the most influential predictors of strength, whereas water content (W) and the water-binder ratio (w/b) consistently had negative effects. The findings confirm that machine learning models can provide accurate, interpretable, and generalizable predictions of HPC mechanical properties. These models offer valuable tools for optimizing concrete mix design and enhancing structural performance evaluation in engineering applications.

[251] Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

Maximilian Weichart

Main category: cs.LG

TL;DR: The paper introduces Inverse-RPO, a method to derive prior-based UCTs from any prior-free UCB, creating variance-aware tree policies that outperform PUCT.

DetailsMotivation: While PUCT's prior term improves MCTS exploration, it was derived empirically rather than from first principles. Many alternative UCBs with stronger theoretical guarantees exist but extending them to prior-based UCTs has been challenging.

Method: Building on MCTS as regularized policy optimization (RPO), the authors introduce Inverse-RPO methodology to systematically derive prior-based UCTs from any prior-free UCB. Applied to UCB-V, they obtain two new variance-aware prior-based tree policies.

Result: The variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without additional computational cost. An extension to mctx library shows minimal code changes required.

Conclusion: Inverse-RPO provides a principled framework for deriving prior-based UCTs, enabling variance-aware search policies that improve upon empirically-derived PUCT while maintaining computational efficiency.

Abstract: Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal and intended to facilitate further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.
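
For orientation, the standard PUCT score and a hypothetical variance-aware, prior-weighted analogue in the spirit of UCB-V are sketched below; the paper's Inverse-RPO-derived policies differ in their exact form.

```python
import math

def puct(q, prior, n_parent, n_child, c=1.25):
    """Standard AlphaZero-style PUCT score."""
    return q + c * prior * math.sqrt(n_parent) / (1 + n_child)

def variance_aware_score(q, var, prior, n_parent, n_child, c=1.25):
    """Adds a UCB-V-style exploration bonus that grows with return variance (illustrative)."""
    bonus = math.sqrt(max(var, 0.0) * math.log(n_parent + 1) / (n_child + 1))
    return q + bonus + c * prior * math.sqrt(n_parent) / (1 + n_child)

print(puct(0.4, 0.2, 100, 10), variance_aware_score(0.4, 0.5, 0.2, 100, 10))
```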

[252] Causal-HM: Restoring Physical Generative Logic in Multimodal Anomaly Detection via Hierarchical Modulation

Xiao Liu, Junchen Jin, Yanjie Zhao, Zhixuan Xing

Main category: cs.LG

TL;DR: Causal-HM: A multimodal anomaly detection framework for robotic welding that models physical Process-to-Result dependencies using sensor-guided modulation and causal-hierarchical architecture.

DetailsMotivation: Existing multimodal anomaly detection methods suffer from causal blindness by treating process and result modalities equally, ignoring physical generative logic. The heterogeneity gap between high-dimensional visual data and low-dimensional sensor signals causes critical process context to be lost.

Method: Proposes Causal-HM with two innovations: 1) Sensor-Guided CHM Modulation that uses low-dimensional sensor signals as context to guide high-dimensional audio-visual feature extraction, and 2) Causal-Hierarchical Architecture that enforces unidirectional generative mapping to identify anomalies violating physical consistency.

Result: Achieves state-of-the-art I-AUROC of 90.7% on the newly constructed Weld-4M benchmark across four modalities, demonstrating superior performance in multimodal anomaly detection for robotic welding.

Conclusion: Causal-HM effectively addresses causal blindness and heterogeneity issues in multimodal anomaly detection by explicitly modeling physical Process-to-Result dependencies, providing a robust framework for quality assurance in smart manufacturing.

Abstract: Multimodal Unsupervised Anomaly Detection (UAD) is critical for quality assurance in smart manufacturing, particularly in complex processes like robotic welding. However, existing methods often suffer from causal blindness, treating process modalities (e.g., real-time video, audio, and sensors) and result modalities (e.g., post-weld images) as equal feature sources, thereby ignoring the inherent physical generative logic. Furthermore, the heterogeneity gap between high-dimensional visual data and low-dimensional sensor signals frequently leads to critical process context being drowned out. In this paper, we propose Causal-HM, a unified multimodal UAD framework that explicitly models the physical Process-to-Result dependency. Specifically, our framework incorporates two key innovations: a Sensor-Guided CHM Modulation mechanism that utilizes low-dimensional sensor signals as context to guide high-dimensional audio-visual feature extraction, and a Causal-Hierarchical Architecture that enforces a unidirectional generative mapping to identify anomalies that violate physical consistency. Extensive experiments on our newly constructed Weld-4M benchmark across four modalities demonstrate that Causal-HM achieves a state-of-the-art (SOTA) I-AUROC of 90.7%. Code will be released after the paper is accepted.
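
One plausible reading of sensor-guided modulation is FiLM-style conditioning, sketched below with hypothetical dimensions; the paper's CHM module is more elaborate.

```python
import torch
import torch.nn as nn

class SensorGuidedModulation(nn.Module):
    """Sensor context predicts per-channel scale and shift for audio-visual features."""
    def __init__(self, d_sensor, d_feat):
        super().__init__()
        self.film = nn.Linear(d_sensor, 2 * d_feat)

    def forward(self, av_feat, sensor):
        gamma, beta = self.film(sensor).chunk(2, dim=-1)
        return (1 + gamma) * av_feat + beta      # sensor signals steer feature extraction

out = SensorGuidedModulation(6, 256)(torch.randn(4, 256), torch.randn(4, 6))
```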

[253] Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

Main category: cs.LG

TL;DR: A novel data-aware post-training quantization method for 1-bit LLMs that addresses activation error accumulation to outperform existing 1-bit PTQ approaches with minimal overhead.

DetailsMotivation: While LLMs perform well, their large size hinders deployment on resource-constrained devices. 1-bit quantization (converting weights to ±1) is particularly challenging as existing methods suffer significant performance degradation compared to full-precision models. Current 1-bit PTQ methods focus on weight alignment rather than output alignment, which is more intuitive but naive application leads to poor performance.

Method: The paper investigates why output-matching fails in 1-bit LLM quantization and proposes a novel data-aware PTQ approach that explicitly accounts for activation error accumulation while maintaining optimization efficiency.

Result: Empirical experiments demonstrate that the proposed solution consistently outperforms existing 1-bit PTQ methods with minimal overhead.

Conclusion: The paper presents an effective data-aware approach for 1-bit LLM quantization that addresses the limitations of existing output-matching methods by considering activation error accumulation, enabling better performance with low deployment costs.

Abstract: Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even sub-4-bit methods can maintain most of the original model performance. However, 1-bit quantization, which converts floating-point weights to $\pm 1$, remains particularly challenging, as existing 1-bit PTQ methods often suffer from significant performance degradation compared to the full-precision models. Specifically, most existing 1-bit PTQ approaches focus on weight alignment, aligning the full-precision model weights with those of the quantized models, rather than directly aligning their outputs. Although the output-matching objective is more intuitive and aligns with the quantization goal, naively applying it in 1-bit LLMs often leads to notable performance degradation. In this paper, we investigate why, and under what conditions, output matching fails in the context of 1-bit LLM quantization. Based on our findings, we propose a novel data-aware PTQ approach for 1-bit LLMs that explicitly accounts for activation error accumulation while keeping optimization efficient. Empirical experiments demonstrate that our solution consistently outperforms existing 1-bit PTQ methods with minimal overhead.
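
For context, the standard 1-bit weight quantizer that most PTQ pipelines start from (per-row scale α = mean |w| with binary weights sign(w)); the paper's contribution is the data-aware output-alignment objective built on top of such a quantizer, not this baseline itself.

```python
import torch

def binarize(W):
    """Per-output-channel scale alpha = mean |w| with binary weights sign(w)."""
    alpha = W.abs().mean(dim=1, keepdim=True)
    return alpha * torch.sign(W)

W = torch.randn(8, 16)
print((W - binarize(W)).pow(2).mean())   # weight-space quantization error
```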

[254] An Information Theoretic Perspective on Agentic System Design

Shizhe He, Avanika Narayan, Ishan S. Khare, Scott W. Linderman, Christopher Ré, Dan Biderman

Main category: cs.LG

TL;DR: Agentic LM systems use smaller “compressor” LMs to distill context into compact text for larger “predictor” LMs. The paper introduces an information-theoretic framework using mutual information to quantify compression quality, showing it predicts downstream performance and enables more efficient system design.

DetailsMotivation: Current compressor-predictor LM systems are designed ad hoc without clear guidance on how compressor/predictor choices affect performance. There's a need for task-independent metrics to evaluate compression quality and optimize system design efficiently.

Method: View compressor LM as a noisy channel and use mutual information between context and compression as a task-independent metric. Conduct comprehensive empirical analysis across 5 datasets and 3 model families to study how compressor scaling affects performance.

Result: Mutual information strongly predicts downstream performance. Larger compressors are more accurate, token-efficient, and convey more bits of information per token (e.g., 7B Qwen-2.5 is 1.6× more accurate, 4.6× more concise, conveys 5.5× more bits/token than 1.5B). Scaling compressors is more effective than scaling predictors.

Conclusion: Information-theoretic framework enables principled design of compressor-predictor systems. Larger on-device compressors paired with smaller cloud predictors can achieve near-frontier accuracy at significantly reduced costs (e.g., 3B local compressors recover 99% accuracy at 26% API costs).

Abstract: Agentic language model (LM) systems power modern applications like “Deep Research” and “Claude Code,” and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller “compressor” LMs (that can even run locally) distill raw context into compact text that is then consumed by larger “predictor” LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors not only are more accurate, but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.5\times$ more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover $99\%$ of frontier-LM accuracy at $26\%$ of API costs.
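
A minimal sketch of the estimator's underlying identity, I(X; Z) = H(X) − H(X | Z), assuming any LM-based scorer for log p(x) and log p(x | z); the `log_prob` callable is a placeholder, and the paper's concrete estimator may differ.

```python
import math

def mutual_information_bits(pairs, log_prob):
    """pairs: (context, compression) tuples; log_prob returns nats."""
    nats = sum(log_prob(x, condition=z) - log_prob(x) for x, z in pairs)
    return nats / (len(pairs) * math.log(2))    # average pointwise information, in bits

# toy scorer for illustration: pretend conditioning halves the negative log-likelihood
toy = lambda x, condition=None: -0.5 * len(x) * (0.5 if condition else 1.0)
print(mutual_information_bits([("hello world", "greeting")], toy))
```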

[255] Dictionary-Transform Generative Adversarial Networks

Angshul Majumdar

Main category: cs.LG

TL;DR: DT-GAN is a model-based adversarial framework using sparse dictionaries and transforms, providing theoretical guarantees and stable training unlike neural GANs.

DetailsMotivation: Classical GANs have ill-posed objectives, unstable training dynamics, and limited interpretability. The authors aim to create a theoretically sound adversarial framework with rigorous guarantees.

Method: DT-GAN uses a sparse synthesis dictionary as generator and an analysis transform as discriminator. Both are linear operators with explicit constraints, enabling mathematical analysis of equilibria and identifiability.

Result: DT-GAN has well-posed adversarial games with Nash equilibria, provable identifiability under sparse models, finite-sample stability, and consistent structure recovery on synthetic data.

Conclusion: Adversarial learning can be made interpretable, stable, and provably correct when grounded in classical sparse modeling, offering a principled alternative for sparse-structured distributions.

Abstract: Generative adversarial networks (GANs) are widely used for distribution learning, yet their classical formulations remain theoretically fragile, with ill-posed objectives, unstable training dynamics, and limited interpretability. In this work, we introduce \emph{Dictionary-Transform Generative Adversarial Networks} (DT-GAN), a fully model-based adversarial framework in which the generator is a sparse synthesis dictionary and the discriminator is an analysis transform acting as an energy model. By restricting both players to linear operators with explicit constraints, DT-GAN departs fundamentally from neural GAN architectures and admits rigorous theoretical analysis. We show that the DT-GAN adversarial game is well posed and admits at least one Nash equilibrium. Under a sparse generative model, equilibrium solutions are provably identifiable up to standard permutation and sign ambiguities and exhibit a precise geometric alignment between synthesis and analysis operators. We further establish finite-sample stability and consistency of empirical equilibria, demonstrating that DT-GAN training converges reliably under standard sampling assumptions and remains robust in heavy-tailed regimes. Experiments on mixture-structured synthetic data validate the theoretical predictions, showing that DT-GAN consistently recovers underlying structure and exhibits stable behavior under identical optimization budgets where a standard GAN degrades. DT-GAN is not proposed as a universal replacement for neural GANs, but as a principled adversarial alternative for data distributions that admit sparse synthesis structure. The results demonstrate that adversarial learning can be made interpretable, stable, and provably correct when grounded in classical sparse modeling.
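
A schematic version of the two linear players, with an illustrative objective (the paper's constrained game and regularizers are more specific): a dictionary synthesizes samples from sparse codes, and an analysis transform scores samples by the energy ||Wx||_1.

```python
import torch

d, k, n = 20, 40, 256
D = torch.randn(d, k, requires_grad=True)    # synthesis dictionary (generator)
W = torch.randn(d, d, requires_grad=True)    # analysis transform (discriminator)
opt_D, opt_W = torch.optim.SGD([D], lr=1e-2), torch.optim.SGD([W], lr=1e-2)

def energy(x):                                # analysis energy ||W x||_1
    return (x @ W.t()).abs().sum(dim=1).mean()

def sparse_codes():
    return torch.randn(n, k) * (torch.rand(n, k) < 0.1).float()

x_real = torch.randn(n, d)
for _ in range(100):
    loss_W = energy(x_real) - energy(sparse_codes() @ D.t())  # real low, fake high energy
    opt_W.zero_grad(); loss_W.backward(); opt_W.step()
    loss_D = energy(sparse_codes() @ D.t())                   # generator lowers its energy
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```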

[256] RIPCN: A Road Impedance Principal Component Network for Probabilistic Traffic Flow Forecasting

Haochen Lv, Yan Lin, Shengnan Guo, Xiaowei Mao, Hong Nie, Letian Gong, Youfang Lin, Huaiyu Wan

Main category: cs.LG

TL;DR: RIPCN is a probabilistic traffic flow forecasting model that integrates transportation theory with spatiotemporal learning to address uncertainty estimation challenges by modeling road impedance dynamics and capturing spatiotemporal uncertainty correlations.

DetailsMotivation: Accurate traffic flow forecasting with uncertainty estimation is crucial for intelligent transportation services. Existing probabilistic traffic flow forecasting approaches struggle with two key challenges: (1) uncovering and modeling the causes of traffic flow uncertainty for reliable forecasting, and (2) capturing spatiotemporal correlations of uncertainty for accurate prediction.

Method: RIPCN (Road Impedance Principal Component Network) integrates domain-specific transportation theory with spatiotemporal principal component learning. It includes: (1) a dynamic impedance evolution network that captures directional traffic transfer patterns driven by road congestion level and flow variability to reveal uncertainty causes, and (2) a principal component network that forecasts dominant eigenvectors of future flow covariance to capture spatiotemporal uncertainty correlations.

Result: Experimental results on real-world datasets show that RIPCN outperforms existing probabilistic forecasting methods in both uncertainty estimation accuracy and point prediction performance.

Conclusion: RIPCN successfully addresses the key challenges in probabilistic traffic flow forecasting by combining transportation theory with machine learning, providing both reliable uncertainty estimation and improved point predictions through interpretable modeling of uncertainty causes and spatiotemporal correlations.

Abstract: Accurate traffic flow forecasting is crucial for intelligent transportation services such as navigation and ride-hailing. In such applications, uncertainty estimation in forecasting is important because it helps evaluate traffic risk levels, assess forecast reliability, and provide timely warnings. As a result, probabilistic traffic flow forecasting (PTFF) has gained significant attention, as it produces both point forecasts and uncertainty estimates. However, existing PTFF approaches still face two key challenges: (1) how to uncover and model the causes of traffic flow uncertainty for reliable forecasting, and (2) how to capture the spatiotemporal correlations of uncertainty for accurate prediction. To address these challenges, we propose RIPCN, a Road Impedance Principal Component Network that integrates domain-specific transportation theory with spatiotemporal principal component learning for PTFF. RIPCN introduces a dynamic impedance evolution network that captures directional traffic transfer patterns driven by road congestion level and flow variability, revealing the direct causes of uncertainty and enhancing both reliability and interpretability. In addition, a principal component network is designed to forecast the dominant eigenvectors of future flow covariance, enabling the model to capture spatiotemporal uncertainty correlations. This design allows for accurate and efficient uncertainty estimation while also improving point prediction performance. Experimental results on real-world datasets show that our approach outperforms existing probabilistic forecasting methods.

[257] Unifying Learning Dynamics and Generalization in Transformers Scaling Law

Chiwun Yang

Main category: cs.LG

TL;DR: The paper provides a theoretical foundation for scaling laws in LLMs by analyzing transformer training dynamics as ODEs, revealing a phase transition in generalization error decay from exponential to power-law as computational resources increase.

DetailsMotivation: While scaling laws are empirically validated in LLM development, their theoretical foundations remain poorly understood. The paper aims to provide rigorous theoretical analysis of how computational resources affect transformer model performance during training.

Method: Formalizes transformer learning dynamics as an ordinary differential equation (ODE) system, approximating the process to kernel behaviors. Analyzes stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions.

Result: Establishes a theoretical upper bound on excess risk with a distinct phase transition: initial optimization phase shows exponential decay relative to computational cost, then transitions to statistical phase with power-law decay of Θ(C^{-1/6}). Also derives isolated scaling laws for model size, training time, and dataset size.

Conclusion: The work provides a unified theoretical framework for understanding scaling laws in transformers, explaining the phase transition in generalization error decay and offering insights into how different computational resources independently govern performance bounds.

Abstract: The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $Θ(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.
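
A schematic restatement of the stated two-phase bound; $c_1$, $c_2$, $c_3$ and the threshold $\mathsf{C}_0$ are placeholder constants, not quantities from the paper, and the statistical phase is tight, i.e. $Θ(\mathsf{C}^{-1/6})$.

```latex
\mathrm{ExcessRisk}(\mathsf{C}) \;\lesssim\;
\begin{cases}
  c_1 \exp(-c_2\,\mathsf{C}), & \mathsf{C} \le \mathsf{C}_0 \quad \text{(optimization phase)}\\[2pt]
  c_3\,\mathsf{C}^{-1/6},     & \mathsf{C} > \mathsf{C}_0 \quad \text{(statistical phase)}
\end{cases}
```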

[258] Dynamic Feedback Engines: Layer-Wise Control for Self-Regulating Continual Learning

Hengyi Wu, Zhenyi Wang, Heng Huang

Main category: cs.LG

TL;DR: Proposes an entropy-aware continual learning method that dynamically regulates layers based on their entropy to balance underfitting and overfitting, improving generalization by converging to wider local minima.

DetailsMotivation: Most continual learning methods treat all layers uniformly, trading stability for plasticity or vice versa, but different layers naturally exhibit varying uncertainty levels. High-entropy layers underfit while low-entropy layers overfit, creating an imbalance that needs addressing.

Method: Uses a dynamic feedback mechanism to regulate each layer based on its entropy: reduces entropy in high-entropy layers to mitigate underfitting, and increases entropy in overly confident layers to alleviate overfitting. This adaptive regulation encourages convergence to wider local minima for better generalization.

Result: Experiments on various datasets demonstrate substantial performance gains over state-of-the-art continual learning baselines.

Conclusion: The proposed entropy-aware method effectively balances underfitting and overfitting in continual learning by adaptively regulating layers based on their entropy, leading to improved generalization and performance across different datasets and learning approaches.

Abstract: Continual learning aims to acquire new tasks while preserving performance on previously learned ones, but most methods struggle with catastrophic forgetting. Existing approaches typically treat all layers uniformly, often trading stability for plasticity or vice versa. However, different layers naturally exhibit varying levels of uncertainty (entropy) when classifying tasks. High-entropy layers tend to underfit by failing to capture task-specific patterns, while low-entropy layers risk overfitting by becoming overly confident and specialized. To address this imbalance, we propose an entropy-aware continual learning method that employs a dynamic feedback mechanism to regulate each layer based on its entropy. Specifically, our approach reduces entropy in high-entropy layers to mitigate underfitting and increases entropy in overly confident layers to alleviate overfitting. This adaptive regulation encourages the model to converge to wider local minima, which have been shown to improve generalization. Our method is general and can be seamlessly integrated with both replay- and regularization-based approaches. Experiments on various datasets demonstrate substantial performance gains over state-of-the-art continual learning baselines.
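
A minimal sketch of the entropy-feedback idea, assuming per-layer probe classifiers that expose a class distribution at every layer; the probes, the target entropy band, and the penalty weighting are our illustrative choices, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def layer_entropy(logits):
    """Mean Shannon entropy of a layer probe's class distribution."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

def entropy_feedback_penalty(probe_logits, low=0.5, high=2.0, weight=0.1):
    """Push each layer's entropy into [low, high]: reduce it for
    underfitting (high-entropy) layers, raise it for overconfident
    (low-entropy) layers."""
    penalty = 0.0
    for logits in probe_logits:  # one logits tensor per regulated layer
        h = layer_entropy(logits)
        penalty = penalty + F.relu(h - high) + F.relu(low - h)
    return weight * penalty  # added to the task loss of the current task
```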

[259] A2P-Vis: an Analyzer-to-Presenter Agentic Pipeline for Visual Insights Generation and Reporting

Shuyu Gan, Renxiang Wang, James Mooney, Dongyeop Kang

Main category: cs.LG

TL;DR: A2P-Vis is a two-part multi-agent pipeline that automatically transforms raw datasets into professional data-visualization reports by combining quality-assured analysis with narrative generation.

DetailsMotivation: Current AI agents for end-to-end data science pipelines fail to generate insightful, diverse visual evidence and assemble it into coherent, professional reports, limiting their real-world usefulness for practitioners.

Method: Two-part multi-agent pipeline: 1) Data Analyzer orchestrates profiling, proposes visualization directions, generates/executes plotting code, filters low-quality figures with legibility checker, and scores candidate insights for depth, correctness, specificity, and actionability. 2) Presenter orders topics, composes chart-grounded narratives from top-ranked insights, writes transitions, and revises for clarity and consistency.

Result: The system converts raw data into curated materials (charts + vetted insights) and readable narratives without manual intervention, producing publication-ready reports that operationalize co-analysis end-to-end.

Conclusion: By coupling quality-assured analysis with narrative presentation, A2P-Vis improves the real-world usefulness of automated data analysis for practitioners through operationalized co-analysis.

Abstract: Automating the end-to-end data science pipeline with AI agents still stalls on two gaps: generating insightful, diverse visual evidence and assembling it into a coherent, professional report. We present A2P-Vis, a two-part, multi-agent pipeline that turns raw datasets into a high-quality data-visualization report. The Data Analyzer orchestrates profiling, proposes diverse visualization directions, generates and executes plotting code, filters low-quality figures with a legibility checker, and elicits candidate insights that are automatically scored for depth, correctness, specificity, and actionability. The Presenter then orders topics, composes chart-grounded narratives from the top-ranked insights, writes justified transitions, and revises the document for clarity and consistency, yielding a coherent, publication-ready report. Together, these agents convert raw data into curated materials (charts + vetted insights) and into a readable narrative without manual glue work. We claim that by coupling a quality-assured Analyzer with a narrative Presenter, A2P-Vis operationalizes co-analysis end-to-end, improving the real-world usefulness of automated data analysis for practitioners. For the complete dataset report, please see: https://www.visagent.org/api/output/f2a3486d-2c3b-4825-98d4-5af25a819f56.

[260] A Model of Causal Explanation on Neural Networks for Tabular Data

Takashi Isozaki, Masahiro Yamamoto, Atsushi Noda

Main category: cs.LG

TL;DR: CENNET is a causal explanation method for neural network predictions on tabular data that addresses pseudo-correlation and causality issues using structural causal models combined with neural networks, with a new entropy-based explanation power index.

DetailsMotivation: The need to explain machine learning results, especially for neural networks on tabular data, addressing issues of pseudo-correlation, causality, and combinatorial reasons in predictions.

Method: CENNET combines structural causal models (SCMs) with neural networks to provide causal explanations, using a new entropy-based explanation power index. SCMs are effectively integrated with NNs despite typically not being used as standalone predictive models.

Result: CENNET successfully provides causal explanations through comparative experiments with existing methods on both synthetic and quasi-real data in classification tasks.

Conclusion: CENNET offers an effective approach for causal explanation of neural network predictions on tabular data by leveraging structural causal models and introducing a novel entropy-based evaluation metric.

Abstract: The problem of explaining the results produced by machine learning methods continues to attract attention. Neural network (NN) models, along with gradient boosting machines, are expected to be utilized even for tabular data, where they can achieve high prediction accuracy. This study addresses the related issues of pseudo-correlation, causality, and combinatorial reasons for tabular data in NN predictors. We propose a causal explanation method, CENNET, and a new explanation power index using entropy for the method. CENNET provides causal explanations for predictions by NNs, effectively combining structural causal models (SCMs) with the NNs, although SCMs are usually not used as standalone predictive models because of their limited predictive accuracy. We show that CENNET provides such explanations through comparative experiments with existing methods on both synthetic and quasi-real data in classification tasks.

[261] Approximation Capabilities of Feedforward Neural Networks with GELU Activations

Konstantin Yakovlev, Nikita Puchkin

Main category: cs.LG

TL;DR: The paper provides simultaneous approximation error bounds for functions and all their derivatives using GELU neural networks, with guarantees for polynomials, exponentials, and reciprocals.

DetailsMotivation: To develop neural network approximation theory that provides simultaneous error bounds for both functions and their higher-order derivatives, which is important for applications requiring accurate derivative approximations like physics-informed neural networks and numerical analysis.

Method: Constructive approximation using feedforward neural networks with GELU activation, starting with multiplication approximation and extending to division and exponential functions, with analysis of network size, weight magnitudes, and asymptotic behavior.

Result: Derived simultaneous error bounds for functions and all their derivatives up to any prescribed order, with guarantees for multivariate polynomials, exponential function, and reciprocal function, including network size specifications and weight magnitude bounds.

Conclusion: GELU neural networks can simultaneously approximate functions and their derivatives with provable error bounds, enabling reliable derivative approximations in practical applications while maintaining global boundedness of higher-order derivatives.

Abstract: We derive an approximation error bound that holds simultaneously for a function and all its derivatives up to any prescribed order. The bounds apply to elementary functions, including multivariate polynomials, the exponential function, and the reciprocal function, and are obtained using feedforward neural networks with the Gaussian Error Linear Unit (GELU) activation. In addition, we report the network size, weight magnitudes, and behavior at infinity. Our analysis begins with a constructive approximation of multiplication, where we prove the simultaneous validity of error bounds over domains of increasing size for a given approximator. Leveraging this result, we obtain approximation guarantees for division and the exponential function, ensuring that all higher-order derivatives of the resulting approximators remain globally bounded.
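
For reference, the activation analyzed here is

```latex
\mathrm{GELU}(x) \;=\; x\,\Phi(x) \;=\; \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right),
```

where $\Phi$ is the standard Gaussian CDF. Because GELU is infinitely differentiable, unlike ReLU, error bounds can hold simultaneously for a function and its higher-order derivatives.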

[262] VAMP-Net: An Interpretable Multi-Path Framework of Genomic Permutation-Invariant Set Attention and Quality-Aware 1D-CNN for MTB Drug Resistance

Aicha Boutorh, Kamar Hibatallah Baghdadi, Anais Daoud

Main category: cs.LG

TL;DR: VAMP-Net: A dual-path neural network for interpretable drug resistance prediction in tuberculosis that combines variant set analysis with sequencing quality assessment to achieve >95% accuracy.

DetailsMotivation: Current genomic prediction of drug resistance in Mycobacterium tuberculosis faces challenges from complex epistatic interactions and variable sequencing data quality, limiting clinical utility.

Method: Two complementary pathways: Path-1 uses Set Attention Transformer for permutation-invariant variant sets to capture epistatic interactions; Path-2 uses 1D CNN to analyze VCF quality metrics for adaptive confidence scoring. Fusion module combines both for final classification.

Result: Superior performance over baseline models with >95% accuracy and ~97% AUC for Rifampicin and Rifabutin resistance prediction. Dual interpretability: Attention weights reveal epistatic networks, Integrated Gradients identify critical loci (rpoB), and CNN gradients show quality metric dependencies.

Conclusion: VAMP-Net establishes a new paradigm for robust, clinically-actionable resistance prediction by delivering state-of-the-art performance with dual-layer interpretability at genetic causality and technical confidence levels.

Abstract: Genomic prediction of drug resistance in Mycobacterium tuberculosis remains challenging due to complex epistatic interactions and highly variable sequencing data quality. We present a novel Interpretable Variant-Aware Multi-Path Network (VAMP-Net) that addresses both challenges through complementary machine learning pathways. Path-1 employs a Set Attention Transformer processing permutation-invariant variant sets to capture epistatic interactions between genomic loci. Path-2 utilizes a 1D Convolutional Neural Network that analyzes Variant Call Format quality metrics to learn adaptive confidence scores. A fusion module combines both pathways for final resistance classification. We conduct comparative evaluations of unmasked versus padding-masked Set Attention Blocks, and demonstrate that our multi-path architecture achieves superior performance over baseline CNN and MLP models, with accuracy exceeding 95% and AUC around 97% for Rifampicin (RIF) and Rifabutin (RFB) resistance prediction. The framework provides dual-layer interpretability: attention weight analysis reveals epistatic networks, Integrated Gradients (IG) highlights critical resistance loci (notably rpoB), and gradient-based feature importance from the CNN pathway uncovers drug-specific dependencies on data quality metrics. This architecture advances clinical genomics by delivering state-of-the-art predictive performance alongside auditable interpretability at two distinct levels, genetic causality of mutation sets and technical confidence of sequencing evidence, establishing a new paradigm for robust, clinically-actionable resistance prediction.

[263] Synthetic Financial Data Generation for Enhanced Financial Modelling

Christophe D. Hounwanou, Yae Ulrich Gaba, Pierre Ntakirutimana

Main category: cs.LG

TL;DR: A unified evaluation framework for synthetic financial data comparing ARIMA-GARCH, VAEs, and TimeGAN on S&P 500 data, finding TimeGAN offers the best balance between realism and temporal coherence.

DetailsMotivation: Data scarcity and confidentiality in finance hinder model development and testing, creating a need for reliable synthetic data generation methods that can overcome these limitations while maintaining data utility.

Method: Developed a multi-criteria evaluation framework assessing fidelity (MMD), temporal structure (autocorrelation, volatility clustering), and practical utility in downstream tasks (portfolio optimization, volatility forecasting). Applied to three generative paradigms using S&P 500 daily data.

Result: ARIMA-GARCH captures linear trends and conditional volatility but fails at nonlinear dynamics; VAEs produce smooth trajectories underestimating extreme events; TimeGAN achieves best trade-off with lowest MMD (1.84e-3) and best temporal coherence.

Conclusion: TimeGAN provides optimal balance for synthetic financial data generation. The paper offers practical guidelines for model selection based on application needs and computational constraints, with a unified evaluation protocol to standardize benchmarking in the field.

Abstract: Data scarcity and confidentiality in finance often impede model development and robust testing. This paper presents a unified multi-criteria evaluation framework for synthetic financial data and applies it to three representative generative paradigms: the statistical ARIMA-GARCH baseline, Variational Autoencoders (VAEs), and Time-series Generative Adversarial Networks (TimeGAN). Using historical S&P 500 daily data, we evaluate fidelity (Maximum Mean Discrepancy, MMD), temporal structure (autocorrelation and volatility clustering), and practical utility in downstream tasks, specifically mean-variance portfolio optimization and volatility forecasting. Empirical results indicate that ARIMA-GARCH captures linear trends and conditional volatility but fails to reproduce nonlinear dynamics; VAEs produce smooth trajectories that underestimate extreme events; and TimeGAN achieves the best trade-off between realism and temporal coherence (e.g., TimeGAN attained the lowest MMD: 1.84e-3, averaged over 5 seeds). Finally, we articulate practical guidelines for selecting generative models according to application needs and computational constraints. Our unified evaluation protocol and reproducible codebase aim to standardize benchmarking in synthetic financial data research.
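
The fidelity metric used throughout, Maximum Mean Discrepancy, has a standard unbiased estimator; a minimal sketch with an RBF kernel follows, where the bandwidth choice is ours.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2 between samples X (n, d) and Y (m, d) under the
    RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # drop diagonal
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```

Here X would hold real return windows and Y synthetic ones; lower values indicate higher fidelity, which is the sense in which TimeGAN's 1.84e-3 is reported as best.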

[264] Smart IoT-Based Leak Forecasting and Detection for Energy-Efficient Liquid Cooling in AI Data Centers

Krishna Chaitanya Sunkara, Rambabu Konakanchi

Main category: cs.LG

TL;DR: Smart IoT system combines LSTM for leak forecasting and Random Forest for detection in liquid-cooled AI data centers, achieving 96.5% detection accuracy and 87% forecasting accuracy, potentially preventing 1,500 kWh annual energy waste.

DetailsMotivation: Liquid-cooled AI data centers face substantial energy loss from coolant leaks causing unplanned shutdowns and extended repairs. Current systems lack proactive monitoring to prevent these costly incidents.

Method: IoT monitoring system using LSTM neural networks for probabilistic leak forecasting and Random Forest classifiers for instant detection. System employs MQTT streaming, InfluxDB storage, and Streamlit dashboards, tested on synthetic data aligned with ASHRAE 2021 standards.

Result: Achieves 96.5% detection accuracy and 87% forecasting accuracy at 90% probability within ±30-minute windows. System forecasts leaks 2-4 hours ahead and identifies sudden events within 1 minute. Humidity, pressure, and flow rate provide strong predictive signals while temperature shows minimal immediate response due to thermal inertia.

Conclusion: The approach could prevent roughly 1,500 kWh annual energy waste for a typical 47-rack facility through proactive maintenance. While validation is synthetic-only, results establish feasibility for future operational deployment in sustainable data center operations.

Abstract: GPU-centric AI data centers have adopted liquid cooling to handle extreme heat loads, but coolant leaks result in substantial energy loss through unplanned shutdowns and extended repair periods. We present a proof-of-concept smart IoT monitoring system combining LSTM neural networks for probabilistic leak forecasting with Random Forest classifiers for instant detection. Testing on synthetic data aligned with ASHRAE 2021 standards, our approach achieves 96.5% detection accuracy and 87% forecasting accuracy at 90% probability within ±30-minute windows. Analysis demonstrates that humidity, pressure, and flow rate deliver strong predictive signals, while temperature exhibits minimal immediate response due to thermal inertia in server hardware. The system employs MQTT streaming, InfluxDB storage, and Streamlit dashboards, forecasting leaks 2-4 hours ahead while identifying sudden events within 1 minute. For a typical 47-rack facility, this approach could prevent roughly 1,500 kWh annual energy waste through proactive maintenance rather than reactive emergency procedures. While validation remains synthetic-only, results establish feasibility for future operational deployment in sustainable data center operations.

[265] A Comedy of Estimators: On KL Regularization in RL Training of LLMs

Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville

Main category: cs.LG

TL;DR: Analysis of KL divergence estimation methods in RL fine-tuning of LLMs, showing that unbiased gradient estimators lead to better performance and stability compared to biased ones.

DetailsMotivation: Despite widespread use of KL regularization in RL training of LLMs, there's no systematic study of different KL estimation methods and their impact on model performance. Current practices often have discrepancies between stated objectives and actual implementations.

Method: Analyzed gradient properties of various KL estimator configurations, then empirically tested them by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct, and Qwen3-4B-Instruct-2507 models with different configurations on in- and out-of-distribution tasks.

Result: In on-policy settings: (1) biased gradient estimators cause training instability; (2) unbiased gradient estimators lead to better performance on both in-domain and out-of-domain tasks. In off-policy settings, KL regularization helps stabilize training in asynchronous setups.

Conclusion: Proper KL estimator configuration with unbiased gradients is crucial for stable RL training and better performance of LLMs, while biased estimators can cause instability and suboptimal results.

Abstract: The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct} and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training arising from asynchronous setups.
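
The estimator configurations under study build on the widely used k1/k2/k3 family of Monte Carlo KL estimators (the naming follows Schulman's convention; this is background, not code from the paper). A minimal sketch from per-sample log-probabilities:

```python
import torch

def kl_estimators(logp_policy, logp_ref):
    """Monte Carlo estimates of KL(pi_theta || pi_ref) from samples
    drawn from pi_theta, via logr = log pi_ref(x) - log pi_theta(x)."""
    logr = logp_ref - logp_policy
    r = logr.exp()
    k1 = -logr               # unbiased for the KL value, high variance
    k2 = 0.5 * logr ** 2     # biased, low variance
    k3 = r - 1.0 - logr      # unbiased, low variance, always >= 0
    return k1.mean(), k2.mean(), k3.mean()
```

Note that an estimator being unbiased for the KL *value* does not make its gradient unbiased for the regularized objective once it is dropped into the loss; how each configuration enters the loss is precisely the design choice whose gradient bias the paper analyzes.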

[266] Secure and Explainable Fraud Detection in Finance via Hierarchical Multi-source Dataset Distillation

Yiming Qian, Thorsten Neumann, Xueyining Huang, David Hardoon, Fei Gao, Yong Liu, Siow Mong Rick Goh

Main category: cs.LG

TL;DR: Proposes explainable, privacy-preserving dataset distillation for fraud detection using random forest rule regions to generate synthetic data, maintaining performance while reducing data volume by 85-93%.

DetailsMotivation: Need for collaborative fraud detection across financial institutions while preserving privacy, enabling explainability, and reducing data volume for regulatory compliance and auditability.

Method: Convert trained random forest into transparent axis-aligned rule regions (leaf hyperrectangles), uniformly sample within each region to generate synthetic transactions, creating compact surrogate dataset.

Result: Distilled datasets reduce data volume by 85-93% while maintaining competitive precision and micro-F1; improves cross-cluster performance; achieves 93% structural similarity; membership-inference attacks at chance level (0.50).

Conclusion: Tree-region distillation enables trustworthy fraud analytics with interpretable rules, per-case rationales with uncertainty quantification, strong privacy properties suitable for multi-institution settings and regulatory audit.

Abstract: We propose an explainable, privacy-preserving dataset distillation framework for collaborative financial fraud detection. A trained random forest is converted into transparent, axis-aligned rule regions (leaf hyperrectangles), and synthetic transactions are generated by uniformly sampling within each region. This produces a compact, auditable surrogate dataset that preserves local feature interactions without exposing sensitive original records. The rule regions also support explainability: aggregated rule statistics (for example, support and lift) describe global patterns, while assigning each case to its generating region gives concise human-readable rationales and calibrated uncertainty based on tree-vote disagreement. On the IEEE-CIS fraud dataset (590k transactions across three institution-like clusters), distilled datasets reduce data volume by 85% to 93% (often under 15% of the original) while maintaining competitive precision and micro-F1, with only a modest AUC drop. Sharing and augmenting with synthesized data across institutions improves cross-cluster precision, recall, and AUC. Real vs. synthesized structure remains highly similar (over 93% by nearest-neighbor cosine analysis). Membership-inference attacks perform at chance level (about 0.50) when distinguishing training from hold-out records, suggesting low memorization risk. Removing high-uncertainty synthetic points using disagreement scores further boosts AUC (up to 0.687) and improves calibration. Sensitivity tests show weak dependence on the distillation ratio (AUC about 0.641 to 0.645 from 6% to 60%). Overall, tree-region distillation enables trustworthy, deployable fraud analytics with interpretable global rules, per-case rationales with quantified uncertainty, and strong privacy properties suitable for multi-institution settings and regulatory audit.
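
The core distillation mechanism, converting a fitted forest into axis-aligned leaf hyperrectangles and sampling uniformly inside them, can be sketched with sklearn tree internals; the global feature bounds and per-region sample count below are our illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # a fitted forest is assumed

def leaf_regions(decision_tree, low, high):
    """Enumerate the leaf hyperrectangles [low, high] of one fitted tree."""
    t = decision_tree.tree_
    regions = []
    def walk(node, lo, hi):
        if t.children_left[node] == -1:  # leaf node
            regions.append((lo.copy(), hi.copy()))
            return
        f, thr = t.feature[node], t.threshold[node]
        hi_left = hi.copy(); hi_left[f] = min(hi[f], thr)    # x[f] <= thr
        walk(t.children_left[node], lo, hi_left)
        lo_right = lo.copy(); lo_right[f] = max(lo[f], thr)  # x[f] >  thr
        walk(t.children_right[node], lo_right, hi)
    walk(0, low.astype(float), high.astype(float))
    return regions

def distill(forest, low, high, per_region=20, seed=0):
    """Uniformly sample synthetic rows inside every leaf region."""
    rng = np.random.default_rng(seed)
    rows = [rng.uniform(lo, hi, size=(per_region, lo.size))
            for tree in forest.estimators_
            for lo, hi in leaf_regions(tree, low, high)]
    return np.vstack(rows)
```

Because each synthetic row is drawn from a region's bounds rather than copied from any real record, no original transaction is reproduced, which is consistent with the chance-level membership-inference results reported above.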

[267] MMCTOP: A Multimodal Textualization and Mixture-of-Experts Framework for Clinical Trial Outcome Prediction

Carolina Aparício, Qi Shi, Bo Wen, Tesfaye Yadete, Qiwei Han

Main category: cs.LG

TL;DR: MMCTOP is a multimodal framework for clinical trial outcome prediction that integrates molecular structures, protocol metadata, eligibility narratives, and disease ontologies using schema-guided textualization and a drug-disease-conditioned sparse Mixture-of-Experts transformer.

DetailsMotivation: Addressing the challenge of multimodal data fusion in high-dimensional biomedical informatics, particularly integrating heterogeneous biomedical signals for clinical trial outcome prediction.

Method: Combines schema-guided textualization and input-fidelity validation with modality-aware representation learning. Uses domain-specific encoders to generate aligned embeddings fused by a transformer backbone with drug-disease-conditioned sparse Mixture-of-Experts (SMoE) using top-k routing for specialization across therapeutic and design subspaces.

Result: Achieves consistent improvements in precision, F1, and AUC over unimodal and multimodal baselines on benchmark datasets. Ablations show schema-guided textualization and selective expert routing contribute materially to performance and stability. Temperature scaling provides calibrated probabilities for reliable risk estimation.

Conclusion: MMCTOP advances multimodal trial modeling by combining controlled narrative normalization, context-conditioned expert fusion, and operational safeguards for auditability and reproducibility in biomedical informatics.

Abstract: Addressing the challenge of multimodal data fusion in high-dimensional biomedical informatics, we propose MMCTOP, a MultiModal Clinical-Trial Outcome Prediction framework that integrates heterogeneous biomedical signals spanning (i) molecular structure representations, (ii) protocol metadata and long-form eligibility narratives, and (iii) disease ontologies. MMCTOP couples schema-guided textualization and input-fidelity validation with modality-aware representation learning, in which domain-specific encoders generate aligned embeddings that are fused by a transformer backbone augmented with a drug-disease-conditioned sparse Mixture-of-Experts (SMoE). This design explicitly supports specialization across therapeutic and design subspaces while maintaining scalable computation through top-k routing. MMCTOP achieves consistent improvements in precision, F1, and AUC over unimodal and multimodal baselines on benchmark datasets, and ablations show that schema-guided textualization and selective expert routing contribute materially to performance and stability. We additionally apply temperature scaling to obtain calibrated probabilities, ensuring reliable risk estimation for downstream decision support. Overall, MMCTOP advances multimodal trial modeling by combining controlled narrative normalization, context-conditioned expert fusion, and operational safeguards aimed at auditability and reproducibility in biomedical informatics.
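
The drug-disease-conditioned sparse MoE follows the standard top-k routing pattern; a minimal PyTorch sketch follows, where the expert width, the number of experts, k, and the additive conditioning on a fused context vector are our illustrative simplifications.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts layer with top-k routing."""
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x, context):
        # Condition routing on the fused drug-disease context vector.
        logits = self.router(x + context)           # (B, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # (B, k)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):                     # only k experts run
            for e in idx[:, j].unique():
                m = idx[:, j] == e
                out[m] += weights[m, j, None] * self.experts[int(e)](x[m])
        return out
```

Only the k selected experts execute per input, which keeps computation scalable while letting experts specialize across therapeutic and design subspaces.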

[268] GQ-VAE: A gated quantized VAE for learning variable length tokens

Theo Datta, Kayla Huang, Sham Kakade, David Brandfonbrener

Main category: cs.LG

TL;DR: GQ-VAE is a neural tokenizer that learns variable-length discrete tokens and can replace existing tokenizers without major architecture changes, improving compression and language modeling performance.

DetailsMotivation: Current frontier models use deterministic BPE tokenization, but neural tokenizers require complex changes to model architecture. There's a need for a learned tokenizer that can be easily integrated as a drop-in replacement.

Method: Propose Gated Quantized Variational Autoencoder (GQ-VAE) that learns to encode variable-length discrete tokens and can be independently pre-trained to serve as a drop-in tokenizer replacement.

Result: GQ-VAE improves compression and language modeling over standard VQ-VAE, approaches BPE performance, and when compression is matched, GQ-VAE improves downstream language model learning compared to BPE with smaller vocabulary.

Conclusion: GQ-VAE offers a promising neural tokenization approach that can replace existing tokenizers without major architectural changes, with several exciting avenues for future work.

Abstract: While most frontier models still use deterministic frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work to design learned neural tokenizers. However, these schemes generally add to underlying language model complexity and force large changes to architecture, making them hard to implement at large scales. To overcome these challenges, we propose the gated quantized variational autoencoder (GQ-VAE), a novel architecture that can be independently pre-trained to serve as a drop-in replacement for existing tokenizers. The key innovation of the architecture is to learn to encode variable-length discrete tokens. GQ-VAE improves compression and language modeling performance over a standard VQ-VAE tokenizer, and approaches the compression rate and language modeling performance of BPE. Interestingly, if we use BPE with a smaller vocabulary, such that the compression is equivalent between GQ-VAE and BPE, we find that GQ-VAE improves downstream language model learning. We conclude with a discussion of several exciting avenues for future work. Code can be found at https://github.com/Theo-Datta-115/gq-vae.

[269] Exploring the Heterogeneity of Tabular Data: A Diversity-aware Data Generator via LLMs

Yafeng Tang, Xiaoou Ding, Jianzhuo Du, Zishuo Yan, Zhuang Ma, Zheng Liang, Zekai Qian, Hongzhi Wang

Main category: cs.LG

TL;DR: DATE is a framework for generating diverse, high-quality tabular data by partitioning heterogeneous data, using LLMs with decision tree feedback, and balancing diversity-quality trade-offs via Multi-Arm Bandit sampling.

DetailsMotivation: Real-world tabular data is heterogeneous with diverse distributions, making it challenging to create a single generative model that works well for all data types. Existing approaches struggle with this heterogeneity and the trade-off between diversity and quality in generated data.

Method: 1) Partition heterogeneous data into diverse subsets; 2) Use LLMs with decision tree reasoning as feedback to generate labeled data for each subset; 3) Employ Multi-Arm Bandit-based sampling algorithm to balance diversity and quality (proving greedy selection doesn’t work in heterogeneous settings).

Result: DATE outperforms state-of-the-art GAN-based and LLM-based methods on tabular classification/regression benchmarks, achieving 23.75% average error rate reduction with just 100 generated samples. Generated data also improves DPO accuracy and enhances LLM reasoning on target data.

Conclusion: DATE effectively addresses tabular data heterogeneity by combining data partitioning, LLM-based generation with decision tree feedback, and intelligent sampling to balance diversity and quality, enabling robust machine learning applications with limited generated data.

Abstract: Tabular data generation has become increasingly essential for enabling robust machine learning applications, which require large-scale, high-quality data. Existing solutions leverage generative models to learn original data distributions. However, real-world data are naturally heterogeneous with diverse distributions, making it challenging to obtain a universally good model for diverse data generation. To address this limitation, we introduce Diversity-Aware Tabular data gEnerator (DATE), a framework that (i) prepares high-quality and distributionally distinct examples for in-context learning by effectively partitioning the original heterogeneous data into multiple diverse subsets; (ii) harnesses Large Language Models (LLMs) to explore the diversity of the partitioned distribution with decision tree reasoning as feedback, generating high-quality labeled data for each subset. However, generating data at scale inherently involves a trade-off between diversity and quality. To manage this trade-off, existing solutions greedily select the validation-best data. However, we prove that the selection in heterogeneous settings does not possess the greedy-choice property, and design a Multi-Arm Bandit-based sampling algorithm that balances the diversity and quality of generated data. Extensive experiments on tabular classification and regression benchmarks demonstrate that DATE consistently outperforms state-of-the-art GAN-based and LLM-based methods. On average, DATE achieves a 23.75% reduction in error rate with just 100 generated samples. Empirically, we demonstrate that data generated by DATE can improve the accuracy of Direct Preference Optimization (DPO) and enhance the reasoning capability of LLMs on the target data. Code is available at https://github.com/windblow32/DATE.

[270] Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Nathan Kallus

Main category: cs.LG

TL;DR: The paper proposes robust policy alignment methods for LLMs that don’t assume a known link function between preferences and rewards, developing link-agnostic learners with finite-sample error bounds.

DetailsMotivation: Current preference alignment methods assume a known link function (like logistic in Bradley-Terry models), but if this assumption is wrong, it leads to biased rewards and misaligned policies. The paper aims to develop robust methods that work with unknown and unrestricted link functions.

Method: The authors formulate policy alignment as an f-divergence-constrained reward maximization problem, showing it implies a semiparametric single-index binary choice model. They develop three types of policy learners: 1) profiling the link function, 2) orthogonalizing the link function, and 3) using link-agnostic bipartite ranking objectives. The methods use first-order optimization suitable for neural networks and batched data.

Result: The paper provides finite-sample policy error bounds that depend on functional complexity measures of the index class. The methods are robust to unknown preference noise distribution and scale, and can directly optimize policies without explicitly fitting rewards.

Conclusion: The proposed link-agnostic policy alignment methods offer robustness against misspecified link functions while maintaining the ability to directly optimize policies, making them more reliable for aligning large language models to preference data.

Abstract: Aligning large language models to preference data is commonly implemented by assuming a known link function between the distribution of observed preferences and the unobserved rewards (e.g., a logistic link as in Bradley-Terry). If the link is wrong, however, inferred rewards can be biased and policies be misaligned. We study policy alignment to preferences under an unknown and unrestricted link. We consider an $f$-divergence-constrained reward maximization problem and show that realizability of the solution in a policy class implies a semiparametric single-index binary choice model, where a scalar-valued index determined by a policy captures the dependence on demonstrations and the rest of the preference distribution is an unrestricted function thereof. Rather than focus on estimation of identifiable finite-dimensional structural parameters in the index as in econometrics, we focus on policy learning, targeting the error relative to the optimal policy and allowing unidentifiable and nonparametric indices. We develop a variety of policy learners based on profiling the link function, orthogonalizing the link function, and using link-agnostic bipartite ranking objectives. We analyze these and provide finite-sample policy error bounds that depend on generic functional complexity measures of the index class. We further consider practical implementations using first-order optimization suited to neural networks and batched data. The resulting methods are robust to unknown preference noise distribution and scale, while preserving the direct optimization of policies without explicitly fitting rewards.
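
For concreteness, the semiparametric single-index binary choice model implied by realizability takes the form

```latex
\Pr\big(y = 1 \mid x\big) \;=\; g\big(s_\pi(x)\big), \qquad g : \mathbb{R} \to [0, 1] \ \text{unknown},
```

where $s_\pi$ is the scalar index determined by a policy (the notation $s_\pi$ is ours for exposition) and $g$ absorbs the rest of the preference distribution; the Bradley-Terry setting is recovered as the special case $g(t) = 1/(1 + e^{-t})$.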

[271] Hybrid Combinatorial Multi-armed Bandits with Probabilistically Triggered Arms

Kongchang Zhou, Tingyu Zhang, Wei Chen, Fang Kong

Main category: cs.LG

TL;DR: Hybrid CMAB-T framework combines offline data with online interaction to overcome limitations of purely online or offline combinatorial multi-armed bandits with probabilistically triggered arms.

DetailsMotivation: Online CMAB-T suffers from high interaction costs and slow adaptation, while offline methods are constrained by dataset quality and lack exploration. The authors aim to address these complementary weaknesses by integrating both paradigms.

Method: Proposes hybrid CMAB-T framework and hybrid CUCB algorithm that leverages offline data to guide exploration and accelerate convergence, while strategically incorporating online interactions to mitigate dataset limitations.

Result: Theoretical guarantees show hybrid CUCB significantly outperforms purely online approaches with high-quality offline data, and effectively corrects bias in offline-only methods with limited/misaligned data. Empirical results demonstrate consistent advantages.

Conclusion: Hybrid CMAB-T successfully integrates offline and online learning, overcoming limitations of each paradigm individually and providing a principled approach to leverage both data sources for improved performance.

Abstract: The problem of combinatorial multi-armed bandits with probabilistically triggered arms (CMAB-T) has been extensively studied. Prior work primarily focuses on either the online setting where an agent learns about the unknown environment through iterative interactions, or the offline setting where a policy is learned solely from logged data. However, each of these paradigms has inherent limitations: online algorithms suffer from high interaction costs and slow adaptation, while offline methods are constrained by dataset quality and lack of exploration capabilities. To address these complementary weaknesses, we propose hybrid CMAB-T, a new framework that integrates offline data with online interaction in a principled manner. Our proposed hybrid CUCB algorithm leverages offline data to guide exploration and accelerate convergence, while strategically incorporating online interactions to mitigate the insufficient coverage or distributional bias of the offline dataset. We provide theoretical guarantees on the algorithm’s regret, demonstrating that hybrid CUCB significantly outperforms purely online approaches when high-quality offline data is available, and effectively corrects the bias inherent in offline-only methods when the data is limited or misaligned. Empirical results further demonstrate the consistent advantage of our algorithm.
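
The warm-start principle behind hybrid CUCB can be illustrated on a plain multi-armed bandit; a minimal sketch follows in which logged counts and means initialize the online UCB statistics (the combinatorial actions and probabilistically triggered arms of the actual algorithm are omitted).

```python
import numpy as np

def hybrid_ucb(offline_counts, offline_means, pull, horizon, alpha=2.0):
    """UCB warm-started from offline data; `pull(arm)` is the online env."""
    n = offline_counts.astype(float).copy()
    mu = offline_means.astype(float).copy()
    for _ in range(horizon):
        total = n.sum() + 1.0
        bonus = np.sqrt(alpha * np.log(total) / np.maximum(n, 1e-9))
        arm = int(np.argmax(mu + bonus))   # uncovered arms get a huge bonus
        reward = pull(arm)
        n[arm] += 1.0
        mu[arm] += (reward - mu[arm]) / n[arm]  # running mean update
    return mu, n
```

Arms well covered offline start with tight confidence bonuses, so little exploration is wasted on them, while poorly covered or biased offline estimates are corrected as online counts grow — the two regimes the regret guarantees address.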

[272] DuaDeep-SeqAffinity: Dual-Stream Deep Learning Framework for Sequence-Only Antigen-Antibody Affinity Prediction

Aicha Boutorh, Soumia Bouyahiaoui, Sara Belhadj, Nour El Yakine Guendouz, Manel Kara Laouar

Main category: cs.LG

TL;DR: DuaDeep-SeqAffinity is a sequence-only deep learning framework that predicts antibody-antigen binding affinity using ESM-2 embeddings, CNN for local motifs, and Transformer for global context, outperforming both sequence-only and structure-sequence hybrid methods.

DetailsMotivation: Traditional affinity prediction methods rely on scarce and expensive 3D structures. There's a need for scalable, structure-independent approaches that can handle large sequence libraries for drug discovery and vaccine development.

Method: Dual-stream hybrid architecture using pre-trained ESM-2 protein language model embeddings. Combines 1D CNNs for local motif detection with Transformer encoders for global contextual representation, followed by fusion module and fully connected network for regression.

Result: Superior performance with Pearson correlation of 0.688, R² of 0.460, RMSE of 0.737, and AUC of 0.890. Outperforms single-branch variants (ESM-CNN, ESM-Transformer) and existing SOTA methods, including structure-sequence hybrid models.

Conclusion: Sequence-only models can capture essential binding patterns without 3D structures, providing scalable, efficient solutions for high-throughput screening that accelerate therapeutic discovery pipelines.

Abstract: Predicting the binding affinity between antigens and antibodies is fundamental to drug discovery and vaccine development. Traditional computational approaches often rely on experimentally determined 3D structures, which are scarce and computationally expensive to obtain. This paper introduces DuaDeep-SeqAffinity, a novel sequence-only deep learning framework that predicts affinity scores solely from amino acid sequences using a dual-stream hybrid architecture. Our approach leverages pre-trained ESM-2 protein language model embeddings, combining 1D Convolutional Neural Networks (CNNs) for local motif detection with Transformer encoders for global contextual representation. A subsequent fusion module integrates these multi-faceted features, which are then passed to a fully connected network for final score regression. Experimental results demonstrate that DuaDeep-SeqAffinity significantly outperforms individual architectural components and existing state-of-the-art (SOTA) methods. DuaDeep achieved a superior Pearson correlation of 0.688, an $R^2$ of 0.460, and a Root Mean Square Error (RMSE) of 0.737, surpassing single-branch variants ESM-CNN and ESM-Transformer. Notably, the model achieved an Area Under the Curve (AUC) of 0.890, outperforming sequence-only benchmarks and even surpassing structure-sequence hybrid models. These findings prove that high-fidelity sequence embeddings can capture essential binding patterns typically reserved for structural modeling. By eliminating the reliance on 3D structures, DuaDeep-SeqAffinity provides a highly scalable and efficient solution for high-throughput screening of vast sequence libraries, significantly accelerating the therapeutic discovery pipeline.

[273] HWL-HIN: A Hypergraph-Level Hypergraph Isomorphism Network as Powerful as the Hypergraph Weisfeiler-Lehman Test with Application to Higher-Order Network Robustness

Chengyu Tian, Wenbin Pei

Main category: cs.LG

TL;DR: Proposes Hypergraph Isomorphism Network (HIN) framework for hypergraph robustness prediction, achieving theoretical expressive power equivalent to Hypergraph Weisfeiler-Lehman test while outperforming existing methods.

DetailsMotivation: Conventional attack-based robustness assessment is computationally expensive, and existing deep learning methods (CNNs, GNNs) fail to capture complex higher-order correlations in real-world systems that are naturally modeled as hypergraphs. Current Hypergraph Neural Networks (HGNNs) have limited topological expressive power.

Method: Proposes a hypergraph-level Hypergraph Isomorphism Network framework inspired by Graph Isomorphism Networks. Theoretically proven to have expressive power strictly equivalent to the Hypergraph Weisfeiler-Lehman test, and applied to predict hypergraph robustness.

Result: Experimental results show the proposed method maintains superior training and prediction efficiency while outperforming existing graph-based models and significantly surpassing conventional HGNNs in tasks prioritizing topological structure representation.

Conclusion: The Hypergraph Isomorphism Network framework provides an effective solution for hypergraph robustness prediction with strong theoretical guarantees and practical performance advantages over existing approaches.

Abstract: Robustness in complex systems is of significant engineering and economic importance. However, conventional attack-based a posteriori robustness assessments incur prohibitive computational overhead. Recently, deep learning methods, such as Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), have been widely employed as surrogates for rapid robustness prediction. Nevertheless, these methods neglect the complex higher-order correlations prevalent in real-world systems, which are naturally modeled as hypergraphs. Although Hypergraph Neural Networks (HGNNs) have been widely adopted for hypergraph learning, their topological expressive power has not yet reached the theoretical upper bound. To address this limitation, inspired by Graph Isomorphism Networks, this paper proposes a hypergraph-level Hypergraph Isomorphism Network framework. Theoretically, this approach is proven to possess an expressive power strictly equivalent to the Hypergraph Weisfeiler-Lehman test and is applied to predict hypergraph robustness. Experimental results demonstrate that while maintaining superior efficiency in training and prediction, the proposed method not only outperforms existing graph-based models but also significantly surpasses conventional HGNNs in tasks that prioritize topological structure representation.

[274] Direction Finding with Sparse Arrays Based on Variable Window Size Spatial Smoothing

Wesley S. Leite, Rodrigo C. de Lamare, Yuriy Zakharov, Wei Liu, Martin Haardt

Main category: cs.LG

TL;DR: VWS spatial smoothing framework improves DOA estimation for sparse arrays by compressing smoothing aperture, replacing perturbed terms with unperturbed low-rank terms to enhance signal-noise separation.

DetailsMotivation: Coarray-based DOA estimation for sparse linear arrays suffers from performance limitations due to perturbed rank-one outer products in smoothed coarray data, which reduces separation between signal and noise subspaces.

Method: Proposes VWS-CA-MUSIC and VWS-CA-rMUSIC algorithms that use variable window size spatial smoothing to compress the smoothing aperture, replacing perturbed outer products with unperturbed low-rank additional terms while preserving signal subspace span.

Result: Derives identifiability bounds for compression parameter; simulations show significant performance improvements and complexity savings compared to fixed-window coarray MUSIC method for sparse array geometries.

Conclusion: VWS spatial smoothing framework effectively enhances coarray-based DOA estimation by improving signal-noise subspace separation while reducing computational complexity, making it suitable for sparse array applications.

Abstract: In this work, we introduce a variable window size (VWS) spatial smoothing framework that enhances coarray-based direction of arrival (DOA) estimation for sparse linear arrays. By compressing the smoothing aperture, the proposed VWS Coarray MUSIC (VWS-CA-MUSIC) and VWS Coarray root-MUSIC (VWS-CA-rMUSIC) algorithms replace part of the perturbed rank-one outer products in the smoothed coarray data with unperturbed low-rank additional terms, increasing the separation between signal and noise subspaces, while preserving the signal subspace span. We also derive bounds that guarantee identifiability by limiting the admissible values of the compression parameter. Simulations with sparse geometries reveal significant performance improvements and complexity savings relative to the fixed-window coarray MUSIC method.
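
As background, classic fixed-window spatial smoothing — the baseline that VWS generalizes — averages the covariances of sliding subarrays; a minimal numpy sketch follows (the variable-window compression itself is the paper's contribution and is not reproduced here).

```python
import numpy as np

def spatial_smoothing(R, L):
    """Forward spatial smoothing of an N x N (coarray) covariance R:
    average all length-L sliding principal submatrices."""
    N = R.shape[0]
    K = N - L + 1                        # number of subarrays
    R_ss = np.zeros((L, L), dtype=complex)
    for k in range(K):
        R_ss += R[k:k + L, k:k + L]
    return R_ss / K
```

Fixed-window methods apply this with a single window size L; VWS instead compresses the smoothing aperture, which is what replaces perturbed rank-one terms with unperturbed low-rank ones.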

[275] LibContinual: A Comprehensive Library towards Realistic Continual Learning

Wenbin Li, Shangge Liu, Borui Kang, Yiyang Chen, KaXuan Lew, Yang Chen, Yinghuan Shi, Lei Wang, Yang Gao, Jiebo Luo

Main category: cs.LG

TL;DR: LibContinual is a unified library for continual learning that standardizes evaluation and reveals performance gaps in existing methods under realistic constraints.

DetailsMotivation: Continual learning research suffers from fragmentation - inconsistent implementations, conflicting dependencies, and varying evaluation protocols make fair comparisons and reproducibility difficult. Existing evaluations often rely on unrealistic assumptions that overestimate real-world applicability.

Method: Developed LibContinual, a comprehensive library with high-cohesion, low-coupling modular architecture integrating 19 representative algorithms across 5 methodological categories. Used this framework to systematically investigate three implicit assumptions: offline data accessibility, unregulated memory resources, and intra-task semantic homogeneity.

Result: When subjected to strict online CL settings, unified memory budget protocols, and category-randomized settings, many representative CL methods show significant performance drops. The study reveals that existing methods are less effective under realistic constraints than previously assumed.

Conclusion: Resource-aware and semantically robust CL strategies are necessary for real-world applications. LibContinual serves as a foundational toolkit for future research in realistic continual learning, addressing the fragmentation problem and enabling standardized evaluation.

Abstract: A fundamental challenge in Continual Learning (CL) is catastrophic forgetting, where adapting to new tasks degrades the performance on previous ones. While the field has evolved rapidly, the resulting surge of diverse methodologies has culminated in a fragmented research landscape. The lack of a unified framework, including inconsistent implementations, conflicting dependencies, and varying evaluation protocols, makes fair comparison and reproducible research increasingly difficult. To address this challenge, we propose LibContinual, a comprehensive and reproducible library designed to serve as a foundational platform for realistic CL. Built upon a high-cohesion, low-coupling modular architecture, LibContinual integrates 19 representative algorithms across five major methodological categories, providing a standardized execution environment. Meanwhile, leveraging this unified framework, we systematically identify and investigate three implicit assumptions prevalent in mainstream evaluation: (1) offline data accessibility, (2) unregulated memory resources, and (3) intra-task semantic homogeneity. We argue that these assumptions often overestimate the real-world applicability of CL methods. Through our comprehensive analysis using strict online CL settings, a novel unified memory budget protocol, and a proposed category-randomized setting, we reveal significant performance drops in many representative CL methods when subjected to these real-world constraints. Our study underscores the necessity of resource-aware and semantically robust CL strategies, and offers LibContinual as a foundational toolkit for future research in realistic continual learning. The source code is available from \href{https://github.com/RL-VIG/LibContinual}{https://github.com/RL-VIG/LibContinual}.

[276] From In Silico to In Vitro: Evaluating Molecule Generative Models for Hit Generation

Nagham Osman, Vittorio Lembo, Giovanni Bottegoni, Laura Toni

Main category: cs.LG

TL;DR: Generative models can effectively generate hit-like molecules as a standalone task in drug discovery, with synthesized GSK-3β hits showing in vitro activity, though current evaluation metrics and training data have limitations.

DetailsMotivation: Traditional hit identification in drug discovery is resource-intensive, and while generative models show promise for molecular generation, using ML to replace the entire pipeline is challenging. This study investigates whether generative models can specifically replace the hit-like molecule generation step as a standalone task.

Method: Proposed an evaluation framework with physicochemical, structural, and bioactivity criteria to define hit-like chemical space. Benchmarked two autoregressive and one diffusion-based generative models across various datasets and training settings. Used standard metrics and target-specific docking scores for assessment, with synthesized GSK-3β hits tested in vitro.

Result: Models generated valid, diverse, and biologically relevant compounds across multiple targets. Selected GSK-3β hits were synthesized and confirmed active in vitro. Identified limitations in current evaluation metrics and available training data.

Conclusion: Generative models can successfully generate hit-like molecules as a standalone task to support or substitute traditional hit identification workflows, though improvements in evaluation metrics and training data are needed.

Abstract: Hit identification is a critical yet resource-intensive step in the drug discovery pipeline, traditionally relying on high-throughput screening of large compound libraries. Despite advancements in virtual screening, these methods remain time-consuming and costly. Recent progress in deep learning has enabled the development of generative models capable of learning complex molecular representations and generating novel compounds de novo. However, using ML to replace the entire drug-discovery pipeline is highly challenging. In this work, we rather investigate whether generative models can replace one step of the pipeline: hit-like molecule generation. To the best of our knowledge, this is the first study to explicitly frame hit-like molecule generation as a standalone task and empirically test whether generative models can directly support this stage of the drug discovery pipeline. Specifically, we investigate if such models can be trained to generate hit-like molecules, enabling direct incorporation into, or even substitution of, traditional hit identification workflows. We propose an evaluation framework tailored to this task, integrating physicochemical, structural, and bioactivity-related criteria within a multi-stage filtering pipeline that defines the hit-like chemical space. Two autoregressive and one diffusion-based generative models were benchmarked across various datasets and training settings, with outputs assessed using standard metrics and target-specific docking scores. Our results show that these models can generate valid, diverse, and biologically relevant compounds across multiple targets, with a few selected GSK-3$\beta$ hits synthesized and confirmed active in vitro. We also identify key limitations in current evaluation metrics and available training data.

[277] Why Smooth Stability Assumptions Fail for ReLU Learning

Ronald Katende

Main category: cs.LG

TL;DR: ReLU networks violate smoothness assumptions needed for classical stability analyses, requiring nonsmooth-aware frameworks.

DetailsMotivation: Modern learning systems stability analyses rely on smoothness assumptions that ReLU networks violate, creating a gap between theoretical stability requirements and practical empirical observations.

Method: The paper provides a concrete counterexample demonstrating failure of classical stability bounds for ReLU networks, identifies minimal generalized derivative conditions, and shows why smooth ReLU approximations are misleading.

Result: No uniform smoothness-based stability proxy (gradient Lipschitzness or Hessian control) can hold globally for ReLU networks, even in simple empirically stable settings.

Conclusion: Classical smoothness-based stability analyses fail for ReLU networks, motivating the need for nonsmooth-aware stability frameworks that work with generalized derivative conditions.

Abstract: Stability analyses of modern learning systems are frequently derived under smoothness assumptions that are violated by ReLU-type nonlinearities. In this note, we isolate a minimal obstruction by showing that no uniform smoothness-based stability proxy such as gradient Lipschitzness or Hessian control can hold globally for ReLU networks, even in simple settings where training trajectories appear empirically stable. We give a concrete counterexample demonstrating the failure of classical stability bounds and identify a minimal generalized derivative condition under which stability statements can be meaningfully restored. The result clarifies why smooth approximations of ReLU can be misleading and motivates nonsmooth-aware stability frameworks.
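
The obstruction is already visible in one dimension:

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{ReLU}'(x) = \mathbf{1}\{x > 0\} \ \ (x \neq 0),
```

so for any $x > 0 > y$ the derivative gap $|\mathrm{ReLU}'(x) - \mathrm{ReLU}'(y)| = 1$ persists while $|x - y|$ shrinks to zero: no finite gradient-Lipschitz constant holds on any neighborhood of the kink, and the second derivative exists only as the Dirac mass $\delta_0$, ruling out uniform Hessian control as well.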

[278] Scaling Adversarial Training via Data Selection

Youran Ye, Dejin Wang, Ajinkya Bhandare

Main category: cs.LG

TL;DR: Selective Adversarial Training reduces computational cost by only perturbing critical samples in each minibatch, achieving comparable robustness with 50% less adversarial computation.

DetailsMotivation: PGD adversarial training is computationally expensive because all training samples undergo identical iterative optimization despite contributing unequally to robustness. This inefficiency motivates a more selective approach.

Method: Proposes Selective Adversarial Training with two selection criteria: 1) margin-based sampling (prioritizes samples near decision boundary), and 2) gradient-matching sampling (selects samples whose gradients align with dominant batch optimization direction). Only selected subset gets adversarial examples, while remaining samples are trained cleanly using mixed objective.

Result: Experiments on MNIST and CIFAR-10 show comparable or better robustness than full PGD adversarial training while reducing adversarial computation by up to 50%.

Conclusion: Informed sample selection is sufficient for scalable adversarial robustness, demonstrating that not all samples need adversarial perturbation to achieve effective robustness.

Abstract: Projected Gradient Descent (PGD) is a strong and widely used first-order adversarial attack, yet its computational cost scales poorly, as all training samples undergo identical iterative inner-loop optimization despite contributing unequally to robustness. Motivated by this inefficiency, we propose \emph{Selective Adversarial Training}, which perturbs only a subset of critical samples in each minibatch. Specifically, we introduce two principled selection criteria: (1) margin-based sampling, which prioritizes samples near the decision boundary, and (2) gradient-matching sampling, which selects samples whose gradients align with the dominant batch optimization direction. Adversarial examples are generated only for the selected subset, while the remaining samples are trained cleanly using a mixed objective. Experiments on MNIST and CIFAR-10 show that the proposed methods achieve robustness comparable to, or even exceeding, full PGD adversarial training, while reducing adversarial computation by up to 50%, demonstrating that informed sample selection is sufficient for scalable adversarial robustness.
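
A minimal sketch of the margin-based criterion (PyTorch; the selection fraction and the use of the top-1/top-2 logit gap as the margin are assumptions consistent with the paper's description):

```python
import torch

def margin_select(logits: torch.Tensor, frac: float = 0.5) -> torch.Tensor:
    """Return indices of the `frac` of samples closest to the decision boundary,
    measured by the gap between the top-1 and top-2 logits."""
    top2 = logits.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]          # small margin = near the boundary
    k = max(1, int(frac * logits.size(0)))
    return margin.topk(k, largest=False).indices
```

Only the returned indices would receive PGD perturbations; the rest of the minibatch contributes a clean cross-entropy term to the mixed objective.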

[279] Explainable Multimodal Regression via Information Decomposition

Zhaozhao Ma, Shujian Yu

Main category: cs.LG

TL;DR: A novel multimodal regression framework using Partial Information Decomposition (PID) to quantify modality contributions and interactions, with Gaussianity assumptions enabling analytical computation and improved interpretability.

DetailsMotivation: Existing multimodal regression methods lack principled tools to disentangle and quantify individual modality contributions and their interactions, limiting interpretability of multimodal fusion.

Method: Proposes a multimodal regression framework grounded in Partial Information Decomposition (PID) that decomposes modality-specific representations into unique, redundant, and synergistic components. Introduces Gaussianity assumptions in the joint distribution of latent representations and transformed response variable to resolve PID underdetermination, and derives a closed-form conditional independence regularizer to isolate unique information.

Result: Experiments on six real-world datasets, including brain age prediction from multimodal neuroimaging data, demonstrate superior performance over state-of-the-art methods in both predictive accuracy and interpretability, while enabling informed modality selection for efficient inference.

Conclusion: The proposed PID-based framework provides a principled approach for quantifying modality contributions in multimodal regression, improving both accuracy and interpretability while facilitating modality selection for efficient inference.

Abstract: Multimodal regression aims to predict a continuous target from heterogeneous input sources and typically relies on fusion strategies such as early or late fusion. However, existing methods lack principled tools to disentangle and quantify the individual contributions of each modality and their interactions, limiting the interpretability of multimodal fusion. We propose a novel multimodal regression framework grounded in Partial Information Decomposition (PID), which decomposes modality-specific representations into unique, redundant, and synergistic components. The basic PID framework is inherently underdetermined. To resolve this, we introduce inductive bias by enforcing Gaussianity in the joint distribution of latent representations and the transformed response variable (after inverse normal transformation), thereby enabling analytical computation of the PID terms. Additionally, we derive a closed-form conditional independence regularizer to promote the isolation of unique information within each modality. Experiments on six real-world datasets, including a case study on large-scale brain age prediction from multimodal neuroimaging data, demonstrate that our framework outperforms state-of-the-art methods in both predictive accuracy and interpretability, while also enabling informed modality selection for efficient inference. Implementation is available at https://github.com/zhaozhaoma/PIDReg.
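
The appeal of the Gaussianity assumption is that information quantities become closed-form functions of covariance blocks. A minimal sketch of one such building block (the specific PID decomposition built on top of terms like this is the paper's):

```python
import numpy as np

def gaussian_mi(cov: np.ndarray, a: list, b: list) -> float:
    """I(A; B) for jointly Gaussian blocks indexed by a and b:
    0.5 * log( det(C_aa) * det(C_bb) / det(C_abab) )."""
    ab = list(a) + list(b)
    det = np.linalg.det
    return 0.5 * np.log(det(cov[np.ix_(a, a)]) * det(cov[np.ix_(b, b)])
                        / det(cov[np.ix_(ab, ab)]))
```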

[280] SCALA: Split Federated Learning with Concatenated Activations and Logit Adjustments

Jiarong Yang, Yuan Liu

Main category: cs.LG

TL;DR: SCALA improves Split Federated Learning by concatenating client activations and applying logit adjustments to handle label distribution skew from data heterogeneity and partial client participation.

DetailsMotivation: Data heterogeneity and partial client participation in Split Federated Learning cause label distribution skew, which severely degrades learning performance. Existing SFL approaches struggle with this issue.

Method: Proposes SCALA: 1) Concatenates activations from client-side models as input to server-side model to centrally adjust label distribution across clients, 2) Performs logit adjustments of loss functions on both server-side and client-side models to handle label distribution variation across different subsets of participating clients.

Result: Theoretical analysis and experimental results on public datasets verify the superiority of SCALA over existing approaches in handling label distribution skew in Split Federated Learning.

Conclusion: SCALA effectively addresses label distribution skew in Split Federated Learning through concatenated activations and logit adjustments, improving learning performance in heterogeneous and partially participating client environments.

Abstract: Split Federated Learning (SFL) is a distributed machine learning framework which strategically divides the learning process between a server and clients and collaboratively trains a shared model by aggregating local models updated based on data from distributed clients. However, data heterogeneity and partial client participation result in label distribution skew, which severely degrades the learning performance. To address this issue, we propose SFL with Concatenated Activations and Logit Adjustments (SCALA). Specifically, the activations from the client-side models are concatenated as the input of the server-side model so as to centrally adjust label distribution across different clients, and logit adjustments of loss functions on both server-side and client-side models are performed to deal with the label distribution variation across different subsets of participating clients. Theoretical analysis and experimental results verify the superiority of the proposed SCALA on public datasets.
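
The abstract does not spell out the exact adjustment, but the standard prior-based form gives the flavor; a sketch assuming the adjustment adds (scaled) log label priors to the logits:

```python
import torch

def adjust_logits(logits: torch.Tensor, class_counts: torch.Tensor,
                  tau: float = 1.0) -> torch.Tensor:
    """Generic logit adjustment: offset logits by the log of the current label
    distribution so skewed classes do not dominate the cross-entropy loss."""
    prior = class_counts.float() / class_counts.sum()
    return logits + tau * torch.log(prior + 1e-12)  # broadcasts over the batch
```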

[281] Error Detection and Constraint Recovery in Hierarchical Multi-Label Classification without Prior Knowledge

Joshua Shay Kricheli, Khoa Vo, Aniruddha Datta, Spencer Ozgur, Paulo Shakarian

Main category: cs.LG

TL;DR: Proposes learning explainable error detection rules from ML model failures to recover constraints for hierarchical multi-label classification, relaxing the need for a-priori constraints.

DetailsMotivation: Current neurosymbolic HMC approaches require pre-existing constraints, which is a strong assumption. The paper aims to relax this by learning constraints from model failure patterns.

Method: Uses Error Detection Rules (EDR) to learn explainable rules about ML model failure modes, then leverages these rules as constraints for hierarchical multi-label classification.

Result: The approach effectively detects ML errors, recovers constraints, is noise tolerant, and serves as a knowledge source for neurosymbolic models across multiple datasets, including a newly introduced military vehicle recognition dataset.

Conclusion: EDR-based approach successfully learns explainable constraints from model failures, enabling constraint recovery for HMC without requiring a-priori constraints.

Abstract: Recent advances in Hierarchical Multi-label Classification (HMC), particularly neurosymbolic-based approaches, have demonstrated improved consistency and accuracy by enforcing constraints on a neural model during training. However, such work assumes the existence of such constraints a-priori. In this paper, we relax this strong assumption and present an approach based on Error Detection Rules (EDR) that allow for learning explainable rules about the failure modes of machine learning models. We show that these rules are not only effective in detecting when a machine learning classifier has made an error but also can be leveraged as constraints for HMC, thereby allowing the recovery of explainable constraints even if they are not provided. We show that our approach is effective in detecting machine learning errors and recovering constraints, is noise tolerant, and can function as a source of knowledge for neurosymbolic models on multiple datasets, including a newly introduced military vehicle recognition dataset.
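
As a toy illustration of rule-style error detection (single-condition rules only; the paper's EDR formalism is richer):

```python
from collections import defaultdict

def mine_error_rules(samples, min_support=20, min_error_rate=0.5):
    """Flag (attribute, value) conditions under which a classifier is error-prone.
    `samples` is a list of (attrs: dict, err: bool) pairs; returned conditions
    can then act as constraints ('if condition holds, distrust the prediction')."""
    stats = defaultdict(lambda: [0, 0])      # (attr, value) -> [errors, total]
    for attrs, err in samples:
        for kv in attrs.items():
            stats[kv][0] += int(err)
            stats[kv][1] += 1
    return [kv for kv, (e, n) in stats.items()
            if n >= min_support and e / n >= min_error_rate]
```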

[282] Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi

Main category: cs.LG

TL;DR: LLMs can perform reinforcement learning during inference through multi-round prompting with reward feedback, enabling self-improvement on various tasks.

DetailsMotivation: To discover and leverage the emergent reinforcement learning capabilities of large language models during inference time for self-improvement on tasks.

Method: ICRL prompting: multi-round framework where LLMs receive numerical scalar feedback (rewards) after each response, with subsequent prompts including all prior responses and rewards, allowing the model to optimize based on reward signals.

Result: Response quality consistently improves as context grows; significant improvements over baselines (Self-Refine, Reflexion) on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions; works even when reward signals are generated by the same LLM.

Conclusion: LLMs exhibit emergent in-context reinforcement learning capabilities during inference, enabling self-improvement through reward optimization, representing a promising paradigm for test-time scaling.

Abstract: Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
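
The prompting loop itself is simple; a sketch with hypothetical `llm` and `reward_fn` callables (both are placeholders, not an API from the paper):

```python
def icrl_prompting(task: str, llm, reward_fn, rounds: int = 5):
    """Multi-round ICRL prompting: feed back all prior responses and rewards.
    llm(prompt) -> str and reward_fn(response) -> float are hypothetical."""
    history = []                             # (response, reward) pairs
    for _ in range(rounds):
        context = "\n".join(f"Response: {r}\nReward: {s}" for r, s in history)
        prompt = f"{task}\n{context}\nGive a new response that earns a higher reward."
        response = llm(prompt)
        history.append((response, reward_fn(response)))
    return max(history, key=lambda rs: rs[1])   # best-scoring response
```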

[283] Revisiting Bi-Encoder Neural Search: An Encoding–Searching Separation Perspective

Hung-Nghiep Tran, Akiko Aizawa, Atsuhiro Takasu

Main category: cs.LG

TL;DR: The paper analyzes bi-encoder architecture issues in neural search, identifies encoding bottleneck and embedding search limitations, proposes an encoding-searching separation perspective to improve performance and reduce training costs.

DetailsMotivation: Bi-encoder architectures are widely used for neural search due to simplicity and scalability, but suffer from low performance on seen datasets and weak zero-shot performance on new datasets. The paper aims to understand and address these fundamental limitations.

Method: The paper analyzes bi-encoder issues, identifies two main critiques (encoding information bottleneck and limitations of embedding search assumptions), conducts thought experiments to challenge basic assumptions, and proposes a new encoding-searching separation perspective that conceptually separates encoding and searching operations.

Result: The proposed encoding-searching separation framework explains root causes of existing issues, suggests mitigation strategies, potentially lowers training costs, and improves retrieval performance. It also exposes new design surfaces for neural search systems.

Conclusion: The encoding-searching separation perspective provides a new framework for understanding and improving bi-encoder architectures, with broader implications for neural search design and promising research directions for future work.

Abstract: This paper reviews, analyzes, and proposes a new perspective on the bi-encoder architecture for neural search. While the bi-encoder architecture is widely used due to its simplicity and scalability at test time, it has some notable issues such as low performance on seen datasets and weak zero-shot performance on new datasets. In this paper, we analyze these issues and summarize two main critiques: the encoding information bottleneck problem and limitations of the basic assumption of embedding search. We then construct a thought experiment to logically analyze the encoding and searching operations and challenge the basic assumptions of embedding search. Building on these observations, we propose a new perspective on the bi-encoder architecture called the \textit{encoding–searching separation} perspective, which conceptually and practically separates the encoding and searching operations. This framework is applied to explain the root cause of existing issues and suggest mitigation strategies, potentially lowering training costs and improving retrieval performance. Finally, we discuss the broader implications of the ideas underlying this perspective, the new design surface it exposes, and potential research directions arising from it.
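
The separation being argued for is easiest to see in code: encoding maps text to vectors once, while searching is a distinct operation over those vectors. A minimal sketch (hypothetical `encoder` callable):

```python
import numpy as np

def encode(texts, encoder):
    """Encoding operation: each text becomes one fixed-size vector."""
    return np.stack([encoder(t) for t in texts])

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Searching operation: similarity ranking over precomputed embeddings.
    The perspective paper asks what is lost by forcing search into this form."""
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]
```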

[284] Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors

Saad Masrur, Jung-Fu Cheng, Atieh R. Khamesi, Ismail Guvenc

Main category: cs.LG

TL;DR: A novel tokenization method (SST) and lightweight Transformer model (L-SwiGLU-T) for indoor localization in NLOS environments that reduces computational complexity while improving accuracy.

DetailsMotivation: Traditional indoor localization methods struggle in NLOS environments, and while deep learning helps, existing approaches are computationally intensive and unsuitable for resource-limited devices. Transformer models show promise but are computationally heavy for localization tasks.

Method: Two key innovations: 1) Sensor Snapshot Tokenization (SST) that preserves variable-specific representations of power delay profile and captures multi-variate correlations, and 2) L-SwiGLU-T, a lightweight Transformer using Swish-Gated Linear Units to reduce computational complexity.

Result: The proposed approach achieves over 40% better accuracy than larger Transformer and CNN baselines while using significantly fewer FLOPs and training samples, validated on both simulated and real-world datasets.

Conclusion: The combination of SST tokenization and L-SwiGLU-T Transformer makes Transformer models efficient and suitable for resource-constrained indoor localization, addressing both computational burden and dataset dependency challenges.

Abstract: Indoor localization in challenging non-line-of-sight (NLOS) environments often leads to poor accuracy with traditional approaches. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, especially for floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers for indoor localization can be both computationally intensive and exhibit limitations in accuracy. To address these challenges, in this work, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of power delay profile (PDP) and enhances attention mechanisms by effectively capturing multi-variate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU-T) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. Experimental results on simulated and real-world datasets demonstrate that SST and L-SwiGLU-T achieve substantial accuracy and efficiency gains, outperforming larger Transformer and CNN baselines by over 40% while using significantly fewer FLOPs and training samples.
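
For reference, the standard SwiGLU feed-forward unit that the L-SwiGLU-T name points to looks like this (the paper's lightweight variant may differ in details):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Standard Swish-Gated Linear Unit feed-forward block."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```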

[285] Advancing Generative Artificial Intelligence and Large Language Models for Demand Side Management with Internet of Electric Vehicles

Hanwen Zhang, Ruichen Zhang, Wei Zhang, Dusit Niyato, Yonggang Wen, Chunyan Miao

Main category: cs.LG

TL;DR: LLMs enhance IoT-enabled microgrid energy optimization and demand side management, particularly for electric vehicle charging scheduling, using retrieval-augmented generation for automated problem formulation and code generation.

DetailsMotivation: To transform energy optimization and demand side management in IoT-enabled microgrids using generative AI/LLMs, addressing challenges in automating DSM strategies with a focus on Internet of Electric Vehicles as a representative IoV application.

Method: Integration of LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization strategies for DSM in IoT-enabled microgrids.

Result: Demonstrated effectiveness in charging scheduling and optimization for electric vehicles, showing significant advancements in energy efficiency and user adaptability.

Conclusion: LLMs have strong potential for energy optimization in IoT-enabled microgrids and can promote intelligent DSM solutions, particularly for electric vehicle integration.

Abstract: The energy optimization and demand side management (DSM) of Internet of Things (IoT)-enabled microgrids are being transformed by generative artificial intelligence, such as large language models (LLMs). This paper explores the integration of LLMs into energy management, and emphasizes their roles in automating the optimization of DSM strategies with Internet of Electric Vehicles (IoEV) as a representative example of the Internet of Vehicles (IoV). We investigate challenges and solutions associated with DSM and explore the new opportunities presented by leveraging LLMs. Then, we propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization. The results demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles, and highlight our solution’s significant advancements in energy efficiency and user adaptability. This work shows LLMs’ potential in energy optimization of the IoT-enabled microgrids and promotes intelligent DSM solutions.

[286] Accelerating Training Speed of Tiny Recursive Models with Curriculum Guided Adaptive Recursion

Kaleem Ullah Qasim, Jiashu Zhang

Main category: cs.LG

TL;DR: CGAR introduces curriculum learning to recursive reasoning models, dynamically adjusting recursion depth and supervision weighting to achieve 1.71x training speedup with minimal accuracy loss.

DetailsMotivation: Existing recursive reasoning models suffer from expensive training (36 GPU-hours for Sudoku extreme), fixed recursion depth, and uniform supervision weighting, leading to inefficient training.

Method: Progressive Depth Curriculum (PDC) dynamically adjusts recursion depth through three-stage schedule, and Hierarchical Supervision Weighting (HSW) applies exponential decay to supervision steps.

Result: CGAR achieves 1.71x training speedup (10.93 to 6.38 hours) with only 0.63% accuracy drop on Sudoku-Extreme, plus 100% halting accuracy and 11% fewer reasoning steps at inference.

Conclusion: CGAR enables efficient training of recursive models on modest hardware by treating depth as a scheduled parameter, making these models practical for neurosymbolic AI and program synthesis.

Abstract: Background: Recursive reasoning models achieve strong performance through iterative refinement, allowing small networks to match large language models. However, training is computationally expensive, often requiring 36 GPU-hours for Sudoku extreme. Existing models use fixed recursion depth and uniform supervision weighting, leading to inefficient training. Objectives: We propose CGAR (Curriculum-Guided Adaptive Recursion), applying curriculum learning to architectural depth. CGAR introduces Progressive Depth Curriculum (PDC) to dynamically adjust recursion depth and Hierarchical Supervision Weighting (HSW) to apply exponentially decaying importance to supervision steps. Methods: PDC implements a three-stage schedule transitioning from shallow (2, 1) to full depth (6, 3) configurations, providing 41.4% FLOPs reduction. HSW applies exponential decay to supervision steps, achieving 40% gradient variance reduction and accelerated convergence. Results: On Sudoku-Extreme, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours) with only a 0.63% accuracy drop (86.65% to 86.02%). PDC alone achieves 2.26x speedup with 85.47% accuracy, showing a Pareto improvement in efficiency and quality. HSW provides 1.61x speedup. CGAR-trained models show superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Conclusions: CGAR enables efficient training of recursive models on modest hardware. By treating depth as a scheduled parameter, it achieves substantial savings and prevents overfitting, making these models practical for neurosymbolic AI and program synthesis. https://github.com/Kaleemullahqasim/CGAR and huggingface.co/Kaleemullah/trm-cgar-sudoku.
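
The two mechanisms reduce to small schedules; a sketch where only the endpoint depths (2, 1) → (6, 3) come from the paper, while the stage boundaries and decay rate are illustrative assumptions:

```python
def depth_schedule(progress: float):
    """Three-stage depth curriculum; `progress` is the fraction of training done."""
    if progress < 1 / 3:
        return (2, 1)        # shallow start
    if progress < 2 / 3:
        return (4, 2)        # assumed intermediate stage
    return (6, 3)            # full-depth configuration

def supervision_weights(n_steps: int, decay: float = 0.5):
    """Exponentially decaying, normalized weights over deep-supervision steps."""
    w = [decay ** t for t in range(n_steps)]
    total = sum(w)
    return [x / total for x in w]
```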

[287] HopCast: Calibration of Autoregressive Dynamics Models

Muhammad Bilal Shahid, Cody Fleming

Main category: cs.LG

TL;DR: HOP introduces a Predictor-Corrector approach using Modern Hopfield Networks to learn errors of deterministic predictors for dynamical systems, producing sharper and better-calibrated multi-step prediction intervals compared to baselines.

DetailsMotivation: Deep learning models trained for dynamical systems often produce calibrated one-step predictions but need sound uncertainty propagation for multi-step autoregressive predictions. Existing methods lack proper uncertainty propagation during inference.

Method: HOP uses a Predictor-Corrector approach: a deterministic Predictor approximates the dynamical system, while a Corrector using Modern Hopfield Networks learns to predict errors based on context states during autoregression, creating error sets for uncertainty quantification.

Result: The method produces sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. It’s the first to benchmark existing uncertainty propagation methods based on calibration errors.

Conclusion: HOP provides an effective uncertainty propagation method for multi-step predictions in dynamical systems, improving calibration and predictive accuracy while establishing a new benchmark for uncertainty propagation evaluation.

Abstract: Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. Many of these models are optimized to predict one step ahead; such approaches produce calibrated one-step predictions if the predictive model can quantify uncertainty, such as Deep Ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method to produce calibrated multi-step predictions. This work introduces an alternative Predictor-Corrector approach named \hop{} that uses Modern Hopfield Networks (MHN) to learn the errors of a deterministic Predictor that approximates the dynamical system. The Corrector predicts a set of errors for the Predictor’s output based on a context state at any timestep during autoregression. The set of errors creates sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. The calibration and prediction performances are evaluated across a set of dynamical systems. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors.
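
The Corrector's core operation is a one-step modern-Hopfield (attention-style) retrieval of stored Predictor errors; a sketch with illustrative tensor names:

```python
import torch

def retrieve_errors(query: torch.Tensor, keys: torch.Tensor,
                    errors: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """One-step modern Hopfield update: softmax(beta * q K^T) V.
    `keys` hold context states seen in training and `errors` the Predictor
    residuals observed at those states; the rollout corrects each prediction
    with the retrieved errors."""
    attn = torch.softmax(beta * query @ keys.T, dim=-1)
    return attn @ errors
```

Retrieving a set of errors, rather than a point estimate, is what yields the multi-step prediction intervals.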

[288] Fast Adaptive Anti-Jamming Channel Access via Deep Q Learning and Coarse-Grained Spectrum Prediction

Jianshu Zhang, Xiaofu Wu, Junquan Hu

Main category: cs.LG

TL;DR: Fast adaptive anti-jamming channel access approach using coarse-grained spectrum prediction as auxiliary task for DQN, reducing training episodes by 70% and improving throughput by 10% over Nash equilibrium strategies.

DetailsMotivation: Traditional fixed-pattern channel hopping is ineffective against dynamic jamming attacks, and while DRL-based approaches can achieve Nash equilibrium, they require extensive training episodes. Need for faster adaptation to dynamic jamming environments.

Method: Proposes a fast adaptive anti-jamming channel access approach guided by “learning faster than the jammer” intuition. Uses synchronously updated coarse-grained spectrum prediction as an auxiliary task for DQN-based anti-jamming model to help identify superior Q-function.

Result: Significantly accelerates convergence rate in model training, reducing required training episodes by up to 70% compared to standard DRL. Achieves 10% improvement in throughput over Nash equilibrium strategies through effective use of coarse-grained spectrum prediction.

Conclusion: The proposed approach effectively addresses the slow training problem in DRL-based anti-jamming methods while maintaining superior performance against dynamic jamming attacks through auxiliary spectrum prediction tasks.

Abstract: This paper investigates the anti-jamming channel access problem in complex and unknown jamming environments, where the jammer could dynamically adjust its strategies to target different channels. Traditional channel hopping anti-jamming approaches using fixed patterns are ineffective against such dynamic jamming attacks. Although the emerging deep reinforcement learning (DRL) based dynamic channel access approach could achieve the Nash equilibrium (NE) under fast-changing jamming attacks, it requires extensive training episodes. To address this issue, we propose a fast adaptive anti-jamming channel access approach guided by the intuition of “learning faster than the jammer”, where a synchronously updated coarse-grained spectrum prediction serves as an auxiliary task for the deep Q network (DQN) based anti-jamming model. This helps the model identify a superior Q-function compared to standard DRL while significantly reducing the number of training episodes. Numerical results indicate that the proposed approach significantly accelerates the rate of convergence in model training, reducing the required training episodes by up to 70% compared to standard DRL. Additionally, it achieves a 10% improvement in throughput over NE strategies, owing to the effective use of coarse-grained spectrum prediction.
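
The auxiliary-task construction amounts to a weighted two-term loss; a sketch where the weighting `lam` and the binary channel-occupancy encoding are assumptions:

```python
import torch.nn.functional as F

def anti_jamming_loss(q_pred, q_target, spec_logits, spec_next, lam: float = 0.5):
    """DQN TD loss plus a coarse-grained spectrum-prediction auxiliary term.
    `spec_next` holds next-step coarse channel occupancies as floats in {0, 1}."""
    td_loss = F.smooth_l1_loss(q_pred, q_target)
    aux_loss = F.binary_cross_entropy_with_logits(spec_logits, spec_next)
    return td_loss + lam * aux_loss
```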

[289] Bias-variance decompositions: the exclusive privilege of Bregman divergences

Tom Heskes

Main category: cs.LG

TL;DR: The paper proves that only g-Bregman divergences (which can be transformed into standard Bregman divergences) admit clean bias-variance decompositions, explaining why common metrics like 0-1 and L1 losses fail to have such decompositions.

DetailsMotivation: Bias-variance decompositions are important for understanding model generalization, but most loss functions beyond squared error either don't sum bias and variance to expected loss or lack meaningful properties. While recent work showed Bregman divergences allow clean decompositions, the necessary and sufficient conditions remained unknown.

Method: Study continuous, nonnegative loss functions satisfying identity of indiscernibles under mild regularity conditions. Prove that g-Bregman (rho-tau) divergences are the only such loss functions with clean bias-variance decomposition. Show these can be transformed into standard Bregman divergences via invertible variable changes.

Result: Only g-Bregman divergences admit clean bias-variance decompositions. Squared Mahalanobis distance (up to variable transformation) is the only symmetric loss function with such decomposition. Common metrics like 0-1 and L1 losses cannot have clean decompositions, explaining previous failures.

Conclusion: The paper provides a complete characterization of loss functions permitting clean bias-variance decompositions, resolving an open question about necessary and sufficient conditions and explaining limitations of previous decomposition attempts for various loss functions.

Abstract: Bias-variance decompositions are widely used to understand the generalization performance of machine learning models. While the squared error loss permits a straightforward decomposition, other loss functions - such as zero-one loss or $L_1$ loss - either fail to sum bias and variance to the expected loss or rely on definitions that lack the essential properties of meaningful bias and variance. Recent research has shown that clean decompositions can be achieved for the broader class of Bregman divergences, with the cross-entropy loss as a special case. However, the necessary and sufficient conditions for these decompositions remain an open question. In this paper, we address this question by studying continuous, nonnegative loss functions that satisfy the identity of indiscernibles (zero loss if and only if the two arguments are identical), under mild regularity conditions. We prove that so-called $g$-Bregman or rho-tau divergences are the only such loss functions that have a clean bias-variance decomposition. A $g$-Bregman divergence can be transformed into a standard Bregman divergence through an invertible change of variables. This makes the squared Mahalanobis distance, up to such a variable transformation, the only symmetric loss function with a clean bias-variance decomposition. Consequently, common metrics such as $0$-$1$ and $L_1$ losses cannot admit a clean bias-variance decomposition, explaining why previous attempts have failed. We also examine the impact of relaxing the restrictions on the loss functions and how this affects our results.
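
For orientation, the "clean" decomposition being characterized has the following known form from prior work on Bregman losses (bias, variance, and noise all measured in the divergence itself); the paper's contribution is proving that g-Bregman divergences are the only losses admitting it:

```latex
% Bregman divergence generated by a strictly convex F:
B_F(u, v) = F(u) - F(v) - \langle \nabla F(v),\, u - v \rangle .

% Clean three-term decomposition, with central label \bar{Y} = \mathbb{E}[Y]
% and central prediction \bar{y} = (\nabla F)^{-1}\big(\mathbb{E}_D[\nabla F(\hat{y})]\big):
\mathbb{E}_{Y,D}\big[ B_F(Y, \hat{y}) \big]
  = \underbrace{\mathbb{E}_Y\big[ B_F(Y, \bar{Y}) \big]}_{\text{noise}}
  + \underbrace{B_F(\bar{Y}, \bar{y})}_{\text{bias}}
  + \underbrace{\mathbb{E}_D\big[ B_F(\bar{y}, \hat{y}) \big]}_{\text{variance}} .

% Squared loss is the special case F(u) = \|u\|^2, where \bar{y} = \mathbb{E}_D[\hat{y}].
```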

[290] Communication-Efficient and Differentially Private Vertical Federated Learning with Zeroth-Order Optimization

Jianing Zhang, Evan Chen, Dong-Jun Han, Chaoyue Liu, Christopher G. Brinton

Main category: cs.LG

TL;DR: DPZV: A communication-efficient differentially private zeroth-order VFL framework with tunable privacy guarantees that reduces variance amplification while providing strong privacy protection.

DetailsMotivation: Vertical Federated Learning (VFL) suffers from significant communication overhead and privacy risks due to device-server information exchange. Downlink communication exposes gradient-related signals that can be exploited in inference attacks. Existing DP-based VFL approaches degrade gradient quality, slow convergence, and require excessive communication rounds.

Method: DPZV uses zeroth-order (ZO) optimization and injects calibrated scalar-valued differential privacy noise on the downlink. This approach significantly reduces variance amplification while providing equivalent protection against targeted inference attacks. The framework establishes tunable privacy guarantees through $(ε, δ)$-DP.

Result: DPZV achieves superior privacy-utility tradeoff and requires fewer communication rounds than existing DP-VFL baselines under strict privacy constraints ($ε≤10$). Theoretical analysis shows convergence guarantees comparable to first-order DP-SGD despite relying solely on ZO estimators.

Conclusion: DPZV provides a communication-efficient and differentially private VFL solution that addresses both privacy risks and communication overhead while maintaining competitive convergence performance compared to first-order methods.

Abstract: Vertical Federated Learning (VFL) enables collaborative model training across feature-partitioned devices, yet its reliance on device-server information exchange introduces significant communication overhead and privacy risks. Downlink communication from the server to devices in VFL exposes gradient-related signals of the global loss that can be leveraged in inference attacks. Existing privacy-preserving VFL approaches that inject differential privacy (DP) noise on the downlink have the natural repercussion of degraded gradient quality, slowed convergence, and excessive communication rounds. In this work, we propose DPZV, a communication-efficient and differentially private ZO-VFL framework with tunable privacy guarantees. Based on zeroth-order (ZO) optimization, DPZV injects calibrated scalar-valued DP noise on the downlink, significantly reducing variance amplification while providing equivalent protection against targeted inference attacks. Through rigorous theoretical analysis, we establish convergence guarantees comparable to first-order DP-SGD, despite relying solely on ZO estimators, and prove that DPZV satisfies $(ε, δ)$-DP. Extensive experiments demonstrate that DPZV consistently achieves a superior privacy-utility tradeoff and requires fewer communication rounds than existing DP-VFL baselines under strict privacy constraints ($ε\leq 10$).
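
The central trick is that a two-point zeroth-order estimate exchanges only one scalar per query, so DP noise can be added to that scalar; a sketch (calibrating the noise scale to a concrete (ε, δ) budget is omitted):

```python
import numpy as np

def dp_zo_gradient(f, x: np.ndarray, mu: float = 1e-3, sigma: float = 0.1,
                   rng=None) -> np.ndarray:
    """Two-point zeroth-order gradient estimate with scalar Gaussian DP noise.
    Only the noised scalar difference would cross the device-server link."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                         # random probe direction
    delta = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
    delta += sigma * rng.standard_normal()         # calibrated scalar noise
    return delta * u
```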

[291] Sparse Hyperparametric Itakura-Saito Nonnegative Matrix Factorization via Bi-Level Optimization

Laura Selicato, Flavia Esposito, Andersen Ang, Nicoletta Del Buono, Rafal Zdunek

Main category: cs.LG

TL;DR: SHINBO: A bi-level optimization algorithm for automatic hyperparameter tuning in IS-NMF, improving sparse periodic signal extraction in noisy environments.

DetailsMotivation: Penalty hyperparameter selection is critical in NMF as it controls the trade-off between reconstruction accuracy and constraint adherence. Manual tuning is difficult, especially for Itakura-Saito divergence NMF which is effective for extracting low spectral density components but requires proper sparsity constraints.

Method: SHINBO introduces a bi-level optimization framework that automatically and adaptively tunes row-dependent penalty hyperparameters for IS-NMF. This enhances the algorithm’s ability to isolate sparse, periodic signals in noisy environments.

Result: SHINBO achieves accurate spectral decompositions and demonstrates superior performance in both synthetic and real-world applications. It’s particularly effective for noninvasive vibration-based fault detection in rolling bearings where desired signals are obscured by stronger, spectrally broader noise.

Conclusion: By addressing the critical issue of hyperparameter selection, SHINBO improves the state-of-the-art in signal recovery for complex, noise-dominated environments, offering an automated solution for sparse signal extraction in challenging acoustic applications.

Abstract: The selection of penalty hyperparameters is a critical aspect in Nonnegative Matrix Factorization (NMF), since these values control the trade-off between reconstruction accuracy and adherence to desired constraints. In this work, we focus on an NMF problem involving the Itakura-Saito (IS) divergence, which is particularly effective for extracting low spectral density components from spectrograms of mixed signals, and benefits from the introduction of sparsity constraints. We propose a new algorithm called SHINBO, which introduces a bi-level optimization framework to automatically and adaptively tune the row-dependent penalty hyperparameters, enhancing the ability of IS-NMF to isolate sparse, periodic signals in noisy environments. Experimental results demonstrate that SHINBO achieves accurate spectral decompositions and demonstrates superior performance in both synthetic and real-world applications. In the latter case, SHINBO is particularly useful for noninvasive vibration-based fault detection in rolling bearings, where the desired signal components often reside in high-frequency subbands but are obscured by stronger, spectrally broader noise. By addressing the critical issue of hyperparameter selection, SHINBO improves the state-of-the-art in signal recovery for complex, noise-dominated environments.
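
For reference, the Itakura-Saito divergence at the heart of the objective takes a few lines; its scale invariance is what makes it sensitive to low-power spectral components:

```python
import numpy as np

def itakura_saito(x: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Summed elementwise IS divergence d_IS(x || y) = x/y - log(x/y) - 1.
    Note d_IS(c*x || c*y) = d_IS(x || y): low- and high-power components
    are penalized on equal footing."""
    r = (x + eps) / (y + eps)
    return float(np.sum(r - np.log(r) - 1.0))
```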

[292] When Unsupervised Domain Adaptation meets One-class Anomaly Detection: Addressing the Two-fold Unsupervised Curse by Leveraging Anomaly Scarcity

Nesryne Mejri, Enjie Ghorbel, Anis Kacem, Pavel Chernakov, Niki Foteinopoulou, Djamila Aouada

Main category: cs.LG

TL;DR: First fully unsupervised domain adaptation framework for anomaly detection that addresses the “two-fold unsupervised curse” by aligning source normal features with the dominant cluster in target data.

DetailsMotivation: Anomaly detection performance degrades with domain shifts in real-world settings. While UDA works for classification, it's ill-posed for anomaly detection due to the unsupervised nature of both tasks (two-fold unsupervised curse).

Method: Assumes anomalies are rare, uses clustering to identify dominant normal cluster in target data, aligns source normal features with this cluster via hypersphere fitting. Works with one-class source set and unlabeled target set containing mostly normal data.

Result: Extensive experiments on common adaptation benchmarks demonstrate the relevance of the new paradigm and effectiveness of the proposed approach.

Conclusion: Pioneering solution to the previously intractable two-fold unsupervised curse in anomaly detection domain adaptation, with code to be made publicly available.

Abstract: This paper introduces the first fully unsupervised domain adaptation (UDA) framework for unsupervised anomaly detection (UAD). The performance of UAD techniques degrades significantly in the presence of a domain shift, difficult to avoid in a real-world setting. While UDA has contributed to solving this issue in binary and multi-class classification, such a strategy is ill-posed in UAD. This might be explained by the unsupervised nature of the two tasks, namely, domain adaptation and anomaly detection. Herein, we first formulate this problem that we call the two-fold unsupervised curse. Then, we propose a pioneering solution to this curse, considered intractable so far, by assuming that anomalies are rare. Specifically, we leverage clustering techniques to identify a dominant cluster in the target feature space. Posed as the normal cluster, the latter is aligned with the source normal features. Concretely, given a one-class source set and an unlabeled target set composed mostly of normal data and some anomalies, we fit the source features within a hypersphere while jointly aligning them with the features of the dominant cluster from the target set. The paper provides extensive experiments and analysis on common adaptation benchmarks for anomaly detection, demonstrating the relevance of both the newly introduced paradigm and the proposed approach. The code will be made publicly available.
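
The anomaly-scarcity assumption turns into a concrete recipe: cluster the unlabeled target features and treat the largest cluster as normal. A sketch (the alignment loss itself follows the paper's hypersphere formulation, summarized in the comment):

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_cluster_center(target_feats: np.ndarray, n_clusters: int = 3):
    """Return the center of the largest cluster in the target feature space,
    posed as the 'normal' cluster because anomalies are assumed rare."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(target_feats)
    labels, counts = np.unique(km.labels_, return_counts=True)
    return km.cluster_centers_[labels[np.argmax(counts)]]

# Training sketch: fit source (one-class) features inside a hypersphere while
# jointly aligning them with the dominant target cluster identified above.
```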

[293] GLADMamba: Unsupervised Graph-Level Anomaly Detection Powered by Selective State Space Model

Yali Fu, Jindong Li, Qi Wang, Qianli Xing

Main category: cs.LG

TL;DR: GLADMamba introduces selective state space models (Mamba) to unsupervised graph-level anomaly detection, using View-Fused Mamba and Spectrum-Guided Mamba modules to capture long-range dependencies and spectral information with linear complexity.

DetailsMotivation: Existing unsupervised graph-level anomaly detection methods struggle with capturing long-range dependencies efficiently and neglect spectral information, which is crucial for domains like social network analysis, drug discovery, and toxic molecule identification.

Method: Proposes GLADMamba framework with two key modules: 1) View-Fused Mamba (VFM) with Mamba-Transformer architecture to fuse information from different graph views using selective state mechanism, and 2) Spectrum-Guided Mamba (SGM) using Rayleigh quotient to guide embedding refinement with spectral information.

Result: Extensive experiments on 12 real-world datasets show GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in unsupervised graph-level anomaly detection.

Conclusion: GLADMamba successfully adapts selective state space models to graph anomaly detection, being the first to introduce Mamba and explicit spectral information to this field, demonstrating effectiveness in capturing long-range dependencies and spectral information with linear complexity.

Abstract: Unsupervised graph-level anomaly detection (UGLAD) is a critical and challenging task across various domains, such as social network analysis, anti-cancer drug discovery, and toxic molecule identification. However, existing methods often struggle to capture long-range dependencies efficiently and neglect the spectral information. Recently, selective state space models, particularly Mamba, have demonstrated remarkable advantages in capturing long-range dependencies with linear complexity and a selection mechanism. Motivated by their success across various domains, we propose GLADMamba, a novel framework that adapts the selective state space model into UGLAD field. We design a View-Fused Mamba (VFM) module with a Mamba-Transformer-style architecture to efficiently fuse information from different graph views with a selective state mechanism. We also design a Spectrum-Guided Mamba (SGM) module with a Mamba-Transformer-style architecture to leverage the Rayleigh quotient to guide the embedding refinement process, considering the spectral information for UGLAD. GLADMamba can dynamically focus on anomaly-related information while discarding irrelevant information for anomaly detection. To the best of our knowledge, this is the first work to introduce Mamba and explicit spectral information to UGLAD. Extensive experiments on 12 real-world datasets demonstrate that GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in UGLAD. The code is available at https://github.com/Yali-Fu/GLADMamba.
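
The spectral signal used by SGM is the Rayleigh quotient, computable without any eigendecomposition; a sketch on a dense adjacency matrix:

```python
import numpy as np

def rayleigh_quotient(adj: np.ndarray, x: np.ndarray) -> float:
    """x^T L x / x^T x with the graph Laplacian L = D - A; larger values mean
    x varies more across edges (more high-frequency spectral energy)."""
    L = np.diag(adj.sum(axis=1)) - adj
    return float(x @ L @ x) / float(x @ x)
```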

[294] Clustering with Communication: A Variational Framework for Single Cell Representation Learning

Cong Qi, Yeqing Chen, Zhi Wei

Main category: cs.LG

TL;DR: CCCVAE is a variational autoencoder that incorporates cell-cell communication signals into single-cell representation learning, improving clustering performance over standard VAEs.

DetailsMotivation: While scRNA-seq reveals cellular heterogeneity, understanding biological function requires modeling cell-cell communication (CCC) mediated by ligand-receptor pairs. Current tools show CCC is critical for processes like differentiation and immune response, and transcriptomic data inherently contains intercellular signaling information.

Method: CCCVAE is a novel variational autoencoder framework that incorporates CCC signals using a communication-aware kernel derived from ligand-receptor interactions and a sparse Gaussian process. Unlike conventional VAEs that treat cells independently, CCCVAE encodes biologically informed priors into the latent space to reflect both transcriptional similarity and intercellular signaling context.

Result: Empirical results across four scRNA-seq datasets show CCCVAE improves clustering performance, achieving higher evaluation scores than standard VAE baselines.

Conclusion: This work demonstrates the value of embedding biological priors (specifically CCC signals) into deep generative models for unsupervised single-cell analysis, moving beyond independent cell representations to incorporate intercellular signaling context.

Abstract: Single-cell RNA sequencing (scRNA-seq) has revealed complex cellular heterogeneity, but recent studies emphasize that understanding biological function also requires modeling cell-cell communication (CCC), the signaling interactions mediated by ligand-receptor pairs that coordinate cellular behavior. Tools like CellChat have demonstrated that CCC plays a critical role in processes such as cell differentiation, tissue regeneration, and immune response, and that transcriptomic data inherently encodes rich information about intercellular signaling. We propose CCCVAE, a novel variational autoencoder framework that incorporates CCC signals into single-cell representation learning. By leveraging a communication-aware kernel derived from ligand-receptor interactions and a sparse Gaussian process, CCCVAE encodes biologically informed priors into the latent space. Unlike conventional VAEs that treat each cell independently, CCCVAE encourages latent embeddings to reflect both transcriptional similarity and intercellular signaling context. Empirical results across four scRNA-seq datasets show that CCCVAE improves clustering performance, achieving higher evaluation scores than standard VAE baselines. This work demonstrates the value of embedding biological priors into deep generative models for unsupervised single-cell analysis.

[295] Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation

Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider

Main category: cs.LG

TL;DR: A novel consistency distillation method for offline RL that achieves single-step sampling with higher reward trajectories and 142x speedup over diffusion models.

DetailsMotivation: Diffusion models in decision-making have slow inference speeds, and existing consistency model approaches either suffer from suboptimal demonstrations under behavior cloning or require complex multi-network training in actor-critic frameworks.

Method: Proposes consistency distillation for offline RL that directly incorporates reward optimization into distillation process, using decoupled training and noise-free reward signals to achieve single-step sampling.

Result: Achieves 9.7% improvement over previous state-of-the-art and up to 142x speedup over diffusion counterparts in inference time on Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks.

Conclusion: The method successfully addresses the inference speed limitation of diffusion models in decision-making while improving performance through reward-optimized consistency distillation.

Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.
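
One plausible reading of "reward optimization inside the distillation process" is a two-term objective; the following is purely a guess at that form, with all names hypothetical:

```python
def reward_aware_distill_loss(student, teacher, x_t, t, reward_model, lam=0.1):
    """Hypothetical combined objective: match the teacher's consistency target
    while nudging single-step outputs toward higher predicted reward."""
    target = teacher(x_t, t).detach()          # frozen teacher, decoupled training
    pred = student(x_t, t)                     # single-step student sample
    distill = ((pred - target) ** 2).mean()
    reward = reward_model(pred).mean()         # noise-free reward signal
    return distill - lam * reward
```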

[296] A new machine learning framework for occupational accidents forecasting with safety inspections integration

Aho Yapi, Pierre Latouche, Arnaud Guillin, Yan Bailly

Main category: cs.LG

TL;DR: A framework for short-term occupational accident forecasting using safety inspection data modeled as binary time series, with LSTM networks achieving best performance (0.86 balanced accuracy) for detecting high-risk periods.

DetailsMotivation: To convert routine safety inspection data into actionable weekly risk scores that help prevent occupational accidents before they occur, enabling proactive safety management and better resource allocation.

Method: Models accident occurrences as binary time series, uses sliding-window cross-validation for time series data, compares multiple ML algorithms (logistic regression, tree-based models, neural networks), and aggregates daily predictions into weekly safety assessments.

Result: LSTM networks outperform other approaches with 0.86 balanced accuracy in detecting upcoming high-risk periods, demonstrating that binary time series models can effectively anticipate critical safety periods based on inspection data.

Conclusion: The methodology successfully converts safety inspection data into clear weekly risk scores that decision-makers can integrate into planning tools to prioritize inspections, schedule interventions, and allocate resources to highest-risk sites/shifts before incidents occur.

Abstract: We propose a generic framework for short-term occupational accident forecasting that leverages safety inspections and models accident occurrences as binary time series. The approach generates daily predictions, which are then aggregated into weekly safety assessments to better inform decision making. To ensure the reliability and operational applicability of the forecasts, we apply a sliding-window cross-validation procedure specifically designed for time series data, combined with an evaluation based on aggregated period-level metrics. Several machine learning algorithms, including logistic regression, tree-based models, and neural networks, are trained and systematically compared within this framework. Among these, the long short-term memory (LSTM) network outperforms the alternatives and detects the upcoming high-risk periods with a balanced accuracy of 0.86, confirming the robustness of our methodology and demonstrating that a binary time series model can anticipate these critical periods based on safety inspections. The proposed methodology converts routine safety inspection data into clear weekly risk scores, detecting the periods when accidents are most likely. Decision-makers can integrate these scores into their planning tools to classify inspection priorities, schedule targeted interventions, and funnel resources to the sites or shifts classified as highest risk, stepping in before incidents occur and getting the greatest return on safety investments.
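
The evaluation hinges on splits that respect time; a minimal sliding-window splitter (window lengths are illustrative parameters):

```python
import numpy as np

def sliding_window_splits(n_days: int, train_len: int, test_len: int, step: int):
    """Yield (train, test) index windows in temporal order so every fold
    forecasts strictly future days; daily predictions can then be aggregated
    into weekly risk scores per test window."""
    start = 0
    while start + train_len + test_len <= n_days:
        yield (np.arange(start, start + train_len),
               np.arange(start + train_len, start + train_len + test_len))
        start += step
```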

[297] The Primacy of Magnitude in Low-Rank Adaptation

Zicheng Zhang, Haoran Li, Yifeng Zhang, Guoqiang Gong, Jiaxing Wang, Junxing Hu, Pengzhang Liu, Qixia Jiang

Main category: cs.LG

TL;DR: LoRAM: A magnitude-driven initialization scheme for LoRA that matches spectral methods’ performance without their computational overhead by using pretrained weight magnitudes to scale orthogonal bases.

DetailsMotivation: Spectral initialization methods for LoRA improve convergence and performance but introduce extra computational and storage overhead that undermines LoRA's parameter-efficient paradigm. The paper aims to understand the fundamental driver of LoRA performance and develop an efficient initialization scheme.

Method: Proposes LoRAM, a “Basis & Basis” initialization scheme that scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. The method is based on the insight that update magnitude is the fundamental driver of LoRA performance, and spectral initialization works by amplifying update magnitude.

Result: Extensive experiments show LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks. The method achieves spectral-like performance without the computational overhead.

Conclusion: Update magnitude is the key factor determining LoRA convergence and performance. LoRAM provides an efficient initialization strategy that leverages pretrained weight magnitudes to achieve spectral method benefits while maintaining LoRA’s parameter efficiency.

Abstract: Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive “Noise & Zeros” scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven “Basis & Basis” initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of the spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
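
A sketch of what a magnitude-driven "Basis & Basis" initialization could look like; the choice of basis and the use of column norms of the frozen weight as magnitude scales are assumptions, not the paper's exact construction:

```python
import torch

def loram_like_init(w: torch.Tensor, r: int):
    """Scale deterministic orthogonal bases by pretrained weight magnitudes.
    delta-W = B @ A starts at zero (B = 0), matching standard LoRA."""
    d_out, d_in = w.shape
    A = torch.eye(d_in)[:r, :]                  # fixed orthonormal rows
    A = A * w.norm(dim=0)[:r].unsqueeze(1)      # magnitude-driven scaling (assumed)
    B = torch.zeros(d_out, r)
    return A, B
```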

[298] PhysicsCorrect: A Training-Free Approach for Stable Neural PDE Simulations

Xinquan Huang, Paris Perdikaris

Main category: cs.LG

TL;DR: PhysicsCorrect is a training-free correction framework that prevents error accumulation in neural PDE solvers by enforcing PDE consistency at each prediction step through linearized inverse problems, achieving 100x error reduction with minimal inference overhead.

DetailsMotivation: Neural network PDE surrogates suffer from error accumulation during long-term rollouts, where small inaccuracies compound exponentially, causing divergence from physically valid solutions. This limits their practical reliability despite computational advantages.

Method: PhysicsCorrect formulates correction as a linearized inverse problem based on PDE residuals. It uses an efficient caching strategy that precomputes the Jacobian and its pseudoinverse during an offline warm-up phase, reducing computational overhead by two orders of magnitude compared to standard correction approaches.

Result: Across three representative PDE systems (Navier-Stokes, wave equations, Kuramoto-Sivashinsky), PhysicsCorrect reduces prediction errors by up to 100x while adding negligible inference time (under 5%). The framework integrates seamlessly with diverse architectures including Fourier Neural Operators, UNets, and Vision Transformers.

Conclusion: PhysicsCorrect effectively transforms unstable neural surrogates into reliable simulation tools, bridging the gap between deep learning’s computational efficiency and the physical fidelity demanded by practical scientific applications.

Abstract: Neural networks have emerged as powerful surrogates for solving partial differential equations (PDEs), offering significant computational speedups over traditional methods. However, these models suffer from a critical limitation: error accumulation during long-term rollouts, where small inaccuracies compound exponentially, eventually causing complete divergence from physically valid solutions. We present PhysicsCorrect, a training-free correction framework that enforces PDE consistency at each prediction step by formulating correction as a linearized inverse problem based on PDE residuals. Our key innovation is an efficient caching strategy that precomputes the Jacobian and its pseudoinverse during an offline warm-up phase, reducing computational overhead by two orders of magnitude compared to standard correction approaches. Across three representative PDE systems, including Navier-Stokes fluid dynamics, wave equations, and the chaotic Kuramoto-Sivashinsky equation, PhysicsCorrect reduces prediction errors by up to 100x while adding negligible inference time (under 5%). The framework integrates seamlessly with diverse architectures, including Fourier Neural Operators, UNets, and Vision Transformers, effectively transforming unstable neural surrogates into reliable simulation tools that bridge the gap between deep learning’s computational efficiency and the physical fidelity demanded by practical scientific applications.

[299] Efficient Neural Combinatorial Optimization Solver for the Min-max Heterogeneous Capacitated Vehicle Routing Problem

Xuan Wu, Di Wang, Chunguo Wu, Kaifang Qi, Chunyan Miao, Yubin Xiao, Jian Zhang, You Zhou

Main category: cs.LG

TL;DR: ECHO is a Neural Combinatorial Optimization solver for the min-max Heterogeneous Capacitated Vehicle Routing Problem that addresses limitations of existing methods through dual-modality node encoding, parameter-free cross-attention, and tailored data augmentation.


DetailsMotivation: Most existing Neural Combinatorial Optimization solvers focus on single-vehicle VRP variants and overlook the more realistic min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP) with multiple vehicles. Current MMHCVRP solvers make myopic decoding decisions and fail to capture key properties like local topological relationships, vehicle permutation invariance, and node symmetry.

Method: 1) Dual-modality node encoder to capture local topological relationships among nodes. 2) Parameter-Free Cross-Attention mechanism to prioritize vehicles selected in previous decoding steps and mitigate myopic decisions. 3) Tailored data augmentation strategy leveraging vehicle permutation invariance and node symmetry to stabilize Reinforcement Learning training.
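
A sketch of what such symmetry-based augmentation might look like; the helper `augment_mmhcvrp` and the specific transforms (vehicle reordering, rotation, reflection) are illustrative assumptions rather than the paper's exact strategy.

```python
import numpy as np

def augment_mmhcvrp(coords, vehicle_caps, rng):
    """Symmetry-based augmentation for MMHCVRP (illustrative): permute the
    vehicle order (permutation invariance) and rotate/reflect node
    coordinates (node symmetry). Route costs are unchanged, so each
    instance yields several equivalent training views."""
    perm = rng.permutation(len(vehicle_caps))  # vehicle permutation
    theta = rng.uniform(0, 2 * np.pi)          # random rotation angle
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    new_coords = coords @ R.T
    if rng.random() < 0.5:                     # optional reflection
        new_coords[:, 0] = -new_coords[:, 0]
    return new_coords, vehicle_caps[perm]
```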

Result: ECHO outperforms state-of-the-art NCO solvers across varying numbers of vehicles and nodes, and exhibits strong generalization across both scales and distribution patterns. Ablation studies validate the effectiveness of all proposed methods.

Conclusion: ECHO provides an efficient NCO solver for MMHCVRP that addresses key limitations of existing approaches through novel architectural components and training strategies, demonstrating superior performance and generalization capabilities.

Abstract: Numerous Neural Combinatorial Optimization (NCO) solvers have been proposed to address Vehicle Routing Problems (VRPs). However, most of these solvers focus exclusively on single-vehicle VRP variants, overlooking the more realistic min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP), which involves multiple vehicles. Existing MMHCVRP solvers typically select a vehicle and its next node to visit at each decoding step, but often make myopic decoding decisions and overlook key properties of MMHCVRP, including local topological relationships, vehicle permutation invariance, and node symmetry, resulting in suboptimal performance. To better address these limitations, we propose ECHO, an efficient NCO solver. First, ECHO exploits the proposed dual-modality node encoder to capture local topological relationships among nodes. Subsequently, to mitigate myopic decisions, ECHO employs the proposed Parameter-Free Cross-Attention mechanism to prioritize the vehicle selected in the preceding decoding step. Finally, leveraging vehicle permutation invariance and node symmetry, we introduce a tailored data augmentation strategy for MMHCVRP to stabilize the Reinforcement Learning training process. To assess the performance of ECHO, we conduct extensive experiments. The experimental results demonstrate that ECHO outperforms state-of-the-art NCO solvers across varying numbers of vehicles and nodes, and generalizes well across both scales and distribution patterns. Finally, ablation studies validate the effectiveness of all proposed methods.

[300] Cost-aware Stopping for Bayesian Optimization

Qian Xie, Linda Cai, Alexander Terenin, Peter I. Frazier, Ziv Scully

Main category: cs.LG

TL;DR: Proposed a principled cost-aware stopping rule for Bayesian optimization with theoretical guarantees on cost-adjusted simple regret, outperforming heuristic approaches.

DetailsMotivation: Current Bayesian optimization stopping rules are heuristic and lack guarantees for cost-aware scenarios where expensive black-box function evaluations need to be balanced with solution quality.

Method: Developed a cost-aware stopping rule grounded in theoretical connection to state-of-the-art cost-aware acquisition functions (Pandora’s Box Gittins Index and log expected improvement per cost), with theoretical guarantees bounding expected cost-adjusted simple regret.
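
One way such a rule could look in code, under stated assumptions: the check below stops once the best expected improvement per unit cost over the candidate set no longer exceeds 1, i.e. another evaluation is expected to cost more than it gains. This threshold form is an assumption in the spirit of LogEIPC, not the paper's exact rule.

```python
import numpy as np
from scipy.stats import norm

def should_stop(mu, sigma, costs, best_y):
    """Illustrative cost-aware stopping check for minimization.

    mu, sigma: GP posterior mean/std over candidate points.
    costs: per-candidate evaluation costs. best_y: incumbent best value."""
    z = (best_y - mu) / np.maximum(sigma, 1e-12)
    ei = sigma * (z * norm.cdf(z) + norm.pdf(z))  # expected improvement
    return np.max(ei / costs) < 1.0               # gain no longer worth cost
```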

Result: The proposed stopping rule paired with PBGI or LogEIPC matches or outperforms other acquisition-function–stopping-rule pairs across synthetic and empirical tasks including hyperparameter optimization and neural architecture search.

Conclusion: Provides a principled, theoretically-grounded stopping rule for Bayesian optimization that adapts to varying evaluation costs without heuristic tuning, addressing an important practical consideration in automated ML and scientific discovery.

Abstract: In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions in a cost-aware manner is an important but underexplored practical consideration. A natural performance metric for this purpose is the cost-adjusted simple regret, which captures the trade-off between solution quality and cumulative evaluation cost. While several heuristic or adaptive stopping rules have been proposed, they lack guarantees ensuring stopping before incurring excessive function evaluation costs. We propose a principled cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs without heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora’s Box Gittins Index (PBGI) and log expected improvement per cost (LogEIPC). We prove a theoretical guarantee bounding the expected cost-adjusted simple regret incurred by our stopping rule when paired with either acquisition function. Across synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, pairing our stopping rule with PBGI or LogEIPC usually matches or outperforms other acquisition-function–stopping-rule pairs in terms of cost-adjusted simple regret.

[301] Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning

Aditya Sharma, Ananya Gupta, Chengyu Wang, Chiamaka Adebayo, Jakub Kowalski

Main category: cs.LG

TL;DR: CWMI framework embeds causal physics understanding in LLMs using a dedicated Causal Physics Module and Causal Intervention Loss, enabling better physical reasoning than standard LLMs.

DetailsMotivation: LLMs lack intuitive understanding of physical dynamics, limiting their effectiveness in real-world scenarios requiring causal reasoning. Current models rely on statistical correlations rather than true causal understanding.

Method: Introduces Causal World Model Induction (CWMI) framework with a dedicated Causal Physics Module (CPM) and Causal Intervention Loss training objective. The model learns cause-and-effect relationships from multimodal data by predicting outcomes of hypothetical interventions rather than just capturing correlations.

Result: CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including PIQA benchmark and the newly proposed PhysiCa-Bench dataset.

Conclusion: Inducing a causal world model is a critical step toward more reliable and generalizable AI systems, moving beyond statistical correlations to true causal understanding of physical laws.

Abstract: Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real-world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause-and-effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa-Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.

[302] Beyond Trade-offs: A Unified Framework for Privacy, Robustness, and Communication Efficiency in Federated Learning

Yue Xia, Tayyebeh Jahani-Nezhad, Rawad Bitar

Main category: cs.LG

TL;DR: Fed-DPRoC is a federated learning framework that simultaneously provides differential privacy, Byzantine robustness, and communication efficiency through robust-compatible compression.

DetailsMotivation: Existing federated learning systems struggle to simultaneously achieve privacy, robustness against malicious clients (Byzantine attacks), and communication efficiency. There's a need for a unified framework that can address all three challenges without compromising on any aspect.

Method: The framework introduces robust-compatible compression that reduces communication overhead without undermining robustness. The specific instantiation called RobAJoL integrates Johnson-Lindenstrauss (JL)-based compression with robust averaging mechanisms to maintain Byzantine robustness while ensuring differential privacy.
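
A minimal sketch of the JL-compression-plus-robust-averaging pipeline, assuming a Gaussian JL projection and coordinate-wise median as the robust aggregator (both illustrative choices; the paper's exact construction and aggregation rule may differ):

```python
import numpy as np

def jl_matrix(d, k, seed=0):
    """Shared random JL projection from dimension d down to k."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d)) / np.sqrt(k)

def robust_aggregate(client_updates, S):
    """Clients send compressed updates S @ g_i; the server applies a robust
    average in the compressed domain, then maps back to model space."""
    compressed = np.stack([S @ g for g in client_updates])
    robust = np.median(compressed, axis=0)  # Byzantine-robust averaging
    return np.linalg.pinv(S) @ robust       # decompress the aggregate
```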

Result: Theoretical analysis confirms JL transform compatibility with robust averaging, ensuring maintained robustness guarantees, DP satisfaction, and significant communication reduction. Empirical results on CIFAR-10, Fashion MNIST, and FEMNIST datasets show RobAJoL outperforms state-of-the-art methods in both robustness and utility under various Byzantine attacks.

Conclusion: Fed-DPRoC successfully demonstrates that robust-compatible compression can simultaneously achieve differential privacy, Byzantine robustness, and communication efficiency in federated learning, with RobAJoL showing superior performance compared to existing approaches.

Abstract: We propose Fed-DPRoC, a novel federated learning framework designed to jointly provide differential privacy (DP), Byzantine robustness, and communication efficiency. Central to our approach is the concept of robust-compatible compression, which allows reducing the bi-directional communication overhead without undermining the robustness of the aggregation. We instantiate our framework as RobAJoL, which integrates the Johnson-Lindenstrauss (JL)-based compression mechanism with robust averaging for robustness. Our theoretical analysis establishes the compatibility of JL transform with robust averaging, ensuring that RobAJoL maintains robustness guarantees, satisfies DP, and substantially reduces communication overhead. We further present simulation results on CIFAR-10, Fashion MNIST, and FEMNIST, validating our theoretical claims. We compare RobAJoL with a state-of-the-art communication-efficient and robust FL scheme augmented with DP for a fair comparison, demonstrating that RobAJoL outperforms existing methods in terms of robustness and utility under different Byzantine attacks.

[303] Surrogate Representation Inference for Text and Image Annotations

Kentaro Nakamura

Main category: cs.LG

TL;DR: SRI (Surrogate Representation Inference) is a new method that reduces standard errors by over 50% when using ML/LLM annotations for statistical analysis, even with measurement errors in human annotations.

DetailsMotivation: Existing methods for correcting bias when using ML/LLM annotations on unstructured data yield large standard errors and require error-free human annotations, which limits practical applications.

Method: Proposes SRI which assumes unstructured data fully mediate the relationship between human annotations and structured variables. Uses neural network architecture to learn low-dimensional representations satisfying the surrogate assumption. Extends to correct non-differential measurement errors when multiple human annotations are available.

Result: Simulation studies and real-world application show SRI reduces standard errors by over 50% when ML classification accuracy is moderate, and provides valid inference even with non-differential measurement errors in human annotations.

Conclusion: SRI offers a practical solution for leveraging ML/LLM annotations in statistical analysis with reduced standard errors and robustness to measurement errors, overcoming limitations of existing methods.

Abstract: As researchers increasingly rely on machine learning models and LLMs to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analysis. However, existing methods tend to yield large standard errors and require some error-free human annotation. In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables. The assumption is guaranteed by design provided that human coders rely only on unstructured data for annotation. Under this setting, I propose a neural network architecture that learns a low-dimensional representation of unstructured data such that the surrogate assumption remains satisfied. When multiple human annotations are available, SRI can be extended to further correct non-differential measurement errors that may exist in human annotations. Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametric efficient estimation strategies that enable learning and leveraging such a low-dimensional representation. Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning classification accuracy is moderate and provides valid inference even when human annotations contain non-differential measurement errors.

[304] Sparsity and Superposition in Mixture of Experts

Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson

Main category: cs.LG

TL;DR: MoE models differ mechanistically from dense networks - network sparsity (not feature sparsity/importance) drives superposition patterns, with sparser networks showing more monosemantic (interpretable) experts.

DetailsMotivation: To understand the mechanistic differences between Mixture of Experts (MoE) models and dense networks, particularly how superposition works in MoEs and whether they can enable more interpretable models without sacrificing performance.

Method: Developed new metrics for measuring superposition across experts, analyzed how network sparsity (ratio of active to total experts) characterizes MoEs, and proposed a new definition of expert specialization based on monosemantic feature representation rather than load balancing.

Result: Found that network sparsity better characterizes MoEs than feature sparsity/importance, models with greater network sparsity exhibit greater monosemanticity, and experts naturally organize around coherent feature combinations with proper initialization.

Conclusion: Network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the assumption that interpretability and capability are fundamentally opposed.

Abstract: Mixture of Experts (MoE) models have become central to scaling large language models, yet their mechanistic differences from dense networks remain poorly understood. Previous work has explored how dense models use \textit{superposition} to represent more features than dimensions, and how superposition is a function of feature sparsity and feature importance. MoE models cannot be explained mechanistically through the same lens. We find that neither feature sparsity nor feature importance cause discontinuous phase changes, and that network sparsity (the ratio of active to total experts) better characterizes MoEs. We develop new metrics for measuring superposition across experts. Our findings demonstrate that models with greater network sparsity exhibit greater \emph{monosemanticity}. We propose a new definition of expert specialization based on monosemantic feature representation rather than load balancing, showing that experts naturally organize around coherent feature combinations when initialized appropriately. These results suggest that network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the common assumption that interpretability and capability are fundamentally at odds.

[305] A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu

Main category: cs.LG

TL;DR: This paper identifies a transition from generalization to memorization during model collapse in diffusion models, proposes entropy-based data selection to mitigate it, and shows improved visual quality/diversity.

DetailsMotivation: Widespread use of diffusion models leads to AI-generated data abundance, raising concerns about model collapse where recursive training on synthetic data causes performance degradation. Prior work misses practical manifestations by focusing only on variance shrinkage or distribution shift.

Method: Identifies transition from generalization to memorization during model collapse, driven by declining entropy of synthetic training data. Proposes entropy-based data selection strategy to mitigate this transition and alleviate model collapse.
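
A minimal sketch of what entropy-based selection could look like; the paper's entropy estimator is not specified here, so the random 1-D projection and bin-frequency score below are illustrative assumptions. The intent is to drop samples from over-represented (memorized) modes first.

```python
import numpy as np

def select_high_entropy(samples, keep_frac=0.5, n_bins=32, seed=0):
    """Score each synthetic sample by the negative log frequency of its
    bin under a coarse 1-D random projection; keep the rarest ones."""
    rng = np.random.default_rng(seed)
    proj = samples @ rng.standard_normal(samples.shape[1])
    edges = np.histogram_bin_edges(proj, bins=n_bins)
    bins = np.digitize(proj, edges)
    counts = np.bincount(bins, minlength=n_bins + 2)
    scores = -np.log(counts[bins] / len(samples))  # rarer bin => higher score
    keep = np.argsort(scores)[-int(keep_frac * len(samples)):]
    return samples[keep]
```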

Result: Empirical results show the approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

Conclusion: The transition from generalization to memorization is a key practical manifestation of model collapse in diffusion models, and entropy-based data selection effectively mitigates this degradation while maintaining visual quality and diversity.

Abstract: The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse – a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

[306] MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling

Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan

Main category: cs.LG

TL;DR: MISA is a module-wise importance sampling method for memory-efficient LLM optimization that divides layers into modules, assigns importance scores, and uses weighted random sampling to activate modules, reducing gradient variance and memory usage compared to layer-wise approaches.

DetailsMotivation: Current layer-wise optimization methods for LLMs have two main limitations: they ignore varying importance of modules within each layer (leading to suboptimal performance), and they provide limited memory savings since at least one full layer must remain active during optimization.

Method: MISA divides each transformer layer into smaller modules, assigns importance scores to each module, and uses a weighted random sampling mechanism to activate modules. This approach reduces gradient variance compared to layer-wise sampling and provides theoretical convergence guarantees.
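
A sketch of the module-sampling step, assuming the hypothetical helper `sample_active_modules` and a simple inverse-probability reweighting; the paper's importance scores and exact unbiasing details are not reproduced here.

```python
import numpy as np

def sample_active_modules(importance, n_active, rng=None):
    """Weighted sampling of which modules to activate this step. Gradients
    of sampled modules are rescaled (approximately) by the inverse of
    their selection probability to keep updates roughly unbiased."""
    rng = rng or np.random.default_rng()
    p = importance / importance.sum()
    idx = rng.choice(len(p), size=n_active, replace=False, p=p)
    return idx, 1.0 / (p[idx] * len(p))  # module indices, importance weights
```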

Result: The paper establishes an O(1/√K) convergence rate under non-convex and stochastic conditions, provides detailed memory analysis showing MISA’s superiority over baselines, and validates effectiveness through experiments on diverse learning tasks.

Conclusion: MISA offers a more fine-grained and memory-efficient optimization approach for LLMs by addressing limitations of layer-wise methods through module-wise importance sampling, with theoretical guarantees and empirical validation.

Abstract: The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an $\mathcal{O}(1/\sqrt{K})$ convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA’s superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.

[307] Deterministic Discrete Denoising

Hideyuki Suzuki, Hiroshi Yamashita

Main category: cs.LG

TL;DR: Deterministic denoising algorithm for discrete diffusion models using Markov chains and herding algorithm with weakly chaotic dynamics, improving efficiency and sample quality without retraining.

DetailsMotivation: To create a deterministic alternative to stochastic denoising processes in discrete diffusion models, addressing the need for more efficient and higher-quality generation without requiring retraining or continuous embeddings.

Method: Proposes a deterministic algorithm based on Markov chains that derandomizes the generative reverse process using a variant of the herding algorithm with weakly chaotic dynamics, enabling deterministic discrete state transitions.
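
For intuition, the classic herding update for a categorical distribution is shown below (the paper's weakly chaotic variant differs in its dynamics, so this is a generic sketch): starting from `w = np.zeros_like(p)` and calling the function repeatedly yields a deterministic state sequence whose empirical frequencies track `p`.

```python
import numpy as np

def herded_transition(p, w):
    """One deterministic 'herding' draw from a categorical distribution p:
    accumulate the target mass, pick the state with the largest weight,
    then subtract the emitted unit of mass."""
    w = w + p                  # accumulate target probability mass
    state = int(np.argmax(w))  # deterministic selection
    w[state] -= 1.0            # remove the emitted unit
    return state, w
```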

Result: Demonstrates consistent improvements in both efficiency and sample quality on text and image generation tasks, showing that deterministic reverse processes can be effective in discrete state spaces.

Conclusion: This simple derandomization approach enhances the significance of discrete diffusion in generative modeling and reveals that deterministic reverse processes, established in continuous diffusion, can also work effectively in discrete spaces.

Abstract: We propose a deterministic denoising algorithm for discrete-state diffusion models based on Markov chains. The generative reverse process is derandomized by introducing a variant of the herding algorithm with weakly chaotic dynamics, which induces deterministic discrete state transitions. Our approach is a direct replacement for the stochastic denoising process, requiring neither retraining nor continuous state embeddings. We demonstrate consistent improvements in both efficiency and sample quality on text and image generation tasks. Thus, this simple derandomization approach is expected to enhance the significance of discrete diffusion in generative modeling. Furthermore, our results reveal that deterministic reverse processes, well established in continuous diffusion, can also be effective in discrete state spaces.

[308] How to Set $β_1, β_2$ in Adam: An Online Learning Perspective

Quan Nguyen

Main category: cs.LG

TL;DR: This paper provides a more complete theoretical understanding of Adam’s momentum parameters β₁ and β₂, extending prior analyses that only worked for β₁ = √β₂ to cover both β₁ ≥ √β₂ and β₁ ≤ √β₂ cases with tight worst-case bounds.

DetailsMotivation: While Adam is highly effective for training large-scale models, there's incomplete theoretical understanding of how to optimally set its momentum parameters β₁ and β₂. Prior analyses only covered the restrictive case where β₁ = √β₂, which doesn't match practical usage where β₁ ≠ √β₂.

Method: The authors derive novel, more general analyses by viewing Adam through the Follow-the-Regularized-Leader (FTRL) framework. They develop theoretical bounds that hold for both β₁ ≥ √β₂ and β₁ ≤ √β₂ cases, and prove these bounds are tight in the worst case.
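
For reference, the standard Adam update whose momentum factors the analysis concerns (bias correction omitted for brevity):

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
x_{t+1} = x_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}
```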

Result: The new analyses strictly generalize existing bounds and show that setting β₁ = √β₂ is optimal for oblivious adversaries but sub-optimal for non-oblivious adversaries. The bounds are proven to be tight in worst-case scenarios.

Conclusion: The paper provides a more complete theoretical foundation for Adam optimization, offering guidance on optimal momentum parameter settings for different adversarial scenarios and extending the practical applicability of theoretical analyses beyond the restrictive β₁ = √β₂ case.

Abstract: While Adam is one of the most effective optimizers for training large-scale machine learning models, a theoretical understanding of how to optimally set its momentum factors, $β_1$ and $β_2$, remains largely incomplete. Prior works have shown that Adam can be seen as an instance of Follow-the-Regularized-Leader (FTRL), one of the most important classes of algorithms in online learning. The prior analyses in these works required setting $β_1 = \sqrt{β_2}$, which does not cover the more practical cases with $β_1 \neq \sqrt{β_2}$. We derive novel, more general analyses that hold for both $β_1 \geq \sqrt{β_2}$ and $β_1 \leq \sqrt{β_2}$. In both cases, our results strictly generalize the existing bounds. Furthermore, we show that our bounds are tight in the worst case. We also prove that setting $β_1 = \sqrt{β_2}$ is optimal for an oblivious adversary, but sub-optimal for a non-oblivious adversary.

[309] Non-Asymptotic Analysis of Efficiency in Conformalized Regression

Yunzhen Yao, Lie He, Michael Gastpar

Main category: cs.LG

TL;DR: Non-asymptotic bounds on efficiency deviation for conformalized regression, showing phase transitions in convergence rates across miscoverage regimes.

DetailsMotivation: Prior work treats miscoverage level α as fixed constant, but efficiency of conformal prediction depends on α, training size n, and calibration size m. Need to understand joint dependence and provide guidance for data allocation.

Method: Analyze conformalized quantile and median regression trained via SGD. Establish non-asymptotic bounds on deviation of prediction set length from oracle interval length under mild data distribution assumptions.
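
For context, the standard split-conformal (CQR-style) calibration step that such SGD-trained quantile regressors plug into looks as follows; this is the textbook recipe, not anything specific to the paper's analysis.

```python
import numpy as np

def cqr_interval(lo_pred, hi_pred, lo_cal, hi_cal, y_cal, alpha):
    """Widen the fitted quantile band [lo, hi] by the (1-alpha) empirical
    quantile of calibration conformity scores to obtain coverage.

    lo_cal, hi_cal, y_cal: quantile predictions and labels on the
    calibration set; lo_pred, hi_pred: predictions at a test point."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)  # conformity scores
    m = len(y_cal)
    q = np.quantile(scores, min(1.0, np.ceil((m + 1) * (1 - alpha)) / m))
    return lo_pred - q, hi_pred + q
```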

Result: Bounds of order O(1/√n + 1/(α²n) + 1/√m + exp(-α²m)) capture joint dependence on n, m, α. Identify phase transitions in convergence rates across different α regimes, offering data allocation guidance.

Conclusion: Theoretical analysis provides non-asymptotic efficiency bounds for conformalized regression, revealing phase transitions in convergence rates. Results guide optimal data allocation between training and calibration sets to control excess prediction set length.

Abstract: Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level $α$ as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order $\mathcal{O}(1/\sqrt{n} + 1/(α^2 n) + 1/\sqrt{m} + \exp(-α^2 m))$ capture the joint dependence of efficiency on the proper training set size $n$, the calibration set size $m$, and the miscoverage level $α$. The results identify phase transitions in convergence rates across different regimes of $α$, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.

[310] BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training

Wenjie Zhou, Bohan Wang, Wei Chen, Xueqi Cheng

Main category: cs.LG

TL;DR: BSFA is a plug-and-play framework that accelerates deep learning training by differentially scaling parameter updates in Dom-space (top eigendirections) and Bulk-space (orthogonal component), achieving ~2x speedup on LLaMA models.

DetailsMotivation: Recent research shows that parameter updates along top Hessian eigendirections (Dom-space) contribute little to loss reduction despite having large magnitudes, while orthogonal updates (Bulk-space) drive most learning progress. This inefficiency in optimization motivates a method to better utilize these subspaces.

Method: BSFA differentially scales update components projected onto Dom-space and Bulk-space: moderates Dom-space updates for stability and amplifies Bulk-space updates for faster convergence. Uses PCA on historical updates for efficient subspace estimation and block-wise strategy for scalability to large models.
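
A sketch of the rescaling step under stated assumptions: the dominant subspace is estimated by PCA (here via SVD) on recent updates, the component inside it is damped, and the orthogonal remainder is amplified. The scale factors and subspace size below are illustrative, not the paper's tuned values.

```python
import numpy as np

def bsfa_step(update, history, k=8, dom_scale=0.5, bulk_scale=2.0):
    """history: list of recent update vectors (assumed len(history) >= k).
    Returns the differentially scaled update."""
    # Top-k right singular vectors of stacked historical updates.
    _, _, Vt = np.linalg.svd(np.stack(history), full_matrices=False)
    basis = Vt[:k]                    # (k, d) estimated dominant directions
    dom = basis.T @ (basis @ update)  # projection onto Dom-space
    bulk = update - dom               # orthogonal Bulk-space component
    return dom_scale * dom + bulk_scale * bulk
```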

Result: Achieves approximately 2x speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW. Demonstrates acceleration across various tasks.

Conclusion: BSFA provides a practical, scalable framework that leverages the dichotomy between Dom-space and Bulk-space updates to accelerate deep learning optimization while maintaining stability, making it suitable for contemporary large models.

Abstract: Recent studies \citep{gur2018gradient,song2024does, wen2024understanding} highlight a fundamental dichotomy in deep learning optimization: Although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the \textbf{Bulk-Space-Filtration-Accelerator (BSFA)}, a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both practical and scalable for contemporary large models, we introduce two key innovations: an efficient estimator using Principal Component Analysis (PCA) on historical updates for fast subspace estimation, and a block-wise strategy that applies this estimation on a per-parameter-block basis. These designs make BSFA computationally tractable and highly effective. We demonstrate BSFA’s acceleration across various tasks, notably achieving approximately 2$\times$ speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.

[311] Object-Centric World Models for Causality-Aware Reinforcement Learning

Yosuke Nishimoto, Takashi Matsubara

Main category: cs.LG

TL;DR: STICA is a reinforcement learning framework that uses object-centric Transformers as world models with causality-aware policy/value networks, outperforming SOTA in sample efficiency and performance on object-rich environments.

DetailsMotivation: Traditional world models struggle with high-dimensional, non-stationary environments with multiple interacting objects because they learn holistic representations. Humans decompose environments into discrete objects for efficient decision-making, inspiring an object-centric approach.

Method: STICA uses object-centric Transformers as world models, representing observations as sets of object-centric tokens plus action and reward tokens. The world model predicts token-level dynamics and interactions. Policy/value networks estimate token-level cause-effect relations and use them in attention layers for causality-guided decision-making.

Result: Experiments on object-rich benchmarks show STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.

Conclusion: Object-centric representations combined with causality-aware learning provide an effective approach for reinforcement learning in complex, multi-object environments, addressing limitations of holistic world models.

Abstract: World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose \emph{Slot Transformer Imagination with CAusality-aware reinforcement learning} (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause–effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.

[312] Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow

Kristiyan Sakalyan, Alessandro Palma, Filippo Guerranti, Fabian J. Theis, Stephan Günnemann

Main category: cs.LG

TL;DR: NicheFlow is a flow-based generative model that infers temporal trajectories of cellular microenvironments from spatiotemporal data using optimal transport and Variational Flow Matching on point cloud representations of cell neighborhoods.

DetailsMotivation: Current methods for modeling cellular evolution operate at single-cell level and overlook coordinated development of cellular states in tissues, despite spatial transcriptomics enabling high-resolution mapping of tissue organization across space and time.

Method: Represents local cell neighborhoods as point clouds and jointly models evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching.

Result: Successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets, including embryonic and brain development.

Conclusion: NicheFlow provides a novel approach to understanding cellular microenvironment evolution by modeling coordinated development of cellular states in tissues, addressing limitations of single-cell level methods.

Abstract: Understanding the evolution of cellular microenvironments in spatiotemporal data is essential for deciphering tissue development and disease progression. While experimental techniques like spatial transcriptomics now enable high-resolution mapping of tissue organization across space and time, current methods that model cellular evolution operate at the single-cell level, overlooking the coordinated development of cellular states in a tissue. We introduce NicheFlow, a flow-based generative model that infers the temporal trajectory of cellular microenvironments across sequential spatial slides. By representing local cell neighborhoods as point clouds, NicheFlow jointly models the evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching. Our approach successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets, from embryonic to brain development.

[313] Periodic Asynchrony: An Effective Method for Accelerating Reinforcement Learning for Large Language Models

Jian Lu

Main category: cs.LG

TL;DR: The paper proposes a periodically asynchronous framework for RL training that separates inference and training deployment, achieving 3x performance improvement on NPU platforms while maintaining algorithm accuracy equivalent to synchronous methods.

DetailsMotivation: Mainstream RL frameworks deploy inference and training on the same devices, creating computational coupling that prevents concurrent execution and limits training efficiency. The authors aim to address this bottleneck by decoupling these components.

Method: 1) Separate inference and training deployment into a periodically asynchronous framework; 2) Improve data loader to enable demand-driven, independent scaling; 3) Use unified tri-model architecture in training phase; 4) Introduce shared-prompt attention mask to reduce repetitive computation.
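
To illustrate item 4, one plausible construction of a shared-prompt attention mask (the exact layout is an assumption): the prompt is encoded once, and each of the G rollouts attends causally to the shared prompt tokens and to its own tokens, but not to sibling rollouts, so prompt computation is not repeated per response.

```python
import numpy as np

def shared_prompt_mask(prompt_len, resp_lens):
    """Boolean attention mask for one shared prompt and several responses."""
    total = prompt_len + sum(resp_lens)
    mask = np.zeros((total, total), dtype=bool)
    mask[:prompt_len, :prompt_len] = np.tril(
        np.ones((prompt_len, prompt_len), dtype=bool))   # causal over prompt
    start = prompt_len
    for L in resp_lens:
        mask[start:start + L, :prompt_len] = True        # see full prompt
        mask[start:start + L, start:start + L] = np.tril(
            np.ones((L, L), dtype=bool))                 # causal within self
        start += L
    return mask
```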

Result: Achieves at least threefold overall performance improvement in RL training on NPU platforms while maintaining algorithm accuracy completely equivalent to synchronous methods (both on-policy strategies).

Conclusion: The periodically asynchronous framework with separated inference/training deployment enables elastic scaling, reduces computational coupling, and significantly improves RL training efficiency without sacrificing accuracy, showing potential for widespread application.

Abstract: Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of separating inference and training deployment and, by improving the data loader, transform the conventional synchronous architecture into a periodically asynchronous framework. This allows demand-driven, independent, and elastic scaling of each component, while the algorithm remains exactly equivalent to the synchronous method: both are on-policy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also propose a shared-prompt attention mask to reduce repetitive computation. In practice, these changes achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating the framework’s potential for widespread application.

[314] Efficient Curvature-aware Graph Network

Chaoqun Fei, Tinglve Zhou, Tianyong Hao, Yangyang Li

Main category: cs.LG

TL;DR: Proposes Effective Resistance Curvature as a computationally efficient alternative to Ollivier-Ricci curvature for enhancing GNNs with geometric priors.

DetailsMotivation: Graph curvature provides valuable geometric priors for GNNs but existing methods like Ollivier-Ricci curvature have prohibitively high computational complexity, limiting their applicability to large-scale graphs.

Method: Introduces Effective Resistance Curvature, which quantifies message passing ease along graph edges using effective resistance between node pairs instead of optimal transport distance, offering significantly better computational efficiency.
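
The effective resistance itself has a standard closed form via the Moore-Penrose pseudoinverse of the graph Laplacian, R_uv = L⁺_uu + L⁺_vv − 2 L⁺_uv, shown below; the curvature the paper builds on top of it is not reproduced here.

```python
import numpy as np

def effective_resistance(adj):
    """All-pairs effective resistance from a (symmetric) adjacency matrix."""
    L = np.diag(adj.sum(axis=1)) - adj  # graph Laplacian
    Lp = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp
```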

Result: The method achieves competitive performance with Ollivier-Ricci curvature on diverse GNN tasks while drastically reducing computational overhead, with theoretical proofs of low complexity and substitutability.

Conclusion: Effective Resistance Curvature provides a practical, computationally efficient alternative to Ollivier-Ricci curvature for incorporating geometric priors into GNNs, enabling application to large-scale graph datasets.

Abstract: Graph curvature provides geometric priors for Graph Neural Networks (GNNs), enhancing their ability to model complex graph structures, particularly in terms of structural awareness, robustness, and theoretical interpretability. Among existing methods, Ollivier-Ricci curvature has been extensively studied due to its strong geometric interpretability, effectively characterizing the local geometric distribution between nodes. However, its prohibitively high computational complexity limits its applicability to large-scale graph datasets. To address this challenge, we propose a novel graph curvature measure–Effective Resistance Curvature–which quantifies the ease of message passing along graph edges using the effective resistance between node pairs, instead of the optimal transport distance. This method significantly outperforms Ollivier-Ricci curvature in computational efficiency while preserving comparable geometric expressiveness. Theoretically, we prove the low computational complexity of effective resistance curvature and establish its substitutability for Ollivier-Ricci curvature. Furthermore, extensive experiments on diverse GNN tasks demonstrate that our method achieves competitive performance with Ollivier-Ricci curvature while drastically reducing computational overhead.

[315] Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge

Kabir Khan, Manju Sarkar, Anita Kar, Suresh Ghosh

Main category: cs.LG

TL;DR: FedGen-Edge is a federated learning framework that decouples frozen pre-trained generative models from lightweight client adapters, using LoRA to reduce communication by 99% while enabling personalization on edge devices.

DetailsMotivation: Large generative models are hard to train in federated settings due to heavy computation/communication and statistical/system heterogeneity. There's a need for privacy-preserving, resource-aware generative AI on edge devices.

Method: Decouples frozen pre-trained global backbone from lightweight client-side adapters, uses Low-Rank Adaptation (LoRA) to constrain client updates to compact subspace, federates only adapters using FedAvg-style server.
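
The server side of this scheme reduces to FedAvg restricted to the adapter tensors, roughly as sketched below (names and the sample-weighted average are assumptions; the frozen backbone is never transmitted):

```python
import numpy as np

def fedavg_adapters(client_adapters, n_samples):
    """client_adapters: list of dicts mapping adapter names to arrays
    (e.g. LoRA A/B matrices); n_samples: per-client example counts.
    Returns the sample-weighted average of each adapter tensor."""
    total = sum(n_samples)
    keys = client_adapters[0].keys()
    return {k: sum(w * c[k] for w, c in zip(n_samples, client_adapters)) / total
            for k in keys}
```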

Result: Achieves lower perplexity/FID and faster convergence than baselines on language modeling (PTB) and image generation (CIFAR-10), reduces uplink traffic by >99% vs full-model FedAvg, stabilizes aggregation under non-IID data.

Conclusion: FedGen-Edge offers practical path toward privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices, with diminishing returns beyond moderate LoRA rank and trade-off between local epochs and client drift.

Abstract: Large generative models (for example, language and diffusion models) enable high-quality text and image synthesis but are hard to train or adapt in cross-device federated settings due to heavy computation and communication and statistical/system heterogeneity. We propose FedGen-Edge, a framework that decouples a frozen, pre-trained global backbone from lightweight client-side adapters and federates only the adapters. Using Low-Rank Adaptation (LoRA) constrains client updates to a compact subspace, which reduces uplink traffic by more than 99 percent versus full-model FedAvg, stabilizes aggregation under non-IID data, and naturally supports personalization because each client can keep a locally tuned adapter. On language modeling (PTB) and image generation (CIFAR-10), FedGen-Edge achieves lower perplexity/FID and faster convergence than strong baselines while retaining a simple FedAvg-style server. A brief ablation shows diminishing returns beyond moderate LoRA rank and a trade-off between local epochs and client drift. FedGen-Edge offers a practical path toward privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices.

[316] Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

Main category: cs.LG

TL;DR: Equivariant networks have symmetry bias but limited expressivity; increasing model size compensates while maintaining lower hypothesis complexity for better generalization.

DetailsMotivation: To understand the trade-off between symmetry constraints and expressive power in equivariant neural networks, particularly how equivariance affects expressivity and whether model size adjustments can overcome limitations while preserving generalization benefits.

Method: Analyze 2-layer ReLU networks by examining boundary hyperplanes and channel vectors, construct examples showing expressivity limitations from equivariance constraints, then demonstrate compensation through model size enlargement while measuring hypothesis space complexity.

Result: Equivariance constraints strictly limit expressive power, but enlarging model size can compensate. Despite larger models, equivariant networks maintain lower hypothesis space complexity, suggesting superior generalization capabilities.

Conclusion: Equivariant networks face expressivity-constraint trade-offs, but strategic model size increases can overcome limitations while preserving the generalization advantages of symmetry constraints through reduced hypothesis space complexity.

Abstract: Equivariant neural networks encode symmetry as an inductive bias and have achieved strong empirical performance in wide domains. However, their expressive power remains not well understood. Focusing on 2-layer ReLU networks, this paper investigates the impact of equivariance constraints on the expressivity of equivariant and layer-wise equivariant networks. By examining the boundary hyperplanes and the channel vectors of ReLU networks, we construct an example showing that equivariance constraints could strictly limit expressive power. However, we demonstrate that this drawback can be compensated via enlarging the model size. Furthermore, we show that despite a larger model size, the resulting architecture could still correspond to a hypothesis space with lower complexity, implying superior generalizability for equivariant networks.

[317] Fairness-Aware Graph Representation Learning with Limited Demographic Information

Zichong Wang, Zhipeng Yin, Liping Yang, Jun Zhuang, Rui Yu, Qingzhao Kong, Wenbin Zhang

Main category: cs.LG

TL;DR: FairGLite: A lightweight fair graph learning framework that mitigates bias with limited demographic information using proxy generation, consistent embeddings, and adaptive confidence strategies.

DetailsMotivation: Most existing fair graph learning methods require full demographic information, which is rarely available in practice due to privacy, legal, or regulatory restrictions. There's a need for fair graph learning that works with limited demographic data.

Method: 1) Uses partial demographic data to generate proxies for missing demographic information; 2) Enforces consistent node embeddings across demographic groups; 3) Implements adaptive confidence strategy that dynamically adjusts each node’s contribution to fairness and utility based on prediction confidence.

Result: Theoretical analysis shows FairGLite achieves provable upper bounds on group fairness metrics. Extensive experiments on multiple datasets and fair graph learning frameworks demonstrate effectiveness in both mitigating bias and maintaining model utility.

Conclusion: FairGLite provides a practical solution for fair graph learning with limited demographic information, offering formal guarantees for bias mitigation while preserving model performance.

Abstract: Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node’s contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework’s effectiveness in both mitigating bias and maintaining model utility.

[318] An Efficient Embedding Based Ad Retrieval with GPU-Powered Feature Interaction

Yifan Lei, Jiahua Luo, Tingyu Jiang, Bo Zhang, Lifeng Wang, Dapeng Liu, Zhaoren Wu, Haijie Gu, Huan Yu, Jie Jiang

Main category: cs.LG

TL;DR: Proposes GPU-based feature interaction for dual-tower retrieval networks to improve accuracy while reducing computational costs, first industry implementation of Wide & Deep in retrieval systems.

DetailsMotivation: Dual-tower EBR models in ad retrieval have insufficient feature interaction (only final inner product), while DNN models with early interaction are computationally infeasible for retrieval stage.

Method: Introduces efficient GPU-based feature interaction using novel compressed inverted list for GPU acceleration, enabling feature interaction at scale in retrieval systems.

Result: Outperforms existing approaches in offline evaluation, successfully deployed to Tencent Advertising with significant online performance gains.

Conclusion: First industry framework to implement Wide and Deep in retrieval, validates effectiveness and provides practical guidance for optimizing large-scale ad retrieval systems.

Abstract: In large-scale advertising recommendation systems, retrieval serves as a critical component, aiming to efficiently select a subset of candidate ads relevant to user behaviors from a massive ad inventory for subsequent ranking and recommendation. The Embedding-Based Retrieval (EBR) methods modeled by the dual-tower network are widely used in the industry to maintain both retrieval efficiency and accuracy. However, the dual-tower model has significant limitations: the embeddings of users and ads interact only at the final inner product computation, resulting in insufficient feature interaction capabilities. Although DNN-based models with both user and ad as input features, allowing for early-stage interaction between these features, are introduced in the ranking stage to mitigate this issue, they are computationally infeasible for the retrieval stage. To bridge this gap, this paper proposes an efficient GPU-based feature interaction for the dual-tower network to significantly improve retrieval accuracy while substantially reducing computational costs. Specifically, we introduce a novel compressed inverted list designed for GPU acceleration, enabling efficient feature interaction computation at scale. To the best of our knowledge, this is the first framework in the industry to successfully implement Wide and Deep in a retrieval system. We apply this model to the real-world business scenarios in Tencent Advertising, and experimental results demonstrate that our method outperforms existing approaches in offline evaluation and has been successfully deployed to Tencent’s advertising recommendation system, delivering significant online performance gains. This improvement not only validates the effectiveness of the proposed method, but also provides new practical guidance for optimizing large-scale ad retrieval systems.

[319] Scaling Behavior of Discrete Diffusion Language Models

Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto

Main category: cs.LG

TL;DR: Discrete diffusion language models (DLMs) have different scaling laws than autoregressive models, with uniform diffusion being more parameter-efficient and data-efficient than masked diffusion, especially in data-limited scenarios.

DetailsMotivation: Understanding scaling behavior is crucial for LLM pre-training due to massive compute/data requirements. While DLMs offer an alternative to autoregressive models, their scaling laws remain underexplored, with prior work suggesting they need more resources to match ALM performance.

Method: Study DLM scaling behavior across different noise types by smoothly interpolating between masked and uniform diffusion. Carefully tune hyperparameters like batch size and learning rate. Scale uniform diffusion models up to 10B parameters trained for 10^22 FLOPs.

Result: DLM scaling strongly depends on noise type and differs from ALMs. While all noise types converge to similar loss in compute-bound scaling, uniform diffusion requires more parameters but less data for compute-efficient training compared to masked diffusion. The 10B uniform diffusion model confirms predicted scaling behavior.

Conclusion: Uniform diffusion DLMs are promising for data-bound settings due to their parameter efficiency and reduced data requirements, making them a viable alternative to autoregressive models with different scaling characteristics.

Abstract: Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.

[320] Predictive Modeling of Flood-Prone Areas Using SAR and Environmental Variables

Edwin Oluoch Awino, Denis Machanda

Main category: cs.LG

TL;DR: SAR imagery combined with environmental data and machine learning models to map flood susceptibility in Kenya’s River Nyando watershed, with Random Forest showing best performance.

DetailsMotivation: Flooding is a destructive natural hazard worldwide, posing serious risks to ecosystems, infrastructure, and human livelihoods. There's a need for effective flood susceptibility mapping, particularly in data-limited regions like western Kenya.

Method: Combined Sentinel-1 SAR imagery from the May 2024 flood event with six conditioning factors (slope, elevation, aspect, land use/land cover, soil type, distance from streams). Used the SAR-derived flood inventory as training data for four ML classifiers: Logistic Regression, Classification and Regression Trees, Support Vector Machines, and Random Forest.
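
A minimal sketch of the classification step with scikit-learn, assuming the six conditioning factors have already been sampled into feature rows labeled by the SAR-derived flood inventory; the synthetic arrays and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5000, 6))      # slope, elevation, aspect, LULC, soil, dist-to-stream
y = rng.integers(0, 2, 5000)   # SAR-derived flood / non-flood labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print(accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred))

# Susceptibility map = per-pixel probability of the flood class.
susceptibility = rf.predict_proba(X_te)[:, 1]
```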

Result: Random Forest achieved highest predictive performance (accuracy = 0.762; Kappa = 0.480), outperforming other models. RF-based susceptibility map identified low-lying Kano Plains near Lake Victoria as having highest flood vulnerability, consistent with historical records and May 2024 event impacts.

Conclusion: Combining SAR data with ensemble ML methods is valuable for flood susceptibility mapping in data-limited regions. The resulting maps provide important insights for disaster risk reduction, land-use planning, and early warning system development.

Abstract: Flooding is one of the most destructive natural hazards worldwide, posing serious risks to ecosystems, infrastructure, and human livelihoods. This study combines Synthetic Aperture Radar (SAR) imagery with environmental and hydrological data to model flood susceptibility in the River Nyando watershed, western Kenya. Sentinel-1 dual-polarization SAR data from the May 2024 flood event were processed to produce a binary flood inventory, which served as training data for machine learning (ML) models. Six conditioning factors – slope, elevation, aspect, land use/land cover, soil type, and distance from streams – were integrated with the SAR-derived flood inventory to train four supervised classifiers: Logistic Regression (LR), Classification and Regression Trees (CART), Support Vector Machines (SVM), and Random Forest (RF). Model performance was assessed using accuracy, Cohen’s Kappa, and Receiver Operating Characteristic (ROC) analysis. Results indicate that RF achieved the highest predictive performance (accuracy = 0.762; Kappa = 0.480), outperforming LR, CART, and SVM. The RF-based susceptibility map showed that low-lying Kano Plains near Lake Victoria have the highest flood vulnerability, consistent with historical flood records and the impacts of the May 2024 event. These findings demonstrate the value of combining SAR data and ensemble ML methods for flood susceptibility mapping in regions with limited data. The resulting maps offer important insights for disaster risk reduction, land-use planning, and early warning system development.

[321] M2RU: Memristive Minion Recurrent Unit for On-Chip Continual Learning at the Edge

Abdullah M. Zyarah, Dhireesha Kudithipudi

Main category: cs.LG

TL;DR: M2RU is a mixed-signal architecture implementing minion recurrent unit for efficient temporal processing with on-chip continual learning, achieving 312 GOPS/W and 29× energy efficiency improvement over CMOS digital designs.

DetailsMotivation: Continual learning on edge platforms is challenging due to energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. There's a need for efficient temporal processing with on-chip continual learning capabilities.

Method: Introduces M2RU architecture with: 1) Weighted-bit streaming enabling multi-bit digital inputs to be processed in crossbars without high-resolution conversion, 2) Experience replay mechanism to stabilize learning under domain shifts, 3) Implementation of minion recurrent unit for efficient temporal processing.

Result: Achieves 15 GOPS at 48.62 mW (312 GOPS/W), maintains accuracy within 5% of software baselines on sequential MNIST and CIFAR-10 tasks, provides 29× improvement in energy efficiency compared to CMOS digital design, and shows expected operational lifetime of 12.2 years under continual learning workloads.

Conclusion: M2RU establishes a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence, addressing the challenges of continual learning on edge platforms through mixed-signal architecture innovations.

Abstract: Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS per watt, and maintains accuracy within 5 percent of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides 29X improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.

[322] Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

Yidong Chai, Yi Liu, Mohammadreza Ebrahimi, Weifeng Li, Balaji Padmanabhan

Main category: cs.LG

TL;DR: LLM-SGA framework enhances adversarial robustness for harmful content detection using LLM-based sample generation and aggregation, with ARHOCD detector achieving strong generalizability and improved accuracy through ensemble methods and dynamic weight assignment.

DetailsMotivation: Social media platforms face harmful content (hate speech, misinformation, extremism) that evades ML detection through adversarial attacks. Current detectors lack robustness against diverse attacks while maintaining accuracy.

Method: Two-phase approach: 1) LLM-SGA framework identifies textual adversarial attack invariances for generalizability; 2) ARHOCD detector with three novel components: ensemble of base detectors, dynamic weight assignment using Bayesian inference and domain knowledge, and iterative adversarial training optimizing both detectors and weight assignor.
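
A toy sketch of Bayesian dynamic weighting for an ensemble of base detectors, using conjugate Beta-Bernoulli updates. The paper's weight assignor additionally conditions on per-sample predictability and domain knowledge, which this skeleton omits; the detectors here are made up.

```python
import numpy as np

class WeightedEnsemble:
    def __init__(self, n_detectors, prior_a=1.0, prior_b=1.0):
        # Beta(a, b) posterior over each detector's reliability;
        # domain knowledge would set informative priors here.
        self.a = np.full(n_detectors, prior_a)
        self.b = np.full(n_detectors, prior_b)

    def weights(self):
        return self.a / (self.a + self.b)   # posterior mean reliability

    def predict(self, probs):
        """probs: (n_detectors,) harmful-class probabilities."""
        w = self.weights()
        return float(w @ np.asarray(probs) / w.sum())

    def update(self, probs, label):
        correct = (np.asarray(probs) > 0.5) == label
        self.a += correct          # conjugate Beta-Bernoulli update
        self.b += ~correct

ens = WeightedEnsemble(n_detectors=3)
ens.update([0.9, 0.2, 0.7], label=1)
print(ens.weights(), ens.predict([0.8, 0.4, 0.6]))
```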

Result: Empirical evaluation across three datasets (hate speech, rumor, extremist content) shows ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions, addressing limitations of existing adversarial robustness research.

Conclusion: The proposed LLM-SGA framework and ARHOCD detector successfully enhance adversarial robustness for harmful content detection, achieving both strong generalizability against diverse attacks and improved accuracy through novel ensemble and training strategies.

Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample’s predictability and each base detector’s capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.

[323] When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper identifies sufficient conditions for renormalizable coarse-grained dynamics in deep learning systems and shows how power-law scaling emerges as a rigidity consequence when log-shift invariance combines with gradient flow time-rescaling covariance.

DetailsMotivation: Power-law scaling is widely observed in deep learning systems but its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework provides a coarse-grained description, but power-law behavior isn't automatic and requires additional structural properties.

Method: The authors identify sufficient conditions for GRSD shell dynamics to admit a renormalizable coarse-grained description. These conditions include: boundedness of gradient propagation, weak functional incoherence at initialization, controlled Jacobian evolution during training, and log-shift invariance of renormalized shell couplings. They analyze how power-law scaling emerges from the combination of log-shift invariance and time-rescaling covariance of gradient flow.

Result: The paper establishes that power-law scaling doesn’t follow from renormalizability alone, but arises as a rigidity consequence: when log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form.
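
The rigidity step can be previewed with the standard functional-equation argument; this is a hedged one-variable caricature, and the paper's hypotheses and definitions are more refined.

```latex
% Suppose the renormalized velocity field satisfies log-shift invariance:
\[
  v(\ell + c) = g(c)\, v(\ell) \qquad \text{for all shifts } c .
\]
% Then $g(c_1 + c_2) = g(c_1)\, g(c_2)$, whose measurable solutions are
% exponentials, $g(c) = e^{\alpha c}$, so $v(\ell) = v(0)\, e^{\alpha \ell}$.
% In the resolution variable $\lambda = e^{\ell}$ this is a power law,
\[
  v(\lambda) = v(1)\, \lambda^{\alpha},
\]
% and the time-rescaling covariance of gradient flow constrains how the
% exponent $\alpha$ enters the shell dynamics.
```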

Conclusion: The work provides theoretical foundations for understanding power-law scaling in deep learning by identifying specific structural conditions under which renormalizable coarse-grained dynamics emerge, and showing how power-law behavior is enforced by the interplay between log-shift invariance and gradient flow properties.

Abstract: Empirical power–law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution–Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse–grained dynamical description of training. Within GRSD, power–law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse–grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log–shift invariance of renormalized shell couplings. We further show that power–law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log–shift invariance is combined with the intrinsic time–rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power–law form.

[324] Learning from sanctioned government suppliers: A machine learning and network science approach to detecting fraud and corruption in Mexico

Martí Medina-Hernández, Janos Kertész, Mihály Fazekas

Main category: cs.LG

TL;DR: PU learning algorithm combining network features and traditional red flags outperforms conventional methods in detecting corrupt public procurement contracts in Mexico.

DetailsMotivation: Detecting fraud in public procurement is challenging due to the lack of confirmed non-corrupt examples, which makes conventional supervised ML inappropriate; methods that can learn from positive-unlabeled data are needed.

Method: Positive-unlabeled (PU) learning algorithms integrating domain-knowledge-based red flags with network-derived features using Mexican federal procurement data and company sanction records.
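
One common PU-learning recipe consistent with this setup is bagging with random unlabeled subsamples treated as provisional negatives (Mordelet and Vert's approach, shown here as a generic sketch rather than the paper's exact algorithm).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pu_bagging(X_pos, X_unl, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X_unl))
    counts = np.zeros(len(X_unl))
    for _ in range(n_rounds):
        # Treat a random unlabeled subsample as provisional negatives.
        idx = rng.choice(len(X_unl), size=len(X_pos), replace=True)
        X = np.vstack([X_pos, X_unl[idx]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(idx))]
        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
        # Score only out-of-bag unlabeled points with this round's model.
        oob = np.setdiff1d(np.arange(len(X_unl)), idx)
        scores[oob] += clf.predict_proba(X_unl[oob])[:, 1]
        counts[oob] += 1
    return scores / np.maximum(counts, 1)   # per-contract risk score

X_pos = np.random.rand(100, 8)   # sanctioned contracts (red flags + network features)
X_unl = np.random.rand(2000, 8)  # unlabeled contracts
risk = pu_bagging(X_pos, X_unl)
```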

Result: Best PU model captures 32% more known positives and performs 2.3x better than random guessing, substantially outperforming traditional red flag approaches. Network features (contracts in network core, supplier eigenvector centrality) are most important.

Conclusion: The methodology effectively identifies likely corrupt contracts and can support law enforcement in Mexico while being adaptable to other national contexts.

Abstract: Detecting fraud and corruption in public procurement remains a major challenge for governments worldwide. Most research to-date builds on domain-knowledge-based corruption risk indicators of individual contract-level features and some also analyzes contracting network patterns. A critical barrier for supervised machine learning is the absence of confirmed non-corrupt, negative, examples, which makes conventional machine learning inappropriate for this task. Using publicly available data on federally funded procurement in Mexico and company sanction records, this study implements positive-unlabeled (PU) learning algorithms that integrate domain-knowledge-based red flags with network-derived features to identify likely corrupt and fraudulent contracts. The best-performing PU model on average captures 32 percent more known positives and performs on average 2.3 times better than random guessing, substantially outperforming approaches based solely on traditional red flags. The analysis of the Shapley Additive Explanations reveals that network-derived features, particularly those associated with contracts in the network core or suppliers with high eigenvector centrality, are the most important. Traditional red flags further enhance model performance in line with expectations, albeit mainly for contracts awarded through competitive tenders. This methodology can support law enforcement in Mexico, and it can be adapted to other national contexts too.

[325] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Tianle Cai, Wenhao Huang

Main category: cs.LG

TL;DR: This paper addresses LLM hallucinations by proposing behavior calibration methods that incentivize models to admit uncertainty through abstention, enabling smaller models to surpass frontier models in uncertainty quantification.

DetailsMotivation: LLM hallucinations impede deployment in critical domains. Current RLVR paradigms with binary rewards incentivize models to guess whenever the probability of being correct exceeds zero rather than to communicate honestly; theoretical work suggests hallucination is not mere stochastic error but a predictable statistical consequence of the training objective.

Method: Proposes behavior calibration training interventions optimizing strictly proper scoring rules for models to output calibrated probability of correctness. Models learn to either abstain from complete responses or flag uncertain individual claims. Uses Qwen3-4B-Instruct for empirical analysis.
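
A toy reward schedule illustrating why abstention becomes rational under such training: with reward 1 for a correct answer, 0 for abstaining, and -t/(1-t) for a wrong one, a model maximizing expected reward answers only when its correctness probability exceeds t. This is one common construction, not the paper's full design, which optimizes richer strictly proper scoring rules.

```python
def reward(outcome: str, t: float = 0.75) -> float:
    if outcome == "abstain":
        return 0.0
    if outcome == "correct":
        return 1.0
    return -t / (1 - t)    # wrong answers are penalized

# Expected reward of answering with correctness probability p:
#   p * 1 + (1 - p) * (-t / (1 - t)) > 0  <=>  p > t,
# so guessing below confidence t is dominated by abstaining.
p = 0.6
ev_answer = p * reward("correct") + (1 - p) * reward("wrong")
print(ev_answer < reward("abstain"))   # True: abstain when p = 0.6 < t = 0.75
```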

Result: Behavior-calibrated RL allows smaller models to surpass frontier models in uncertainty quantification. On math reasoning (BeyondAIME), log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207). In factual QA (SimpleQA), 4B LLM achieves zero-shot calibration error on par with Grok-4 and Gemini-2.5-Pro despite lower factual accuracy.

Conclusion: Behavior calibration enables models to become honest communicators by admitting uncertainty, creating a transferable meta-skill decoupled from raw predictive accuracy. This approach addresses fundamental limitations of current RLVR paradigms and enables safer LLM deployment in critical domains.

Abstract: LLM deployment in critical domains is currently impeded by persistent hallucinations–generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification–a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model’s log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.

[326] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao

Main category: cs.LG

TL;DR: TableGPT-R1 is a specialized tabular model using reinforcement learning to enhance complex reasoning and code execution on tabular data, overcoming challenges of trajectory scarcity, heterogeneous feedback, and catastrophic forgetting.

DetailsMotivation: Current LLMs fine-tuned via SFT fall short in handling complex multi-step reasoning and robust code execution needed for real-world table tasks. RL offers promise but faces three critical hurdles: scarcity of high-quality agentic trajectories with closed-loop execution, extreme heterogeneity of feedback signals, and risk of catastrophic forgetting during specialization.

Method: Systematic RL framework with: 1) Comprehensive data engineering pipeline synthesizing difficulty-stratified agentic trajectories for supervised alignment and RL rollouts, 2) Task-adaptive reward system combining rule-based verification with criteria-injected reward model and process-level step reward shaping with behavioral regularization, 3) Multi-stage training framework progressively stabilizing reasoning before specializing in table-specific tasks.

Result: TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities.

Conclusion: The proposed RL framework successfully overcomes key challenges in tabular data reasoning, enabling advanced capabilities for complex table tasks while maintaining general knowledge, with the model publicly available for use.

Abstract: Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce \textbf{TableGPT-R1}, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.

[327] DiEC: Diffusion Embedded Clustering

Haidong Hu

Main category: cs.LG

TL;DR: DiEC is a diffusion-based clustering method that selects optimal representations from pretrained diffusion model trajectories (layer × timestep) for improved clustering performance.

DetailsMotivation: Most deep clustering methods use single embeddings from autoencoders or self-supervised encoders, but diffusion models produce rich representation trajectories where clusterability varies significantly across layers and timesteps. The authors want to exploit this trajectory for better clustering.

Method: Two-stage approach: 1) Uses U-Net bottleneck as Clustering Middle Layer (CML, l*), 2) Identifies Clustering-Optimal Timestep (COT, t*) via efficient subset-based, noise-averaged search. Then learns clustering embeddings through lightweight residual mapping with DEC-style KL self-training objective and structural regularization, plus random-timestep denoising-consistency loss for stability.
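
The DEC-style KL self-training objective named here is standard (Xie et al.'s DEC) and can be sketched directly; DiEC's structural regularization and denoising-consistency losses are omitted.

```python
import torch
import torch.nn.functional as F

def soft_assign(z, centroids, alpha=1.0):
    """Student-t kernel: q_ij = soft assignment of embedding i to centroid j."""
    d2 = torch.cdist(z, centroids) ** 2
    q = (1 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened targets p_ij = q_ij^2 / f_j, renormalized per sample."""
    p = q ** 2 / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

z = torch.randn(256, 32, requires_grad=True)      # clustering embeddings
centroids = torch.randn(10, 32, requires_grad=True)
q = soft_assign(z, centroids)
loss = F.kl_div(q.log(), target_distribution(q).detach(), reduction="batchmean")
loss.backward()
```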

Result: Experiments on standard benchmarks show DiEC achieves strong clustering performance and demonstrates the importance of selecting appropriate diffusion representations for clustering.

Conclusion: Diffusion models provide valuable representation trajectories for clustering, and DiEC effectively exploits these by selecting optimal layer-timestep combinations, outperforming methods that use single embeddings.

Abstract: Deep clustering critically depends on representations that expose clear cluster structure, yet most prior methods learn a single embedding with an autoencoder or a self-supervised encoder and treat it as the primary representation for clustering. In contrast, a pretrained diffusion model induces a rich representation trajectory over network layers and noise timesteps, along which clusterability varies substantially. We propose Diffusion Embedded Clustering (DiEC), an unsupervised clustering framework that exploits this trajectory by directly leveraging intermediate activations of a pretrained diffusion U-Net. DiEC formulates representation selection over layer × timestep and adopts a practical two-stage procedure: it uses the U-Net bottleneck as the Clustering Middle Layer (CML, l*) and identifies the Clustering-Optimal Timestep (COT, t*) via an efficient subset-based, noise-averaged search. Conditioning on (l*, t*), DiEC learns clustering embeddings through a lightweight residual mapping, optimized with a DEC-style KL self-training objective and structural regularization, while a parallel random-timestep denoising-consistency loss stabilizes training and preserves diffusion behavior. Experiments on standard benchmarks demonstrate that DiEC achieves strong clustering performance and reveal the importance of selecting diffusion representations for clustering.

cs.MA

[328] The AI Committee: A Multi-Agent Framework for Automated Validation and Remediation of Web-Sourced Data

Sunith Vallabhaneni, Thomas Berkane, Maimuna Majumder

Main category: cs.MA

TL;DR: AI Committee: A multi-agent system using specialized LLM agents to automatically validate and fix errors in web-sourced datasets, achieving up to 78.7% completeness and 100% precision without task-specific training.

DetailsMotivation: Manual web data collection for research is labor-intensive and error-prone. Current LLM-powered web agents struggle with data validity issues like hallucinations, omissions, misinterpretations, and failure to detect invalid information, which are subtle and hard to correct manually.

Method: Introduces AI Committee, a model-agnostic multi-agent system where each agent specializes in different data quality tasks (source scrutiny, fact-checking, data remediation, integrity validation). Leverages LLM capabilities like in-context learning for dataset adaptation, chain-of-thought reasoning for semantic validation, and self-correction loops for remediation - all without task-specific training.

Result: Applied to three real-world datasets, the system generalizes across LLMs and significantly outperforms baselines, achieving data completeness up to 78.7% and precision up to 100%. Ablation study shows contribution of each agent to overall performance.

Conclusion: AI Committee effectively automates validation and remediation of web-sourced datasets, addressing key failure modes of current LLM web agents. Released as open-source tool for research community.

Abstract: Many research areas rely on data from the web to gain insights and test their methods. However, collecting comprehensive research datasets often demands manually reviewing many web pages to identify and record relevant data points, which is labor-intensive and susceptible to error. While the emergence of large language models (LLM)-powered web agents has begun to automate parts of this process, they often struggle to ensure the validity of the data they collect. Indeed, these agents exhibit several recurring failure modes - including hallucinating or omitting values, misinterpreting page semantics, and failing to detect invalid information - which are subtle and difficult to detect and correct manually. To address this, we introduce the AI Committee, a novel model-agnostic multi-agent system that automates the process of validating and remediating web-sourced datasets. Each agent is specialized in a distinct task in the data quality assurance pipeline, from source scrutiny and fact-checking to data remediation and integrity validation. The AI Committee leverages various LLM capabilities - including in-context learning for dataset adaptation, chain-of-thought reasoning for complex semantic validation, and a self-correction loop for data remediation - all without task-specific training. We demonstrate the effectiveness of our system by applying it to three real-world datasets, showing that it generalizes across LLMs and significantly outperforms baseline approaches, achieving data completeness up to 78.7% and precision up to 100%. We additionally conduct an ablation study demonstrating the contribution of each agent to the Committee’s performance. This work is released as an open-source tool for the research community.

[329] PERELMAN: Pipeline for scientific literature meta-analysis. Technical report

Daniil Sherki, Daniil Merkulov, Alexandra Savina, Ekaterina Muravleva

Main category: cs.MA

TL;DR: PERELMAN is an agentic framework that automates large-scale scientific literature meta-analysis by extracting and unifying information from heterogeneous articles through coordinated AI agents guided by domain expert knowledge.

DetailsMotivation: Traditional literature reviews and meta-analyses are time-consuming and labor-intensive, often taking months to prepare. Existing approaches also struggle with heterogeneous article formats and inconsistent data representation across studies.

Method: PERELMAN uses a structured dialogue with subject-matter experts to elicit domain knowledge (target variables, inclusion criteria, units, normalization rules). This knowledge guides coordinated agents that extract evidence from narrative text, tables, and figures, transforming heterogeneous content into a unified, machine-readable representation.

Result: The system was evaluated by reproducing a meta-analysis of layered Li-ion cathode properties (NMC811 material) to assess reproducibility and validate implementation. The framework demonstrates potential to reduce meta-analysis preparation time from months to minutes.

Conclusion: PERELMAN represents a significant advancement in automating scientific literature analysis, offering a scalable solution for evidence synthesis that maintains consistency across studies while dramatically reducing the time and effort required for meta-analyses.

Abstract: We present PERELMAN (PipEline foR sciEntific Literature Meta-ANalysis), an agentic framework designed to extract specific information from a large corpus of scientific articles to support large-scale literature reviews and meta-analyses. Our central goal is to reliably transform heterogeneous article content into a unified, machine-readable representation. PERELMAN first elicits domain knowledge-including target variables, inclusion criteria, units, and normalization rules-through a structured dialogue with a subject-matter expert. This domain knowledge is then reused across multiple stages of the pipeline and guides coordinated agents in extracting evidence from narrative text, tables, and figures, enabling consistent aggregation across studies. In order to assess reproducibility and validate our implementation, we evaluate the system on the task of reproducing the meta-analysis of layered Li-ion cathode properties (NMC811 material). We describe our solution, which has the potential to reduce the time required to prepare meta-analyses from months to minutes.

[330] MASFIN: A Multi-Agent System for Decomposed Financial Reasoning and Forecasting

Marc S. Montalvo, Hamed Yaghoobian

Main category: cs.MA

TL;DR: MASFIN is a modular multi-agent framework that integrates LLMs with financial metrics and news to generate weekly portfolios with bias mitigation, achieving 7.33% cumulative return over 8 weeks.

DetailsMotivation: Traditional quantitative methods suffer from survivorship bias, while AI approaches struggle with signal integration, reproducibility, and computational efficiency in high-stakes financial environments requiring transparent analysis.

Method: MASFIN uses a modular multi-agent framework integrating LLMs (GPT-4.1-nano) with structured financial metrics and unstructured news, embedding explicit bias-mitigation protocols for reproducible, cost-efficient inference.

Result: In 8-week evaluation, MASFIN delivered 7.33% cumulative return, outperforming S&P 500, NASDAQ-100, and Dow Jones benchmarks in 6 of 8 weeks, though with higher volatility.

Conclusion: The framework demonstrates promise for bias-aware generative AI in financial forecasting and highlights opportunities for modular multi-agent design to advance practical, transparent, and reproducible quantitative finance approaches.

Abstract: Recent advances in large language models (LLMs) are transforming data-intensive domains, with finance representing a high-stakes environment where transparent and reproducible analysis of heterogeneous signals is essential. Traditional quantitative methods remain vulnerable to survivorship bias, while many AI-driven approaches struggle with signal integration, reproducibility, and computational efficiency. We introduce MASFIN, a modular multi-agent framework that integrates LLMs with structured financial metrics and unstructured news, while embedding explicit bias-mitigation protocols. The system leverages GPT-4.1-nano for reproducibility and cost-efficient inference and generates weekly portfolios of 15-30 equities with allocation weights optimized for short-term performance. In an eight-week evaluation, MASFIN delivered a 7.33% cumulative return, outperforming the S&P 500, NASDAQ-100, and Dow Jones benchmarks in six of eight weeks, albeit with higher volatility. These findings demonstrate the promise of bias-aware, generative AI frameworks for financial forecasting and highlight opportunities for modular multi-agent design to advance practical, transparent, and reproducible approaches in quantitative finance.

[331] Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity

Vik Pant, Eric Yu

Main category: cs.MA

TL;DR: This paper develops computational foundations to bridge the gap between qualitative conceptual modeling (i*) and quantitative game theory for analyzing strategic coopetition, formalizing interdependence and complementarity dimensions.

DetailsMotivation: Modern socio-technical systems involve strategic coopetition where actors simultaneously cooperate to create value and compete to capture it. Existing approaches have limitations: conceptual modeling languages like i* provide rich qualitative representations but lack quantitative analysis, while classical game theory offers mathematical rigor but strips away contextual richness.

Method: The paper develops computational foundations that formalize two critical dimensions of coopetition: (1) interdependence grounded in i* structural dependency analysis, translating relationships into quantitative interdependence coefficients, and (2) complementarity following Brandenburger and Nalebuff’s Added Value concept. It integrates structural dependencies with bargaining power in value appropriation and introduces a game-theoretic formulation where Nash Equilibrium incorporates structural interdependence.

Result: The approach was validated through comprehensive experimental testing comprising over 22,000 trials across power and logarithmic value function specifications, demonstrating functional form robustness. An empirical application to the Samsung-Sony S-LCD joint venture (2004-2011) was also conducted.

Conclusion: This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in multi-agent systems, bridging the gap between qualitative conceptual modeling and quantitative game theory for analyzing simultaneous cooperation and competition dynamics.

Abstract: Coopetition refers to simultaneous cooperation and competition among actors who “cooperate to grow the pie and compete to split it up.” Modern socio-technical systems are characterized by strategic coopetition in which actors concomitantly cooperate to create value and compete to capture it. While conceptual modeling languages such as i* provide rich qualitative representations of strategic dependencies, they lack mechanisms for quantitative analysis of dynamic trade-offs. Conversely, classical game theory offers mathematical rigor but strips away contextual richness. This technical report bridges this gap by developing computational foundations that formalize two critical dimensions of coopetition: interdependence and complementarity. We ground interdependence in i* structural dependency analysis, translating depender-dependee-dependum relationships into quantitative interdependence coefficients through a structured translation framework. We formalize complementarity following Brandenburger and Nalebuff’s Added Value concept, modeling synergistic value creation with validated parameterization. We integrate structural dependencies with bargaining power in value appropriation and introduce a game-theoretic formulation where Nash Equilibrium incorporates structural interdependence. Validation combines comprehensive experimental testing comprising over 22,000 trials across power and logarithmic value function specifications, demonstrating functional form robustness, with empirical application to the Samsung-Sony S-LCD joint venture (2004-2011). This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in multi-agent systems, with companion work addressing trust dynamics, collective action, and reciprocity mechanisms.

[332] A Plan Reuse Mechanism for LLM-Driven Agent

Guopeng Li, Ruiqi Wu, Haisheng Tan

Main category: cs.MA

TL;DR: AgentReuse: A plan reuse mechanism for LLM-driven agents that reduces latency by 93.12% through semantic similarity matching and intent classification, achieving 93% effective plan reuse rate.

DetailsMotivation: LLM-driven agents suffer from high latency (tens of seconds) when generating plans for user requests. Analysis shows ~30% of requests are identical or similar, enabling plan reuse to reduce latency, but direct text similarity evaluation is difficult due to natural language diversity and unstructured plan formats.

Method: AgentReuse uses semantic similarity analysis and intent classification to evaluate request similarities. It leverages the similarities and differences among requests’ semantics to enable plan reuse, addressing challenges of diverse natural language expressions and unstructured plan texts.
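
A minimal cache illustrating the reuse mechanism: embed the request, gate on a predicted intent, and reuse a stored plan when cosine similarity clears a threshold. The embedding model, threshold, `classify_intent` stub, and `expensive_llm_plan` helper are all hypothetical stand-ins for AgentReuse's actual components.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str, str]] = []   # (embedding, intent, plan)

def classify_intent(request: str) -> str:
    # Placeholder intent classifier; AgentReuse trains a real one.
    return "iot_control" if "turn" in request.lower() else "qa"

def expensive_llm_plan(request: str) -> str:
    # Stand-in for the tens-of-seconds LLM planning call.
    return f"plan for: {request}"

def get_plan(request: str, threshold: float = 0.9) -> str:
    emb = model.encode(request, normalize_embeddings=True)
    intent = classify_intent(request)
    for cached_emb, cached_intent, plan in cache:
        # Cosine similarity (embeddings are unit-normalized).
        if cached_intent == intent and float(emb @ cached_emb) >= threshold:
            return plan                          # reuse: skip the LLM call
    plan = expensive_llm_plan(request)
    cache.append((emb, intent, plan))
    return plan

print(get_plan("Turn on the living room lights"))
print(get_plan("Please turn the living room lights on"))   # likely reused
```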

Result: Achieves 93% effective plan reuse rate, F1 score of 0.9718, accuracy of 0.9459 in evaluating request similarities, and reduces latency by 93.12% compared to baselines without reuse mechanism.

Conclusion: AgentReuse successfully addresses the latency problem in LLM-driven agents by enabling efficient plan reuse through semantic similarity and intent classification, significantly improving user experience while maintaining high accuracy in request matching.

Abstract: Integrating large language models (LLMs) into personal assistants, like Xiao Ai and Blue Heart V, effectively enhances their ability to interact with humans, solve complex tasks, and manage IoT devices. Such assistants are also termed LLM-driven agents. Upon receiving user requests, the LLM-driven agent generates plans using an LLM, executes these plans through various tools, and then returns the response to the user. During this process, the latency for generating a plan with an LLM can reach tens of seconds, significantly degrading user experience. Real-world dataset analysis shows that about 30% of the requests received by LLM-driven agents are identical or similar, which allows the reuse of previously generated plans to reduce latency. However, it is difficult to accurately define the similarity between the request texts received by the LLM-driven agent through directly evaluating the original request texts. Moreover, the diverse expressions of natural language and the unstructured format of plan texts make implementing plan reuse challenging. To address these issues, we present and implement a plan reuse mechanism for LLM-driven agents called AgentReuse. AgentReuse leverages the similarities and differences among requests’ semantics and uses intent classification to evaluate the similarities between requests and enable the reuse of plans. Experimental results based on a real-world dataset demonstrate that AgentReuse achieves a 93% effective plan reuse rate, an F1 score of 0.9718, and an accuracy of 0.9459 in evaluating request similarities, reducing latency by 93.12% compared with baselines without using the reuse mechanism.

cs.MM

eess.AS

[333] Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning

YuXiang Kong, JunFeng Hou, Jian Tang, Bingqing Zhu, Jicheng Zhang, Shaofei Xue

Main category: eess.AS

TL;DR: A two-stage framework combining hotword retrieval with LLM-ASR adaptation for effective large-vocabulary contextual biasing in speech recognition.

DetailsMotivation: LLM-based ASR performs well but struggles with contextual biasing for named entities and hotwords under large vocabularies, requiring a scalable solution.

Method: 1) Extend GLCLAP model with robustness-aware data augmentation and fuzzy matching to retrieve top-k hotword candidates from large vocabulary. 2) Inject retrieved candidates as textual prompts into LLM-ASR and fine-tune with GRPO using joint reward for hotword recognition and transcription accuracy.
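
To make the two-stage flow concrete, here is a toy retrieval-plus-prompt-injection sketch using edit-distance fuzzy matching in place of GLCLAP's audio-text retrieval; the names and prompt template are invented for illustration.

```python
import difflib

def retrieve_hotwords(asr_hypothesis: str, vocabulary: list[str], k: int = 5):
    # Score each hotword by its best fuzzy match against hypothesis tokens.
    scored = [
        (max(difflib.SequenceMatcher(None, w.lower(), tok).ratio()
             for tok in asr_hypothesis.lower().split()), w)
        for w in vocabulary
    ]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

hotwords = retrieve_hotwords("call doctor smyth", ["Smith", "Smythe", "Jones"], k=2)
# Stage two injects the compact candidate set as a textual prompt.
prompt = f"Possible hotwords: {', '.join(hotwords)}. Transcribe the audio."
print(prompt)
```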

Result: Experiments show substantial keyword error rate (KER) reductions on hotword-focused test sets while maintaining sentence accuracy on general ASR benchmarks.

Conclusion: The proposed scalable two-stage framework effectively addresses large-vocabulary contextual biasing challenges in LLM-based ASR systems.

Abstract: Large language model (LLM)-based automatic speech recognition (ASR) has recently achieved strong performance across diverse tasks, yet contextual biasing for named entities and hotwords under large vocabularies remains challenging. In this work, we propose a scalable two-stage framework that integrates hotword retrieval with LLM-ASR adaptation. First, we extend the Global-Local Contrastive Language-Audio pre-trained model (GLCLAP) to retrieve a compact top-k set of hotword candidates from a large vocabulary via robustness-aware data augmentation and fuzzy matching. Second, we inject the retrieved candidates as textual prompts into an LLM-ASR model and fine-tune it with Generative Rejection-Based Policy Optimization (GRPO), using a task-driven reward that jointly optimizes hotword recognition and overall transcription accuracy. Experiments on hotword-focused test sets show substantial keyword error rate (KER) reductions while maintaining sentence accuracy on general ASR benchmarks, demonstrating the effectiveness of the proposed framework for large-vocabulary contextual biasing.

[334] Rare Word Recognition and Translation Without Fine-Tuning via Task Vector in Speech Models

Ruihao Jing, Cheng Gong, Yu Jiang, Boyu Zhu, Shansong Liu, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Main category: eess.AS

TL;DR: Training-free paradigm using task vectors for rare word recognition and translation, enabling flexible composition without fine-tuning.

DetailsMotivation: Rare words are a critical bottleneck for speech-to-text systems. Direct fine-tuning has issues: high cost, catastrophic forgetting, and limited scalability.

Method: Propose training-free paradigm based on task vectors. Define task vectors as parameter differences and introduce word-level task vector arithmetic for flexible composition of rare-word capabilities.
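
Task-vector arithmetic has a standard form (parameter differences added back to the base model, as in Ilharco et al.); the word-level composition below is a hedged reading of the summary, with state-dict checkpoints as assumed inputs.

```python
import torch

def task_vector(base_sd, finetuned_sd):
    # A task vector is the parameter difference fine-tuned minus base.
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}

def apply_vectors(base_sd, vectors, scale=1.0):
    """Compose several rare-word task vectors onto the base model."""
    merged = {k: v.clone() for k, v in base_sd.items()}
    for vec in vectors:
        for k in merged:
            merged[k] += scale * vec[k]
    return merged

# e.g. add capabilities for two rare words without any training:
# merged = apply_vectors(base_sd,
#                        [task_vector(base_sd, ft_word_a_sd),
#                         task_vector(base_sd, ft_word_b_sd)])
# model.load_state_dict(merged)
```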

Result: Matches or surpasses fine-tuned models on target words, improves general performance by ~5 BLEU, and mitigates catastrophic forgetting across multiple domains.

Conclusion: Task vector approach provides scalable, reusable solution for rare word recognition and translation without training costs or catastrophic forgetting issues.

Abstract: Rare words remain a critical bottleneck for speech-to-text systems. While direct fine-tuning improves recognition of target words, it often incurs high cost, catastrophic forgetting, and limited scalability. To address these challenges, we propose a training-free paradigm based on task vectors for rare word recognition and translation. By defining task vectors as parameter differences and introducing word-level task vector arithmetic, our approach enables flexible composition of rare-word capabilities, greatly enhancing scalability and reusability. Extensive experiments across multiple domains show that the proposed method matches or surpasses fine-tuned models on target words, improves general performance by about 5 BLEU, and mitigates catastrophic forgetting.

[335] Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie

Main category: eess.AS

TL;DR: FPO (Fine-grained Preference Optimization) enhances TTS robustness by addressing localized audio issues rather than optimizing entire utterances, using selective training loss based on fine-grained issue categorization.

DetailsMotivation: Current TTS alignment methods use utterance-level preference data, but listening experience issues often occur only in specific segments while other parts are well-generated. There's a need for more granular optimization targeting localized problems.

Method: Proposes FPO approach that: 1) analyzes and categorizes issues in generated samples into two groups, 2) uses selective training loss strategy to optimize preferences based on fine-grained labels for each issue type, focusing on localized problems rather than entire utterances.
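
A sketch of the "selective loss" idea: restrict the training signal to tokens flagged by fine-grained issue labels. The mask construction and the plain cross-entropy stand-in are illustrative assumptions; FPO's actual objective is a preference-optimization loss over paired samples.

```python
import torch
import torch.nn.functional as F

def selective_loss(logits, targets, issue_mask):
    """logits: (B, T, V); targets: (B, T); issue_mask: (B, T) bool,
    True where a fine-grained label marks a problematic segment."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")   # (B, T)
    # Average only over flagged tokens; well-generated spans contribute nothing.
    return (per_token * issue_mask).sum() / issue_mask.sum().clamp(min=1)

logits = torch.randn(2, 10, 500, requires_grad=True)
targets = torch.randint(0, 500, (2, 10))
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 3:6] = True   # a localized problem span
selective_loss(logits, targets, mask).backward()
```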

Result: FPO enhances zero-shot TTS robustness by effectively addressing local issues, significantly reduces bad case ratio, improves intelligibility, and shows superior data efficiency - achieving similar performance with fewer training samples compared to baselines.

Conclusion: Fine-grained preference optimization targeting localized audio issues is more effective than utterance-level optimization for improving TTS robustness and data efficiency, offering a promising direction for TTS system enhancement.

Abstract: Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of audio samples, while other segments are well-generated. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO focuses on addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy to optimize preferences based on fine-grained labels for each issue type. Experimental results show that FPO enhances the robustness of zero-shot TTS systems by effectively addressing local issues, significantly reducing the bad case ratio, and improving intelligibility. Furthermore, FPO exhibits superior data efficiency compared with baseline systems, achieving similar performance with fewer training samples.

eess.IV

[336] A Graph-Augmented knowledge Distillation based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI

Md Assaduzzaman, Nushrat Jahan Oyshi, Eram Mahamud

Main category: eess.IV

TL;DR: Hybrid dual-stream deep learning framework using teacher-student knowledge distillation achieves near-perfect accuracy for gastrointestinal disease classification from endoscopic imagery, balancing efficiency and diagnostic performance.

DetailsMotivation: Accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery is challenging due to large data volumes and subtle visual variations between disease classes, requiring robust AI solutions for clinical diagnostics.

Method: Teacher-student knowledge distillation framework with high-capacity teacher model combining Swin Transformer (global context) and Vision Transformer (local features), distilled to compact Tiny-ViT student network. Uses two balanced Wireless Capsule Endoscopy datasets to prevent bias.
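
Soft-label distillation, the mechanism named here, has a standard loss (Hinton et al.); the temperature, mixing weight, and four-class logits below are illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T                                  # gradient-scale correction
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 4, requires_grad=True)   # Tiny-ViT logits, 4 GI classes
teacher = torch.randn(8, 4)                       # Swin + ViT teacher logits
labels = torch.randint(0, 4, (8,))
kd_loss(student, teacher, labels).backward()
```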

Result: Achieved remarkable performance with accuracies of 0.9978 and 0.9928 on two datasets, average AUC of 1.0000 (near-perfect discrimination). Tiny-ViT maintained comparable performance to teacher with reduced computational complexity and faster inference.

Conclusion: The framework provides robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, suitable for resource-constrained clinical environments and paving way for intelligent endoscopic screening compatible with clinical practicality.

Abstract: The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher’s semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model’s predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework’s transparency and reliability. The Tiny-ViT demonstrated diagnostic performance with reduced computational complexity comparable to its transformer-based teacher while delivering faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.

[337] Enabling Ultra-Fast Cardiovascular Imaging Across Heterogeneous Clinical Environments with a Generalist Foundation Model and Multimodal Database

Zi Wang, Mingkai Huang, Zhang Shi, Hongjie Hu, Lan Lan, Hui Zhang, Yan Li, Xi Hu, Qing Lu, Zongming Zhu, Qiong Yao, Yuxiang Dai, Fanwen Wang, Yinzhe Wu, Jun Lyu, Qianqian Gao, Guangming Xu, Zhenxuan Zhang, Haosen Zhang, Qing Li, Guangming Wang, Tianxing He, Lizhen Lan, Siyue Li, Le Xue, Mengting Sun, Yuntong Lyu, Junpu Hu, Jiayu Zhu, Rizwan Ahmad, Zhengyu Bu, Xianling Qian, Guanke Cai, Ruiyu Cao, Weirui Cai, Chang Xu, Yuyang Ren, Feidan Yu, Siying Ma, Ziqiang Xu, Xinran Chen, Sha Hua, Daniel Kim, Yajing Zhang, Chen Ouyang, Wenjia Bai, Jing Qin, Yucheng Yang, Daniel Rueckert, He Wang, Qian Tao, Claudia Prieto, Michael Markl, Alistair Young, Lianming Wu, Shuo Wang, Chen Qin, Mengsu Zeng, Xihong Hu, Haibo Xu, Xiaobo Qu, Hao Li, Guang Yang, Chengyan Wang

Main category: eess.IV

TL;DR: CardioMM is a generalist reconstruction foundation model for ultra-fast cardiovascular MRI that achieves 24x acceleration while preserving diagnostic quality, supported by the largest multimodal CMR k-space database (MMCMR-427K).

DetailsMotivation: Despite decades of advancements, widespread clinical adoption of cardiovascular MRI remains limited by prolonged scan times and heterogeneity across medical environments, creating an urgent need for a generalist reconstruction foundation model for ultra-fast CMR imaging.

Method: The authors created MMCMR-427K (427,465 multi-coil k-space data with structured metadata from 13 international centers) and developed CardioMM, a generalist reconstruction foundation model that unifies semantic contextual understanding with physics-informed data consistency to adapt to heterogeneous fast CMR imaging scenarios.
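
The physics-informed data-consistency idea can be sketched in its simplest single-coil form: re-impose the measured k-space samples on the network's reconstruction. CardioMM's multi-coil, learned variant is far more elaborate; this shows only the underlying constraint.

```python
import torch

def data_consistency(image, measured_kspace, mask):
    """image: (H, W) complex; mask: (H, W) bool of sampled k-space locations."""
    k = torch.fft.fft2(image)
    k = torch.where(mask, measured_kspace, k)   # trust measurements where sampled
    return torch.fft.ifft2(k)

H = W = 64
image = torch.randn(H, W, dtype=torch.complex64)          # network output
measured = torch.fft.fft2(torch.randn(H, W, dtype=torch.complex64))
mask = torch.rand(H, W) < 1 / 24                          # ~24x undersampling
refined = data_consistency(image, measured, mask)
```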

Result: CardioMM achieves state-of-the-art performance in internal centers and exhibits strong zero-shot generalization to unseen external settings, reliably preserving key cardiac phenotypes, quantitative myocardial biomarkers, and diagnostic image quality even at 24x acceleration.

Conclusion: The open-access MMCMR-427K database and CardioMM framework establish a scalable pathway toward high-throughput, high-quality, and clinically accessible cardiovascular imaging, enabling substantial increases in CMR examination throughput without compromising clinical integrity.

Abstract: Multimodal cardiovascular magnetic resonance (CMR) imaging provides comprehensive and non-invasive insights into cardiovascular disease (CVD) diagnosis and underlying mechanisms. Despite decades of advancements, its widespread clinical adoption remains constrained by prolonged scan times and heterogeneity across medical environments. This underscores the urgent need for a generalist reconstruction foundation model for ultra-fast CMR imaging, one capable of adapting across diverse imaging scenarios and serving as the essential substrate for all downstream analyses. To enable this goal, we curate MMCMR-427K, the largest and most comprehensive multimodal CMR k-space database to date, comprising 427,465 multi-coil k-space data paired with structured metadata across 13 international centers, 12 CMR modalities, 15 scanners, and 17 CVD categories in populations across three continents. Building on this unprecedented resource, we introduce CardioMM, a generalist reconstruction foundation model capable of dynamically adapting to heterogeneous fast CMR imaging scenarios. CardioMM unifies semantic contextual understanding with physics-informed data consistency to deliver robust reconstructions across varied scanners, protocols, and patient presentations. Comprehensive evaluations demonstrate that CardioMM achieves state-of-the-art performance in the internal centers and exhibits strong zero-shot generalization to unseen external settings. Even at imaging acceleration up to 24x, CardioMM reliably preserves key cardiac phenotypes, quantitative myocardial biomarkers, and diagnostic image quality, enabling a substantial increase in CMR examination throughput without compromising clinical integrity. Together, our open-access MMCMR-427K database and CardioMM framework establish a scalable pathway toward high-throughput, high-quality, and clinically accessible cardiovascular imaging.

[338] RT-Focuser: A Real-Time Lightweight Model for Edge-side Image Deblurring

Zhuoyu Wu, Wenhui Ou, Qiawei Zheng, Jiayan Yang, Quanjun Wang, Wenqi Fang, Zheng Wang, Yongkui Yang, Heshan Li

Main category: eess.IV

TL;DR: RT-Focuser is a lightweight U-shaped network for real-time motion deblurring that achieves 30.67 dB PSNR with only 5.85M parameters and 15.76 GMACs, running at 6ms per frame on GPU and mobile devices.

DetailsMotivation: Motion blur from camera or object movement degrades image quality and creates challenges for real-time applications like autonomous driving, UAV perception, and medical imaging that require fast processing.

Method: A lightweight U-shaped network with three key components: Lightweight Deblurring Block (LD) for edge-aware feature extraction, Multi-Level Integrated Aggregation module (MLIA) for encoder integration, and Cross-source Fusion Block (X-Fuse) for progressive decoder refinement.

Result: Achieves 30.67 dB PSNR with only 5.85M parameters and 15.76 GMACs, and runs at 6 ms per frame (over 140 FPS) on both GPU and mobile, demonstrating strong edge-deployment potential.

Conclusion: RT-Focuser provides an effective balance between speed and accuracy for real-time deblurring, making it suitable for deployment on edge devices in applications requiring fast image processing.

Abstract: Motion blur caused by camera or object movement severely degrades image quality and poses challenges for real-time applications such as autonomous driving, UAV perception, and medical imaging. In this paper, a lightweight U-shaped network tailored for real-time deblurring is presented and named RT-Focuser. To balance speed and accuracy, we design three key components: Lightweight Deblurring Block (LD) for edge-aware feature extraction, Multi-Level Integrated Aggregation module (MLIA) for encoder integration, and Cross-source Fusion Block (X-Fuse) for progressive decoder refinement. Trained on a single blurred input, RT-Focuser achieves 30.67 dB PSNR with only 5.85M parameters and 15.76 GMACs. It runs in 6 ms per frame on GPU and mobile, exceeding 140 FPS on both, showing strong potential for deployment on the edge. The official code and usage are available on: https://github.com/ReaganWu/RT-Focuser.

[339] The Color-Clinical Decoupling: Why Perceptual Calibration Fails Clinical Biomarkers in Smartphone Dermatology

Sungwoo Kang

Main category: eess.IV

TL;DR: Color calibration reduces color errors but fails to ensure reliable clinical biomarker measurements across devices, especially for underrepresented skin phototypes, due to “color-clinical decoupling” where perceptual accuracy doesn’t translate to biomarker reliability.

Motivation: To test whether standard colorimetric calibration ensures clinical reliability for smartphone-based tele-dermatology, particularly for underrepresented skin phototypes (Fitzpatrick III-IV), since this assumption remains untested despite widespread use.

Method: Analyzed 43,425 images from 965 Korean subjects across DSLR, tablet, and smartphone devices. Used Linear Color Correction Matrix (CCM) normalization to reduce color errors, then evaluated biomarker reliability using Individual Typology Angle (ITA) and Melanin Index with inter-device agreement metrics (ICC). Investigated anatomical variance effects on color measurements.

Result: Color correction reduced color error by 67-77%, achieving near-clinical accuracy (Delta E < 2.3), but biomarker reliability varied: ITA showed poor inter-device agreement (ICC = 0.40) while the Melanin Index achieved good agreement (ICC = 0.77). Facial region accounted for 25.2% of color variance (3.6x greater than device effects at 7.0%), revealing a “color-clinical decoupling” phenomenon.

Conclusion: Current colorimetric standards are insufficient for clinical-grade biomarker extraction in mobile dermatology. Single-patch calibration fails due to anatomical variance effects, necessitating region-aware protocols for reliable clinical measurements across devices.

Abstract: Smartphone-based tele-dermatology assumes that colorimetric calibration ensures clinical reliability, yet this remains untested for underrepresented skin phototypes. We investigated whether standard calibration translates to reliable clinical biomarkers using 43,425 images from 965 Korean subjects (Fitzpatrick III-IV) across DSLR, tablet, and smartphone devices. While Linear Color Correction Matrix (CCM) normalization reduced color error by 67-77% – achieving near-clinical accuracy (Delta E < 2.3) – this success did not translate to biomarker reliability. We identify a phenomenon termed “color-clinical decoupling”: despite perceptual accuracy, the Individual Typology Angle (ITA) showed poor inter-device agreement (ICC = 0.40), while the Melanin Index achieved good agreement (ICC = 0.77). This decoupling is driven by the ITA formula’s sensitivity to b* channel noise and is further compounded by anatomical variance. Facial region accounts for 25.2% of color variance – 3.6x greater than device effects (7.0%) – challenging the efficacy of single-patch calibration. Our results demonstrate that current colorimetric standards are insufficient for clinical-grade biomarker extraction, necessitating region-aware protocols for mobile dermatology.
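
Two quantities in this study have standard, reproducible definitions: a linear CCM is typically fit by least squares on color-chart patches, and ITA is computed from CIELAB coordinates as arctan((L* − 50)/b*) expressed in degrees. A minimal sketch, assuming linear RGB patch values; helper names are illustrative:

```python
import numpy as np

def fit_ccm(measured, reference):
    """Least-squares 3x3 color correction matrix.
    measured, reference: (N, 3) linear RGB values of the same N chart patches."""
    M, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return M  # apply as: corrected = rgb @ M

def ita_degrees(L, b):
    """Individual Typology Angle from CIELAB L* and b* (standard definition)."""
    return np.degrees(np.arctan2(L - 50.0, b))

# Toy usage: recover a synthetic device transform from 24 simulated patches.
rng = np.random.default_rng(1)
ref = rng.uniform(0, 1, (24, 3))
device = np.array([[0.9, 0.05, 0.0],
                   [0.1, 0.8,  0.1],
                   [0.0, 0.1,  0.9]])
meas = ref @ device + 0.01 * rng.standard_normal((24, 3))
M = fit_ccm(meas, ref)
print(np.abs(meas @ M - ref).mean())  # small residual after correction

print(ita_degrees(L=60.0, b=15.0))    # ~33.7 deg, an "intermediate" phototype
```

The division by b* in the ITA formula is why the b* channel noise mentioned in the abstract destabilizes ITA far more than it affects perceptual Delta E.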

[340] Rewards-based image analysis in microscopy

Kamyar Barakati, Yu Liu, Utkarsh Pratiush, Boris N. Slautin, Sergei V. Kalinin

Main category: eess.IV

TL;DR: The paper discusses reward-based workflows for autonomous scientific imaging analysis, moving beyond traditional expert-designed pipelines and supervised ML toward explainable, unsupervised optimization.

Motivation: Traditional imaging analysis relies on expert-designed multistep workflows or supervised ML methods that require significant human intervention (hyperparameter tuning, data labeling). There's a need for more autonomous approaches that can guide algorithms toward optimal data representations for decision-making.

Method: Reward-based workflows that capture elements of human reasoning and exhibit transferability across tasks. These frameworks shift from supervised black-box models toward explainable, unsupervised optimization, demonstrated through Scanning Probe and Electron Microscopy examples.

Result: Reward-driven approaches enable autonomous scientific imaging analysis with strong transferability across various tasks, supporting classification, regression, structure-property mapping, and hyperspectral data processing.

Conclusion: Reward-based frameworks represent a promising direction for achieving the next level of autonomy in scientific imaging, moving beyond traditional expert-dependent approaches toward explainable, unsupervised optimization that can be applied broadly across scientific domains.

Abstract: Imaging and hyperspectral data analysis is central to progress across biology, medicine, chemistry, and physics. The core challenge lies in converting high-resolution or high-dimensional datasets into interpretable representations that enable insight into the underlying physical or chemical properties of a system. Traditional analysis relies on expert-designed, multistep workflows, such as denoising, feature extraction, clustering, dimensionality reduction, and physics-based deconvolution, or on machine learning (ML) methods that accelerate individual steps. Both approaches, however, typically demand significant human intervention, including hyperparameter tuning and data labeling. Achieving the next level of autonomy in scientific imaging requires designing effective reward-based workflows that guide algorithms toward the best data representation for human or automated decision-making. Here, we discuss recent advances in reward-based workflows for image analysis, which capture key elements of human reasoning and exhibit strong transferability across various tasks. We highlight how reward-driven approaches enable a shift from supervised black-box models toward explainable, unsupervised optimization, using examples from Scanning Probe and Electron Microscopy. Such reward-based frameworks are promising for a broad range of applications, including classification, regression, structure-property mapping, and general hyperspectral data processing.
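
The reward-driven pattern described above can be made concrete with a toy example: define an unsupervised reward, sweep a workflow hyperparameter, and keep the reward-maximizing setting. The reward's form and weighting below are assumptions for illustration, not the authors' functions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reward(denoised, noisy, lam=3.0):
    """Illustrative unsupervised reward: stay faithful to the data while
    penalizing residual roughness (a proxy for leftover noise).
    The functional form and the weight lam are assumptions."""
    fidelity = -np.mean((denoised - noisy) ** 2)
    gx, gy = np.gradient(denoised)
    roughness = np.mean(gx ** 2 + gy ** 2)
    return fidelity - lam * roughness

# Reward-driven selection of a single workflow hyperparameter (blur strength).
rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 8, 256)), np.cos(np.linspace(0, 8, 256)))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
sigmas = np.linspace(0.1, 5.0, 25)
best = max(sigmas, key=lambda s: reward(gaussian_filter(noisy, s), noisy))
print(f"reward-selected sigma: {best:.2f}")
```

No labels or ground truth enter the loop; this is the sense in which such workflows shift from supervised black-box models toward unsupervised optimization.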

[341] GroundGazer: Camera-based indoor localization of mobile robots with millimeter accuracy at low cost

Sven Hinderer, Jakob Hüsken, Bohan Sun, Bin Yang

Main category: eess.IV

TL;DR: GroundGazer is a low-cost, high-accuracy indoor localization system for robots using a monocular camera and chessboard floor pattern to achieve mm positioning and sub-degree heading accuracy.

Motivation: Existing high-accuracy indoor localization systems (LiDAR, tachymeters, motion capture) are expensive, while affordable alternatives lack the precision needed for autonomous mobile robots requiring mm-level accuracy.

Method: Uses a monocular (fisheye) camera mounted on the robot, a chessboard pattern on the floor, and optional laser diode. The system analyzes the floor pattern to estimate position and heading with high precision.

Result: Achieves mm-level position accuracy and sub-degree heading accuracy, offering a low-cost alternative to expensive systems while maintaining high precision for autonomous mobile robots.

Conclusion: GroundGazer provides a simple, low-cost, portable, and scalable solution for high-accuracy indoor localization that is robust, easy to set up, and potentially extensible to 3D position and orientation estimation.

Abstract: Highly accurate indoor localization systems with mm positioning accuracy are currently very expensive. They include range finders (such as LiDAR), tachymeters, and motion capture systems relying on multiple high-end cameras. In this work, we introduce a high-accuracy, planar indoor localization system named GroundGazer (GG) for autonomous mobile robots (AMRs). GG estimates the AMR’s position with mm and its heading with sub-degree accuracy. The system requires only a monocular (fisheye) camera, a chessboard floor, and an optional laser diode. Our system is simple and low-cost, easy to set up, portable, robust, scalable to large areas and robot swarms, and potentially extendable to 3D position and orientation estimation.
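
A pipeline of this kind can be approximated with standard OpenCV calls: detect the floor chessboard's inner corners, refine them to sub-pixel accuracy, and solve a PnP problem against the known board geometry. The sketch below assumes a rectified pinhole image with made-up intrinsics `K`; the paper's fisheye model, laser diode, and mm-level refinements are not reproduced.

```python
import cv2
import numpy as np

# Hypothetical intrinsics; real use requires calibration (cv2.calibrateCamera).
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)            # assume an already-undistorted (rectified) image
SQUARE_M = 0.10               # chessboard square size in meters (assumed)
PATTERN = (9, 6)              # inner-corner grid of the floor pattern (assumed)

# 3D corner coordinates on the floor plane (z = 0), in the board frame.
obj = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_M

def estimate_pose(gray):
    """Return (x, y, heading_deg) of the camera over the floor, or None."""
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        return None
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    _, rvec, tvec = cv2.solvePnP(obj, corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)
    cam_pos = (-R.T @ tvec).ravel()   # camera position in the board/floor frame
    heading = np.degrees(np.arctan2(R[1, 0], R[0, 0]))  # in-plane yaw component
    return cam_pos[0], cam_pos[1], heading

# Usage: gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)
#        print(estimate_pose(gray))   # "frame.png" is a placeholder
```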

[342] An on-chip Pixel Processing Approach with 2.4μs latency for Asynchronous Read-out of SPAD-based dToF Flash LiDARs

Yiyang Liu, Rongxuan Zhang, Istvan Gyongy, Alistair Gorman, Sarrah M. Patanwala, Filip Taneski, Robert K. Henderson

Main category: eess.IV

TL;DR: Fully asynchronous peak detection for SPAD-based dToF LiDAR enables pixel-wise event-driven depth acquisition without global sync, reducing latency and motion blur while increasing effective frame rate.

Motivation: Traditional frame-based LiDAR systems suffer from latency, motion blur, and redundant background data due to global synchronization. There's a need for more efficient, low-latency depth sensing for robotics, autonomous driving, and consumer applications.

Method: Proposes asynchronous peak detection where pixels independently report depth once sufficient signal-to-noise ratio is achieved. Validated with two hardware implementations: offline 256×128 SPAD array with PC processing, and real-time FPGA prototype with 2.4μs latency for on-chip integration.

Result: Demonstrates robust depth estimation, reflectivity reconstruction, and dynamic event-based representation under static/dynamic conditions. Asynchronous operation reduces redundant background data and computational load while remaining tunable via hyperparameters.

Conclusion: Establishes foundation for compact, low-latency, event-driven LiDAR architectures. Also provides semi-closed-form solution for detection probability that benefits both conventional frame-based and proposed asynchronous systems.

Abstract: We propose a fully asynchronous peak detection approach for SPAD-based direct time-of-flight (dToF) flash LiDAR, enabling pixel-wise event-driven depth acquisition without global synchronization. By allowing pixels to independently report depth once a sufficient signal-to-noise ratio is achieved, the method reduces latency, mitigates motion blur, and increases the effective frame rate compared to frame-based systems. The framework is validated under two hardware implementations: an offline 256×128 SPAD array with PC-based processing and a real-time FPGA proof-of-concept prototype with 2.4μs latency for on-chip integration. Experiments demonstrate robust depth estimation, reflectivity reconstruction, and dynamic event-based representation under both static and dynamic conditions. The results confirm that asynchronous operation reduces redundant background data and computational load, while remaining tunable via simple hyperparameters. These findings establish a foundation for compact, low-latency, event-driven LiDAR architectures suited to robotics, autonomous driving, and consumer applications. In addition, we derive a semi-closed-form solution for the detection probability of raw-peak-finding LiDAR systems that could benefit both conventional frame-based and the proposed asynchronous systems.
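
A rough sketch of the per-pixel asynchronous idea, under simplifying assumptions (one ToF histogram per pixel, a median-based background estimate, Poisson-like noise): each pixel accumulates photon timestamps and emits a depth event the moment its histogram peak clears an SNR threshold, with no global frame boundary. All constants are illustrative, not the paper's values.

```python
import numpy as np

BIN_PS, N_BINS, SNR_MIN = 100, 1024, 5.0   # bin width (ps), bins, SNR threshold
C = 299_792_458                            # speed of light, m/s

class AsyncPixel:
    def __init__(self):
        self.hist = np.zeros(N_BINS)

    def add_photon(self, bin_idx):
        """Accumulate one detection; return a depth event once SNR suffices."""
        self.hist[bin_idx] += 1
        peak = self.hist.max()
        bkg = np.median(self.hist)             # crude per-bin background level
        noise = np.sqrt(max(bkg, 1.0))         # Poisson-like noise estimate
        if (peak - bkg) / noise >= SNR_MIN:
            tof_s = self.hist.argmax() * BIN_PS * 1e-12
            self.hist[:] = 0                   # reset and keep listening
            return 0.5 * C * tof_s             # round-trip time -> distance (m)
        return None                            # keep accumulating silently

# Toy usage: 5% signal photons in bin 300 on top of a uniform background.
rng = np.random.default_rng(0)
px = AsyncPixel()
for _ in range(2000):
    b = 300 if rng.random() < 0.05 else int(rng.integers(N_BINS))
    depth = px.add_photon(b)
    if depth is not None:
        print(f"depth event: {depth:.3f} m")   # ~4.5 m for bin 300 at 100 ps bins
        break
```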

[343] GCVAMD: A Causal Analysis Model for Age-Related Macular Degeneration

Daeyoung Kim

Main category: eess.IV

TL;DR: GCVAMD is a novel causal analysis model for Age-Related Macular Degeneration that uses modified CausalVAE to extract latent causal factors from OCT images, enabling causal inference and intervention analysis for AMD risk factors.

Motivation: Current deep learning methods for AMD detection focus only on prediction performance without understanding underlying causal mechanisms, which limits intervention analysis and reliability. There's a need for models that can identify causal factors like drusen and neovascularization to enable treatment simulation and better clinical decision-making.

Method: GCVAMD incorporates a modified CausalVAE approach that extracts latent causal factors directly from raw OCT images. The model enables causal inference for AMD risk factors, specifically drusen and neovascularization, allowing for treatment simulation and intervention analysis while providing informative latent features for downstream tasks.

Result: GCVAMD successfully identifies drusen status and neovascularization status with AMD causal mechanisms in its latent spaces. These extracted causal factors can be used for various applications including AMD classification/detection and intervention analysis.

Conclusion: GCVAMD represents a significant advancement in AMD analysis by incorporating causality into the detection process, enabling not just prediction but also understanding of underlying mechanisms and supporting clinical decision-making through intervention analysis.

Abstract: Age-Related Macular Degeneration (AMD) is one of the leading causes of permanent vision impairment in ophthalmology. Although treatments such as anti-VEGF drugs and photodynamic therapy have been developed to slow the degenerative process, there is still no cure that reverses vision loss caused by AMD. Detecting risk factors of AMD, or AMD itself, in a patient's retina at an early stage is therefore crucial to reducing the likelihood of vision impairment. Beyond traditional approaches, deep learning methods, especially attention-based CNNs and Grad-CAM-based XAI analysis of OCT scans, have shown strong performance in distinguishing AMD retinas from normal ones, making it possible to use AI-driven models to aid ophthalmologists in AMD diagnosis and analysis. Despite this success, however, previous works mostly focused on prediction performance itself rather than on the pathologies or underlying causal mechanisms of AMD, which can preclude intervention analysis on specific factors or even lead to less reliable decisions. This paper therefore introduces a novel causal AMD analysis model, GCVAMD, which incorporates a modified CausalVAE approach that extracts latent causal factors from raw OCT images alone. By considering causality in AMD detection, GCVAMD enables causal inference, such as treatment simulation and intervention analysis, for the major risk factors drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that GCVAMD identifies drusen and neovascularization status, together with AMD causal mechanisms, in its latent spaces, which can in turn be used for tasks ranging from AMD detection (classification) to intervention analysis.
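
CausalVAE-style models place a linear structural causal model (SCM) over the latent factors, which is what makes interventions such as "simulate removing drusen" expressible at all. The PyTorch sketch below shows that mechanism in miniature; the factor ordering, stub encoder, and do-operator handling are assumptions for illustration, not GCVAMD's architecture.

```python
import torch
import torch.nn as nn

class CausalLayer(nn.Module):
    """Toy linear SCM over latent factors: z_j = sum_i A[i, j] * z_i + eps_j,
    with A a learned adjacency. The factor order (drusen, neovascularization,
    AMD) is an illustrative assumption."""
    def __init__(self, n_factors=3):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(n_factors, n_factors))

    def forward(self, eps, intervene=None):
        A = torch.triu(self.A, diagonal=1)  # upper-triangular => acyclic graph
        zs = []
        for j in range(A.shape[0]):         # solve factors in topological order
            z_j = eps[:, j] + sum(zs[i] * A[i, j] for i in range(j))
            if intervene is not None and intervene[0] == j:
                z_j = torch.full_like(z_j, float(intervene[1]))  # do(z_j = v)
            zs.append(z_j)                  # downstream factors see the clamp
        return torch.stack(zs, dim=1)

# Toy usage: a stub encoder maps an OCT patch to exogenous noise eps, then we
# compare factual factors with a simulated intervention do(drusen = 0).
enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 3))  # stand-in encoder
oct_batch = torch.randn(4, 1, 64, 64)
eps = enc(oct_batch)
layer = CausalLayer()
z_factual = layer(eps)
z_do = layer(eps, intervene=(0, 0.0))     # index 0 = drusen (assumed)
print(z_factual.shape, z_do.shape)        # torch.Size([4, 3]) twice
```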
