Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 141]
cs.CV [Total: 246]
cs.AI [Total: 91]
cs.SD [Total: 19]
cs.LG [Total: 235]
cs.MA [Total: 4]
cs.MM [Total: 1]
eess.AS [Total: 7]
eess.IV [Total: 18]

cs.CL

[1] The Qualitative Laboratory: Theory Prototyping and Hypothesis Generation with Large Language Models

Hugues Draelants

Main category: cs.CL

TL;DR: Sociological persona simulation using LLMs as a “qualitative laboratory” to generate rich hypotheses about how social groups interpret information, overcoming limitations of existing methods.

Details

Motivation: Social science needs better methods to generate nuanced qualitative hypotheses about diverse social groups' interpretations of new information. Existing methods like vignette surveys lack discursive depth, while rule-based agent-based models face formalization bottlenecks.

Method: Sociological persona simulation using Large Language Models (LLMs) - creating personas derived from sociological theories that react to policy messages through naturalistic discourse generation.

Result: The simulation produced nuanced and counter-intuitive hypotheses, such as a conservative persona rejecting a national security frame for climate policy, challenging theoretical assumptions.

Conclusion: LLM-based persona simulation represents a superior tool for generating deeply textured hypotheses when used in a “simulation then validation” workflow for subsequent empirical testing.

Abstract: A central challenge in social science is to generate rich qualitative hypotheses about how diverse social groups might interpret new information. This article introduces and illustrates a novel methodological approach for this purpose: sociological persona simulation using Large Language Models (LLMs), which we frame as a “qualitative laboratory”. We argue that for this specific task, persona simulation offers a distinct advantage over established methods. By generating naturalistic discourse, it overcomes the lack of discursive depth common in vignette surveys, and by operationalizing complex worldviews through natural language, it bypasses the formalization bottleneck of rule-based agent-based models (ABMs). To demonstrate this potential, we present a protocol where personas derived from a sociological theory of climate reception react to policy messages. The simulation produced nuanced and counter-intuitive hypotheses - such as a conservative persona’s rejection of a national security frame - that challenge theoretical assumptions. We conclude that this method, used as part of a “simulation then validation” workflow, represents a superior tool for generating deeply textured hypotheses for subsequent empirical testing.

[2] Rate-Distortion Analysis of Compressed Query Delegation with Low-Rank Riemannian Updates

Faruk Alpay, Bugra Kilictas

Main category: cs.CL

TL;DR: Compressed Query Delegation (CQD) is a method that compresses reasoning states into low-rank tensor queries, delegates them to external oracles, and updates via Riemannian optimization to overcome bounded-context limitations in agents.

Details

Motivation: Bounded-context agents fail when intermediate reasoning exceeds working memory limits. The paper addresses this limitation by developing a framework that can effectively delegate compressed queries to external oracles while maintaining reasoning integrity.

Method: CQD involves three steps: (1) compress high-dimensional latent reasoning states into low-rank tensor queries, (2) delegate minimal queries to external oracles, and (3) update latent states via Riemannian optimization on fixed-rank manifolds. The approach is formulated as a constrained stochastic program with query-budget functional and noisy oracle modeling.

Result: Theoretical results show spectral hard-thresholding is optimal for constrained quadratic distortion problems, with convergence guarantees for Riemannian stochastic approximation under bounded noise. Empirical evaluation includes a 2,500-item reasoning suite showing CQD outperforms chain-of-thought baselines, and a human benchmark (N=200) measuring epistemic gain and semantic drift.

Conclusion: CQD provides a principled framework for overcoming bounded-context limitations in reasoning agents through compressed query delegation, with both theoretical guarantees and empirical validation across synthetic and human benchmarks.

Abstract: Bounded-context agents fail when intermediate reasoning exceeds an effective working-memory budget. We study compressed query delegation (CQD): (i) compress a high-dimensional latent reasoning state into a low-rank tensor query, (ii) delegate the minimal query to an external oracle, and (iii) update the latent state via Riemannian optimization on fixed-rank manifolds. We give a math-first formulation: CQD is a constrained stochastic program with a query-budget functional and an oracle modeled as a noisy operator. We connect CQD to classical rate-distortion and information bottleneck principles, showing that spectral hard-thresholding is optimal for a natural constrained quadratic distortion problem, and we derive convergence guarantees for Riemannian stochastic approximation under bounded oracle noise and smoothness assumptions. Empirically, we report (A) a 2,500-item bounded-context reasoning suite (BBH-derived tasks plus curated paradox instances) comparing CQD against chain-of-thought baselines under fixed compute and context; and (B) a human “cognitive mirror” benchmark (N=200) measuring epistemic gain and semantic drift across modern oracles.

[3] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

Patricio Vera

Main category: cs.CL

TL;DR: The paper introduces “intention collapse” - the compression of rich internal states into token sequences - and proposes metrics to study how inference-time computation shapes internal intentions before verbalization in language models.

Details

Motivation: Language generation involves compressing high-dimensional internal states into single token sequences, but current evaluation focuses only on final outputs. The paper aims to develop methods to study the internal "intentions" before they collapse into language.

Method: Formalizes intention collapse for language models, defines three model-agnostic intention metrics (intention entropy, effective dimensionality, latent knowledge recoverability), and proposes an empirical study framework. Tests with 4-bit Mistral 7B on 200 GSM8K problems comparing direct answer, chain-of-thought, and babble control regimes.

Result: Chain-of-thought raised accuracy from 5.5% to 53%, reduced pre-collapse intention entropy (1.42 to 0.37 bits), and showed higher effective dimensionality despite fewer tokens than babble. Linear probe on internal states achieved AUROC 0.65 in CoT but chance-level in baseline, showing latent information partly lost during collapse.

Conclusion: Intention-level metrics can distinguish inference regimes and expose latent information lost during collapse, but current proxies have limitations. The framework enables studying how computation shapes internal intentions before verbalization.

Abstract: Every act of language generation compresses a rich internal state into a single token sequence. We call this process intention collapse: a many-to-one projection from a high dimensional intention space I into an external language space L. We formalize intention collapse for contemporary language models, define three simple, model agnostic intention metrics (intention entropy Hint, effective dimensionality dimeff, and latent knowledge recoverability Recov), and propose an empirical agenda for studying how inference time computation shapes internal intentions before they are verbalized. We also report a first small scale experiment. Using a 4 bit Mistral 7B model on 200 GSM8K problems, we compare a direct answer baseline, a chain of thought (CoT) regime, and a babble control. CoT raises accuracy from 5.5 percent to 53 percent, sharply reduces pre collapse intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality than the other regimes despite producing fewer tokens than babble. At the same time, Hint has little item level predictive power, and a linear probe on I achieves AUROC 0.65 in the CoT regime but only about chance in the baseline regime, where it collapses to the majority class. These preliminary results indicate that intention level metrics can distinguish inference regimes and expose latent information that is partly lost during collapse, while also revealing important limitations of our current proxies

[4] HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery

Shiyuan Liu, Jianwei Wang, Xuemin Lin, Lu Qin, Wenjie Zhang, Ying Zhang

Main category: cs.CL

TL;DR: HyperJoin: LLM-augmented hypergraph framework for joinable table discovery that models structural interactions and ensures result coherence.

Details

Motivation: Existing language model-based methods for joinable table discovery insufficiently account for structural interactions - they model tables as isolated/pairwise columns offline and ignore candidate interactions online, leading to incoherent results.

Method: 1) Construct hypergraph with intra-table and LLM-augmented inter-table hyperedges; 2) Formulate task as hypergraph link prediction; 3) Design HIN (Hierarchical Interaction Network) for column representation learning via bidirectional message passing; 4) Cast online ranking as coherence-aware top-k selection with reranking using maximum spanning tree algorithm.

Result: HyperJoin achieves average improvements of 21.4% (Precision@15) and 17.2% (Recall@15) over the best baseline.

Conclusion: HyperJoin effectively addresses structural interaction limitations in joinable table discovery through hypergraph modeling and coherence-aware ranking, demonstrating significant performance improvements.

Abstract: As a pivotal task in data lake management, joinable table discovery has attracted widespread interest. While existing language model-based methods achieve remarkable performance by combining offline column representation learning with online ranking, their design insufficiently accounts for the underlying structural interactions: (1) offline, they directly model tables into isolated or pairwise columns, thereby struggling to capture the rich inter-table and intra-table structural information; and (2) online, they rank candidate columns based solely on query-candidate similarity, ignoring the mutual interactions among the candidates, leading to incoherent result sets. To address these limitations, we propose HyperJoin, a large language model (LLM)-augmented Hypergraph framework for Joinable table discovery. Specifically, we first construct a hypergraph to model tables using both the intra-table hyperedges and the LLM-augmented inter-table hyperedges. Consequently, the task of joinable table discovery is formulated as link prediction on this constructed hypergraph. We then design HIN, a Hierarchical Interaction Network that learns expressive column representations through bidirectional message passing over columns and hyperedges. To strengthen coherence and internal consistency in the result columns, we cast online ranking as a coherence-aware top-k column selection problem. We then introduce a reranking module that leverages a maximum spanning tree algorithm to prune noisy connections and maximize coherence. Experiments demonstrate the superiority of HyperJoin, achieving average improvements of 21.4% (Precision@15) and 17.2% (Recall@15) over the best baseline.

[5] Multi-Dimensional Prompt Chaining to Improve Open-Domain Dialogue Generation

Livia Leong Hui Teng

Main category: cs.CL

TL;DR: A multi-dimensional prompt-chaining framework improves open-domain dialogue quality in small language models, making them competitive with much larger models.

Details

Motivation: Small language models (SLMs) have deployment advantages but struggle to match larger models' dialogue quality in open-domain settings. There's a need for resource-efficient methods to enhance SLM performance without scaling model size.

Method: Proposed a multi-dimensional prompt-chaining framework integrating Naturalness, Coherence, and Engagingness dimensions. Applied to TinyLlama and Llama-2-7B, benchmarked against larger models (Llama-2-70B, GPT-3.5 Turbo). Used automatic and human evaluation for diversity, contextual coherence, and overall quality assessment.

Result: The full framework improved response diversity by up to 29%, contextual coherence by up to 28%, and engagingness/naturalness by up to 29%. Llama-2-7B achieved performance comparable to substantially larger models like Llama-2-70B and GPT-3.5 Turbo.

Conclusion: Carefully designed prompt-based strategies provide an effective and resource-efficient pathway to improving open-domain dialogue quality in SLMs, enabling them to compete with much larger models.

Abstract: Small language models (SLMs) offer significant deployment advantages but often struggle to match the dialogue quality of larger models in open-domain settings. In this paper, we propose a multi-dimensional prompt-chaining framework that integrates Naturalness, Coherence, and Engagingness dimensions to enhance human-likeness in open-domain dialogue generation. We apply the framework to two SLMs, TinyLlama and Llama-2-7B, and benchmark their performance against responses generated by substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. We then employ automatic and human evaluation to assess the responses based on diversity, contextual coherence, as well as overall quality. Results show that the full framework improves response diversity by up to 29%, contextual coherence by up to 28%, and engagingness as well as naturalness by up to 29%. Notably, Llama-2-7B achieves performance comparable to substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. Overall, the findings demonstrate that carefully designed prompt-based strategies provide an effective and resource-efficient pathway to improving open-domain dialogue quality in SLMs.

[6] KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs

Yixuan Tang, Yi Yang

Main category: cs.CL

TL;DR: KV-Embedding activates frozen LLMs’ latent representation power by re-routing final token KV states as prefixes, enabling sequence-level context access in single forward pass without training.

Details

Motivation: LLMs have structural limitations in training-free settings: causal attention restricts early tokens from accessing subsequent context, and next-token prediction objective biases representations toward generation rather than semantic compression.

Method: Proposes KV-Embedding framework that leverages observation that key-value (KV) states of final token at each layer encode compressed sequence view. Re-routes these states as prepended prefix to enable all tokens to access sequence-level context. Introduces automated layer selection strategy based on intrinsic dimensionality for model-agnostic applicability.

Result: Outperforms existing training-free baselines by up to 10% on MTEB across Qwen, Mistral, and Llama backbones. Maintains robust performance on sequences up to 4,096 tokens.

Conclusion: Internal state manipulation offers efficient alternative to input modification for representation learning. This work encourages further exploration of LLM internals for representation learning.

Abstract: While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.

[7] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long

Main category: cs.CL

TL;DR: Enhanced LLM-based ASR framework with cross-attention fusion of Whisper and mHuBERT encoders achieves competitive results on MLC-SLM Challenge but still underperforms fine-tuned E2E Whisper models.

Details

Motivation: Address limitations of previous SHNU-mASR system: simple feature concatenation fails to fully exploit complementary information, and performance gap between LLM-based ASR and E2E encoder-decoder ASR remains unexplored.

Method: Enhanced LLM-based ASR framework combining fine-tuned Whisper and mHuBERT encoders with LLM. Evaluated E2E Whisper models with LoRA and full fine-tuning, then proposed cross-attention-based fusion mechanisms for parallel-speech-encoder architecture.

Result: Achieved CER/WER of 10.69% on MLC-SLM Challenge evaluation set, ranking on par with top Track 1 systems despite using only 1,500 hours of training data vs. competitors’ large-scale datasets.

Conclusion: LLM-based ASR still doesn’t match performance of fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Code is publicly available.

Abstract: The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end(E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.

[8] Unsupervised Text Style Transfer for Controllable Intensity

Shuhuan Gu, Wenbiao Tao, Xinchen Ma, Kangkang He, Ye Guo, Xiang Li, Yunshi Lan

Main category: cs.CL

TL;DR: This paper proposes a two-stage SFT-then-PPO approach for unsupervised text style transfer with controllable intensity, using synthesized parallel data for fine-tuning followed by PPO with hierarchical reward functions to distinguish subtle intensity differences.

Details

Motivation: Unsupervised Text Style Transfer (UTST) with controllable intensity is challenging due to subtle differences in stylistic features across intensity levels and the lack of parallel data. Existing approaches struggle with distinguishing adjacent intensity levels.

Method: Proposes a SFT-then-PPO paradigm: 1) Fine-tune LLM with synthesized parallel data, 2) Further train with PPO using hierarchical reward functions that consider both global and local stylistic features to distinguish intensity levels.

Result: Experiments on two UTST benchmarks show that both reward functions have advantages and applying them to LLM fine-tuning effectively improves performance across various metrics. The method produces noticeable stylistic differences even between close intensity levels.

Conclusion: The proposed SFT-then-PPO approach with hierarchical reward functions successfully addresses the challenges of UTST with controllable intensity, enabling effective style transfer with subtle intensity distinctions without requiring parallel data.

Abstract: Unsupervised Text Style Transfer (UTST) aims to build a system to transfer the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST for controllable intensity is more challenging due to the subtle differences in stylistic features across different intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability between adjacent intensity levels, we propose a SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are elaborately designed for distinguishing the stylistic intensity in hierarchical levels. Both the global and local stylistic features are considered to formulate the reward functions. The experiments on two UTST benchmarks showcase that both rewards have their advantages and applying them to LLM fine-tuning can effectively improve the performance of an LLM backbone based on various evaluation metrics. Even for close levels of intensity, we can still observe the noticeable stylistic difference between the generated text.

[9] Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation

Steffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer

Main category: cs.CL

TL;DR: Novel hierarchical topic segmentation for speech transcripts using LLMs with zero-shot prompting and LoRA fine-tuning, incorporating speech pause features, evaluated on multilingual datasets with improved hierarchical metrics.

Details

Motivation: Segmenting speech transcripts into thematic sections benefits downstream processing and accessibility for users who rely on written text. Current approaches lack hierarchical structure that captures both topic and subtopic boundaries.

Method: Introduces hierarchical topic segmentation generating multi-level tables of contents. Compares zero-shot prompting and LoRA fine-tuning on large language models, and explores integration of high-level speech pause features.

Result: Evaluations on English meeting recordings and multilingual lecture transcripts (Portuguese, German) show significant improvements over established topic segmentation baselines. Adapts evaluation measure for multi-level segmentation considering all hierarchical levels.

Conclusion: The approach successfully creates hierarchical topic segmentation for speech transcripts, outperforming existing methods and providing better structure for accessibility and downstream processing tasks.

Abstract: Segmenting speech transcripts into thematic sections benefits both downstream processing and users who depend on written text for accessibility. We introduce a novel approach to hierarchical topic segmentation in transcripts, generating multi-level tables of contents that capture both topic and subtopic boundaries. We compare zero-shot prompting and LoRA fine-tuning on large language models, while also exploring the integration of high-level speech pause features. Evaluations on English meeting recordings and multilingual lecture transcripts (Portuguese, German) show significant improvements over established topic segmentation baselines. Additionally, we adapt a common evaluation measure for multi-level segmentation, taking into account all hierarchical levels within one metric.

[10] Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

Main category: cs.CL

TL;DR: Researchers demonstrate how colluding AI agents can manipulate victim beliefs using only truthful evidence fragments, exploiting LLMs’ reasoning capabilities to create deceptive narratives without falsifying information.

Details

Motivation: As LLMs become autonomous agents processing real-time information, their reasoning capabilities create new attack surfaces. The paper aims to explore how colluding agents can manipulate beliefs using only truthful evidence, without traditional attack methods like backdoors or falsified documents.

Method: Introduces Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments. Developed CoPHEME dataset from real-world rumor events and simulated attacks across 14 LLM families to study vulnerability.

Result: Attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success. False beliefs cascade to downstream judges with over 60% deception rates.

Conclusion: LLM-based agents are vulnerable to cognitive collusion attacks where truthful evidence fragments can be weaponized to create false beliefs. This reveals a socio-technical vulnerability in how AI agents interact with dynamic information environments, with stronger reasoning capabilities paradoxically increasing risk.

Abstract: As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs’ overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

[11] ks-lit-3m: A 3.1 million word kashmiri text dataset for large language model pretraining

Haq Nawaz Malik

Main category: cs.CL

TL;DR: This paper introduces KS-LIT-3M, a 3.1 million word corpus for Kashmiri language model pretraining, addressing data scarcity by converting legacy InPage-formatted literature to Unicode.

Details

Motivation: LLMs perform poorly on Kashmiri due to lack of high-quality training data, despite it being spoken by ~7 million people. Decades of Kashmiri literature remain inaccessible in proprietary InPage format.

Method: Developed specialized InPage-to-Unicode converter, then applied rigorous preprocessing including English contamination removal, character normalization, and quality validation to create a continuous linear text stream optimized for causal LM training.

Result: Created KS-LIT-3M corpus with 3.1M words (16.4M characters) from 131,607 unique words across diverse genres (literature, journalism, academic texts, religious scholarship), released under CC-BY-4.0 license.

Conclusion: The dataset addresses a fundamental resource gap for Kashmiri NLP, enabling better language model performance and facilitating research in Kashmiri language technology.

Abstract: Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a specialized InPage-to-Unicode converter, followed by rigorous preprocessing including English contamination removal, character normalization, and quality validation. Encompassing 131,607 unique words drawn from diverse genres including literary works, journalistic writing, academic texts, and religious scholarship, KS-LIT-3M addresses a fundamental resource gap for Kashmiri language technology. The dataset is released under the CC-BY-4.0 license to facilitate research in Kashmiri natural language processing.

[12] EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation

Zilin Li, Weiwei Xu, Xuanbo Lu, Zheda Liu

Main category: cs.CL

TL;DR: EmoLoom-2B is a lightweight pipeline that transforms small language models (<2B params) into efficient screening tools for joint emotion classification and VAD prediction, featuring reproducible protocols and semantic regularizers.

Details

Motivation: To create a budget-friendly, reproducible, and auditable screening pipeline for emotion analysis that can serve as a reliable first pass before more resource-intensive training or multimodal approaches, while ensuring fair evaluation through standardized protocols.

Method: Uses a unified JSON I/O contract for data loading/training/inference with KV-off decoding; incorporates VAD-preserving constraints and external appraisal classifier for semantic regularization; employs Valence Flip augmentation for polarity sensitivity; applies A/B mixture sampling with entropy-aware temperature scheduling during fine-tuning.

Result: Using Qwen-1.8B-Chat as base, achieves strong performance on GoEmotions and EmpatheticDialogues, and demonstrates robust cross-corpus generalization on DailyDialog, showing the pipeline’s effectiveness as a screening tool.

Conclusion: EmoLoom-2B provides a dependable, budget-aware, and re-entrant screening solution for emotion analysis tasks, offering a reliable first pass that can be audited and reproduced before committing to heavier training or multimodal fusion approaches.

Abstract: We introduce EmoLoom-2B, a lightweight and reproducible pipeline that turns small language models under 2B parameters into fast screening candidates for joint emotion classification and Valence-Arousal-Dominance prediction. To ensure protocol-faithful and fair evaluation, we unify data loading, training, and inference under a single JSON input-output contract and remove avoidable variance by adopting KV-off decoding as the default setting. We incorporate two orthogonal semantic regularizers: a VAD-preserving constraint that aligns generated text with target VAD triples, and a lightweight external appraisal classifier that provides training-time guidance on goal attainment, controllability, certainty, and fairness without injecting long rationales. To improve polarity sensitivity, we introduce Valence Flip augmentation based on mirrored emotional pairs. During supervised fine-tuning, we apply A/B mixture sampling with entropy-aware temperature scheduling to balance coverage and convergence. Using Qwen-1.8B-Chat as the base model, EmoLoom-2B achieves strong performance on GoEmotions and EmpatheticDialogues, and demonstrates robust cross-corpus generalization on DailyDialog. The proposed recipe is budget-aware, auditable, and re-entrant, serving as a dependable screening pass before heavier training or multimodal fusion.

[13] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

Omer Nacar, Serry Sibaee, Adel Ammar, Yasser Alhabashi, Nadia Samer Sibai, Yara Farouk Ahmed, Ahmed Saud Alqusaiyer, Sulieman Mahmoud AlMahmoud, Abdulrhman Mamdoh Mukhaniq, Lubaba Raed, Sulaiman Mohammed Alatwah, Waad Nasser Alqahtani, Yousif Abdulmajeed Alnasser, Mohamed Aziz Khadraoui, Wadii Boulila

Main category: cs.CL

TL;DR: ARCADE is the first Arabic speech dataset with city-level dialect granularity, containing 3,790 audio segments from 58 cities across 19 countries, annotated for dialect, emotion, and speech type.

Details

Motivation: Arabic has rich regional dialects with substantial phonetic and lexical differences, but existing datasets lack fine-grained dialect mapping at the city level, limiting research on dialect identification and analysis.

Method: Collected Arabic radio speech from streaming services across the Arab world, capturing 30-second segments. Implemented a pipeline with native Arabic reviewers (1-3 per clip) to annotate emotion, speech type, dialect category, and validity flags. Resulted in 6,907 annotations for 3,790 unique audio segments.

Result: Created ARCADE corpus with 3,790 audio segments spanning 58 cities across 19 countries, providing fine-grained annotations for city-level dialect tagging. The dataset enables robust multi-task learning and serves as a benchmark for dialect identification tasks.

Conclusion: ARCADE addresses the gap in fine-grained Arabic dialect resources by providing the first city-level speech dataset, facilitating advanced research in dialect identification, multi-task learning, and Arabic speech analysis.

Abstract: The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations enable robust multi-task learning, serving as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and provide a comprehensive analysis of label distributions. The dataset is available on: https://huggingface.co/datasets/riotu-lab/ARCADE-full

[14] Listen, Attend, Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels

Yacouba Diarra, Michael Leventhal

Main category: cs.CL

TL;DR: LAU is a semantic regularization technique for end-to-end speech translation that uses frozen text embeddings to constrain acoustic encoder’s latent space, improving semantic preservation without increasing inference cost.

Details

Motivation: End-to-end speech translation suffers from slow convergence and poor performance when target transcriptions have high variance and semantic ambiguity, especially with scarce/noisy training data.

Method: Propose Listen, Attend, Understand (LAU) - uses frozen text embeddings to provide directional auxiliary loss that injects linguistic groundedness into acoustic representations during training.

Result: LAU achieves comparable performance to E2E-ST system pretrained with 100% more data, while better preserving semantic meaning. Introduces Total Parameter Drift metric showing semantic constraints reorganize encoder weights to prioritize meaning over phonetics.

Conclusion: LAU is a robust alternative to post-hoc rescoring and valuable addition to E2E-ST training, especially with scarce/noisy data, without increasing inference cost.

Abstract: End-to-End Speech Translation often shows slower convergence and worse performance when target transcriptions exhibit high variance and semantic ambiguity. We propose Listen, Attend, Understand (LAU), a semantic regularization technique that constrains the acoustic encoder’s latent space during training. By leveraging frozen text embeddings to provide a directional auxiliary loss, LAU injects linguistic groundedness into the acoustic representation without increasing inference cost. We evaluate our method on a Bambara-to-French dataset with 30 hours of Bambara speech translated by non-professionals. Experimental results demonstrate that LAU models achieve comparable performance by standard metrics compared to an E2E-ST system pretrained with 100% more data and while performing better in preserving semantic meaning. Furthermore, we introduce Total Parameter Drift as a metric to quantify the structural impact of regularization to demonstrate that semantic constraints actively reorganize the encoder’s weights to prioritize meaning over literal phonetics. Our findings suggest that LAU is a robust alternative to post-hoc rescoring and a valuable addition to E2E-ST training, especially when training data is scarce and/or noisy.

[15] MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection

Bingbing Wang, Zhengda Jin, Bin Liang, Wenjie Li, Jing Li, Ruifeng Xu, Min Zhang

Main category: cs.CL

TL;DR: MIND introduces a meta-cognitive dual-reasoning network that shifts from learning to fuse to learning to reason for multimodal stance detection, outperforming baselines on MMSD benchmark.

Details

Motivation: Existing multimodal stance detection methods focus on learning to fuse modalities but lack explicit reasoning processes to understand how inter-modal dynamics (irony, conflict) shape final stance, leading to frequent misjudgments.

Method: MIND (Meta-cognitive Intuitive-reflective Network for Dual-reasoning) implements a self-improving loop inspired by dual-process theory: 1) intuitive stage generates rapid hypothesis by querying evolving Modality and Semantic Experience Pools, 2) meta-cognitive reflective stage uses Modality-CoT and Semantic-CoT to scrutinize initial judgment, distill adaptive strategies, and evolve experience pools.

Result: Extensive experiments on MMSD benchmark demonstrate MIND significantly outperforms most baseline models and exhibits strong generalization.

Conclusion: The paradigm shift from learning to fuse to learning to reason, implemented through MIND’s dual-reasoning approach with evolving experience pools, enables more robust and context-aware stance detection by explicitly modeling inter-modal dynamics.

Abstract: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing methods predominantly operate by learning to fuse modalities. They lack an explicit reasoning process to discern how inter-modal dynamics, such as irony or conflict, collectively shape the user’s final stance, leading to frequent misjudgments. To address this, we advocate for a paradigm shift from learning to fuse to learning to reason. We introduce MIND, a Meta-cognitive Intuitive-reflective Network for Dual-reasoning. Inspired by the dual-process theory of human cognition, MIND operationalizes a self-improving loop. It first generates a rapid, intuitive hypothesis by querying evolving Modality and Semantic Experience Pools. Subsequently, a meta-cognitive reflective stage uses Modality-CoT and Semantic-CoT to scrutinize this initial judgment, distill superior adaptive strategies, and evolve the experience pools themselves. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the MMSD benchmark demonstrate that our MIND significantly outperforms most baseline models and exhibits strong generalization.

[16] RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution

Andrew Borthwick, Stephen Ash

Main category: cs.CL

TL;DR: AI agents autonomously evolve Text-to-SQL systems through survival-of-the-fittest evolution cycles, discovering effective techniques without domain guidance, achieving significant performance gains especially on cheaper models.

Details

Motivation: To enable AI systems to autonomously conduct research and improve their own capabilities without human intervention, specifically for Text-to-SQL tasks, starting from minimal human-provided baselines.

Method: RoboPhD uses a closed-loop evolution cycle with two coordinated agents: SQL Generation agent (database analysis + SQL generation) and Evolution agent that designs new versions based on performance feedback. Features ELO-based selection mechanism for survival-of-the-fittest dynamics and handles non-transitivity in performance.

Result: Evolved agents from 70 to 1500 lines over 18 iterations, autonomously discovered strategies like size-adaptive database analysis and SQL generation patterns. Achieved 73.67% accuracy on BIRD test set. Biggest gains on cheaper models: 8.9 points improvement over Claude Haiku, enabling ‘skip a tier’ deployment where evolved cheaper models outperform naive expensive ones.

Conclusion: AI can autonomously build strong agentic systems with only trivial human starting points, demonstrating the potential for autonomous AI research and optimization, particularly enabling cost-effective deployment through evolved cheaper models outperforming naive expensive ones.

Abstract: We present RoboPhD, a system where AI agents autonomously conduct research to improve Text-to-SQL performance. RoboPhD implements a closed-loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO-based selection mechanism enabling survival-of-the-fittest dynamics while handling non-transitivity in performance. Starting from a naive 70-line baseline, RoboPhD evolves agents through iterative cross-pollination, discovering effective techniques without any external guidance on the Text-to-SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size-adaptive database analysis that adjusts depth based on schema complexity and SQL generation patterns for column selection, evidence interpretation, and aggregation. Evolution provides the largest gains on cheaper models: while we improve by 2.3 points over a strong Claude Opus 4.5 naive baseline, we show an improvement of 8.9 points over the weaker Claude Haiku model. This enables ‘skip a tier’ deployment: evolved Haiku exceeds naive Sonnet accuracy, and evolved Sonnet exceeds naive Opus, both at lower cost. The full system achieves 73.67% accuracy on the BIRD test set, demonstrating that AI can autonomously build a strong agentic system with only a trivial human-provided starting point.

[17] KOS-TL (Knowledge Operation System Type Logic)

Peng Chen

Main category: cs.CL

TL;DR: KOS-TL is a constructive framework using Dependent Type Theory to unify data, logic, and proof for autonomous knowledge systems, featuring three hierarchical layers and formal operational semantics with proven meta-theoretical properties.

Details

Motivation: To bridge the gap between static symbolic logic and dynamic system execution in knowledge representation, providing a rigorous logical foundation for autonomous and executable knowledge systems.

Method: Leverages Dependent Type Theory with three hierarchical layers: Core Layer (static type universe), Kernel Layer (event-driven state evolution via ⟨Σ, Ev, Δ⟩), and Runtime Layer (bidirectional refinement of physical signals into logical evidence). Integrates Davidsonian event semantics with Martin-Löf type theory.

Result: Formally defines operational semantics and proves key meta-theoretical properties including Progress and Evolutionary Consistency, ensuring logical self-consistency and freedom from stuck states during continuous transitions. Enables “proof-carrying knowledge” where state changes have formal validity witnesses.

Conclusion: KOS-TL provides a robust, formally verifiable basis for next-generation intelligent autonomous operating systems, demonstrated through applications in industrial traceability and cross-border financial compliance.

Abstract: This paper introduces KOS-TL (Knowledge Operation System Type Logic), a novel constructive framework designed to provide a rigorous logical foundation for autonomous and executable knowledge systems. Traditional knowledge representation models often suffer from a gap between static symbolic logic and dynamic system execution. To bridge this divide, KOS-TL leverages Dependent Type Theory to unify data, logic, and proof into a singular computational substrate.The architecture of KOS-TL is organized into three hierarchical layers: the Core Layer, which defines the static type universe and constructive primitives; the Kernel Layer, which governs state evolution through an event-driven mechanism characterized by the triple $\langle Σ, \textsf{Ev}, Δ\rangle$; and the Runtime Layer, responsible for the bidirectional refinement of physical signals into logical evidence. We formally define the operational semantics of the system and prove key meta-theoretical properties, including Progress and Evolutionary Consistency, ensuring that the system remains logically self-consistent and free from stuck states during continuous state transitions.By integrating Davidsonian event semantics with Martin-Löf type theory, KOS-TL enables the construction of “proof-carrying knowledge,” where every state change in the knowledge base is accompanied by a formal witness of its validity. We demonstrate the practical utility of this logic through application examples in industrial traceability and cross-border financial compliance. Our results suggest that KOS-TL provides a robust, formally verifiable basis for the next generation of intelligent, autonomous operating systems.

[18] SongSage: A Large Musical Language Model with Lyric Generative Pre-training

Jiani Guo, Jiajia Li, Jie Wu, Zuchao Li, Yujiu Yang, Ping Wang

Main category: cs.CL

TL;DR: SongSage is a large musical language model trained on 5.48B lyrical tokens that excels at lyric-centric tasks like playlist understanding, lyric generation, and query rewriting for music recommendations, while maintaining general knowledge capabilities.

Details

Motivation: Current LLMs have limited understanding of lyric-centric knowledge and playlist comprehension, despite their success in other domains. There's a need for specialized models that can understand musical content, playlists, and user intents related to music.

Method: 1) Created PlaylistSense dataset to evaluate playlist understanding; 2) Developed SongSage through continual pretraining on LyricBank (5.48B lyrical tokens); 3) Fine-tuned with LyricBank-SFT (775k samples across 9 lyric-centric tasks); 4) Evaluated on playlist understanding, lyric generation, and general knowledge tasks.

Result: SongSage demonstrates strong lyric-centric knowledge, excels at rewriting user queries for zero-shot playlist recommendations, generates/continues lyrics effectively, performs well across 7 additional capabilities, and maintains competitive general knowledge (MMLU score).

Conclusion: SongSage successfully addresses the gap in lyric-centric LLM understanding, providing a specialized model for music AI applications while preserving general knowledge. The model and training script will be released for reproducibility, though datasets remain restricted due to copyright.

Abstract: Large language models have achieved significant success in various domains, yet their understanding of lyric-centric knowledge has not been fully explored. In this work, we first introduce PlaylistSense, a dataset to evaluate the playlist understanding capability of language models. PlaylistSense encompasses ten types of user queries derived from common real-world perspectives, challenging LLMs to accurately grasp playlist features and address diverse user intents. Comprehensive evaluations indicate that current general-purpose LLMs still have potential for improvement in playlist understanding. Inspired by this, we introduce SongSage, a large musical language model equipped with diverse lyric-centric intelligence through lyric generative pretraining. SongSage undergoes continual pretraining on LyricBank, a carefully curated corpus of 5.48 billion tokens focused on lyrical content, followed by fine-tuning with LyricBank-SFT, a meticulously crafted instruction set comprising 775k samples across nine core lyric-centric tasks. Experimental results demonstrate that SongSage exhibits a strong understanding of lyric-centric knowledge, excels in rewriting user queries for zero-shot playlist recommendations, generates and continues lyrics effectively, and performs proficiently across seven additional capabilities. Beyond its lyric-centric expertise, SongSage also retains general knowledge comprehension and achieves a competitive MMLU score. We will keep the datasets inaccessible due to copyright restrictions and release the SongSage and training script to ensure reproducibility and support music AI research and applications, the datasets release plan details are provided in the appendix.

[19] DHI: Leveraging Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models

Jiani Guo, Xiangke Zeng, Jie Wu, Zuchao Li

Main category: cs.CL

TL;DR: DHI is a new training framework that enables Evil LLMs to generate diverse hallucinations without annotated data, using modified loss functions and adaptive constraints to improve hallucination mitigation.

Details

Motivation: Current hallucination mitigation approaches are limited because Evil LLMs trained on specific error types only reproduce those patterns, lacking diversity in hallucination generation, which restricts overall effectiveness.

Method: DHI uses a modified loss function that down-weights factually correct tokens to encourage diverse hallucinations at targeted positions while maintaining overall factual content. It includes causal attention masking to reduce penalization impact on subsequent tokens, and adaptive rationality constraints during inference that restrict contrastive decoding to high-confidence tokens.

Result: Extensive empirical results show DHI achieves significant performance gains over other contrastive decoding-based approaches across multiple hallucination benchmarks.

Conclusion: DHI effectively addresses the diversity limitation in hallucination generation for contrastive decoding methods, providing a more robust framework for hallucination mitigation in LLMs.

Abstract: Large language models (LLMs) frequently produce inaccurate or fabricated information, known as “hallucinations,” which compromises their reliability. Existing approaches often train an “Evil LLM” to deliberately generate hallucinations on curated datasets, using these induced hallucinations to guide contrastive decoding against a reliable “positive model” for hallucination mitigation. However, this strategy is limited by the narrow diversity of hallucinations induced, as Evil LLMs trained on specific error types tend to reproduce only these particular patterns, thereby restricting their overall effectiveness. To address these limitations, we propose DHI (Diverse Hallucination Induction), a novel training framework that enables the Evil LLM to generate a broader range of hallucination types without relying on pre-annotated hallucination data. DHI employs a modified loss function that down-weights the generation of specific factually correct tokens, encouraging the Evil LLM to produce diverse hallucinations at targeted positions while maintaining overall factual content. Additionally, we introduce a causal attention masking adaptation to reduce the impact of this penalization on the generation of subsequent tokens. During inference, we apply an adaptive rationality constraint that restricts contrastive decoding to tokens where the positive model exhibits high confidence, thereby avoiding unnecessary penalties on factually correct tokens. Extensive empirical results show that DHI achieves significant performance gains over other contrastive decoding-based approaches across multiple hallucination benchmarks.

[20] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee

Main category: cs.CL

TL;DR: SLMs suffer from “style amnesia” - they cannot maintain instructed speaking styles (emotion, accent, volume, speed) across multi-turn conversations, even though they can recall the instructions when asked.

Details

Motivation: To investigate whether spoken language models can consistently maintain paralinguistic speaking styles (emotion, accent, volume, speaking speed) throughout multi-turn conversations when explicitly instructed to do so.

Method: Evaluated three proprietary and two open-source SLMs by instructing them to speak in specific styles at conversation start, then testing style maintenance across multiple turns. Also tested style recall ability and various prompting strategies including system vs user messages.

Result: All tested SLMs failed to maintain consistent speaking styles across conversations (style amnesia). Models could recall style instructions when asked but failed to express them. Explicit recall requests partially mitigated the issue. SLMs struggled more with system messages than user messages for style instructions.

Conclusion: Current SLMs have fundamental limitations in maintaining paralinguistic speaking styles across conversations, revealing a gap between instruction recall and style expression. The findings challenge the effectiveness of system prompts for style control and highlight the need for improved architectural approaches.

Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

[21] Almost Clinical: Linguistic properties of synthetic electronic health records

Serge Sharoff, John Baker, David Francis Hunt, Alan Simpson

Main category: cs.CL

TL;DR: LLMs can generate coherent synthetic mental health EHRs with appropriate terminology, but show systematic limitations including register shifts, lack of clinical specificity, and medication/diagnostic inaccuracies.

Details

Motivation: To evaluate the linguistic and clinical suitability of synthetic electronic health records (EHRs) in mental health, understanding how LLMs construct medical authority and patient agency through language.

Method: Created synthetic EHR corpus, then assessed agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals, Care plans) to analyze linguistic choices in constructing medical authority.

Result: LLMs produce coherent, terminology-appropriate texts approximating clinical practice, but show systematic divergences including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures.

Conclusion: While LLMs show promise for generating synthetic mental health EHRs, systematic linguistic and clinical limitations require attention for reliable clinical application.

Abstract: This study evaluates the linguistic and clinical suitability of synthetic electronic health records (EHRs) in the field of mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we assess agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans) to understand how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures.

[22] Stylometry Analysis of Human and Machine Text for Academic Integrity

Hezam Albaqami, Muhammad Asif Ayub, Nasir Ahmad, Yaseen Ahmad, Mohammed M. Alqahtani, Abdullah M. Algamdi, Almoaid A. Owaidah, Kashif Ahmad

Main category: cs.CL

TL;DR: NLP framework for academic integrity addressing plagiarism, fabrication, and authorship verification through author attribution and style change detection across four key tasks.

Details

Motivation: Address critical challenges to academic integrity including plagiarism, fabrication, and verification of authorship in educational content, as existing solutions are incomplete and several aspects remain unexplored.

Method: Proposes NLP-based framework targeting four tasks: human vs machine text classification, single vs multi-author differentiation, author change detection in multi-authored documents, and author recognition in collaborative documents. Evaluated on two datasets generated with Gemini using normal and strict prompts.

Result: Performance reduction observed on dataset generated with strict prompt, demonstrating complexities in detecting machine-generated text with cleverly crafted prompts. Generated datasets, code, and materials made publicly available on GitHub.

Conclusion: Provides comprehensive analysis and baseline for future research in academic integrity through NLP-based authorship authentication, highlighting challenges with sophisticated AI-generated content.

Abstract: This work addresses critical challenges to academic integrity, including plagiarism, fabrication, and verification of authorship of educational content, by proposing a Natural Language Processing (NLP)-based framework for authenticating students’ content through author attribution and style change detection. Despite some initial efforts, several aspects of the topic are yet to be explored. In contrast to existing solutions, the paper provides a comprehensive analysis of the topic by targeting four relevant tasks, including (i) classification of human and machine text, (ii) differentiating in single and multi-authored documents, (iii) author change detection within multi-authored documents, and (iv) author recognition in collaboratively produced documents. The solutions proposed for the tasks are evaluated on two datasets generated with Gemini using two different prompts, including a normal and a strict set of instructions. During experiments, some reduction in the performance of the proposed solutions is observed on the dataset generated through the strict prompt, demonstrating the complexities involved in detecting machine-generated text with cleverly crafted prompts. The generated datasets, code, and other relevant materials are made publicly available on GitHub, which are expected to provide a baseline for future research in the domain.

[23] Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure

Zsolt Csibi, Bence György Gortka, Natabara Gyöngyössy, Kornél Nagy, Dávid Márk Nemeskey, Martin Sallai, András Simonyi, András Márk Szekeres, Gábor Palkó

Main category: cs.CL

TL;DR: Racka is a Hungarian-focused LLM using LoRA-based continual pretraining on Qwen-3 4B with improved Hungarian tokenization while maintaining English/German performance.

Details

Motivation: To bridge the resource gap between Hungarian and high-resource languages (English, German) by creating a practical, lightweight model that can be trained efficiently on available hardware.

Method: Parameter-efficient continual pretraining using Low-Rank Adaptation (LoRA) on Qwen-3 4B backbone; tokenizer replacement/adaptation for better Hungarian tokenization; training on 160B tokens with mixed data (44% Hungarian, 24% English, 21% German, 11% code).

Result: Preliminary results show modest but stable language adaptation performance with substantially improved tokenization fertility for Hungarian while maintaining competitive English/German capabilities.

Conclusion: Racka demonstrates a practical approach to creating Hungarian-focused LLMs through efficient continual pretraining, addressing resource disparities while preserving high-resource language performance.

Abstract: We present Racka, a lightweight, continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages such as English and German. Racka employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen-3 4B backbone, making the recipe practical on A100 (40GB)-based HPC clusters with low inter-node bandwidth. To better match the training distribution, we replace and adapt the tokenizer, achieving substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German. The model is trained on 160B subword tokens drawn from a mixture of internet and high-quality curated sources, with a composition of 44% Hungarian, 24% English, 21% German, and 11% code. This data mix is chosen to mitigate catastrophic forgetting and preserve high-resource language capabilities during continual pretraining. Our preliminary results indicate modest but stable results in language adaptation.

[24] From Policy to Logic for Efficient and Interpretable Coverage Assessment

Rhitabrat Pokharel, Hamid Hassanzadeh, Ameeta Agrawal

Main category: cs.CL

TL;DR: Hybrid LLM + symbolic reasoning system for medical policy review that reduces costs by 44% while improving accuracy by 4.5% F1 score.

Details

Motivation: LLMs struggle with hallucinations and inconsistencies in complex legal/policy documents, especially critical for medical coverage policy review where human experts need reliable, accurate information.

Method: Combines coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts/rules, and generate auditable rationales. Minimizes LLM inferences to reduce costs.

Result: Achieves 44% reduction in inference cost alongside 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness improvements.

Conclusion: Hybrid approach supports human reviewers by making policy interpretation more efficient and interpretable while reducing LLM-related costs and improving reliability.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.

[25] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory

Sen Hu, Yuxiang Wei, Jiaxin Ran, Zhiyuan Yao, Lei Zou

Main category: cs.CL

TL;DR: Graph-based dialog memory systems show inconsistent effectiveness; experimental analysis reveals performance differences stem from foundational system settings rather than architectural innovations.

Details

Motivation: Graph structures are increasingly used in dialog memory systems but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter for long-term dialog memory architectures.

Method: Introduce a unified framework that decomposes dialog memory systems into core components supporting both graph-based and non-graph approaches. Conduct controlled, stage-wise experiments on LongMemEval and HaluMem datasets, comparing common design choices in memory representation, organization, maintenance, and retrieval.

Result: Results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. The study identifies stable and reliable strong baselines for future dialog memory research.

Conclusion: The research provides an experimental, system-oriented analysis that clarifies which design choices truly matter in dialog memory systems, establishing reliable baselines and showing that foundational settings often outweigh architectural innovations.

Abstract: Graph structures are increasingly used in dialog memory systems, but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter. We present an experimental, system-oriented analysis of long-term dialog memory architectures. We introduce a unified framework that decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches. Under this framework, we conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing common design choices in memory representation, organization, maintenance, and retrieval. Our results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. Based on these findings, we identify stable and reliable strong baselines for future dialog memory research.

[26] T3C: Test-Time Tensor Compression with Consistency Guarantees

Ismail Lamaakal, Chaymae Yahyati, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi

Main category: cs.CL

TL;DR: T3C is a train-once compression framework that uses budget-conditioned rank and precision control for efficient deployment, with consistency certificates ensuring reliability.

Details

Motivation: Need for flexible model compression that can adapt to different deployment constraints (latency/energy/size) without retraining, while maintaining reliability guarantees.

Method: Combines elastic tensor factorization (up to max rank) with rank-tied mixed-precision quantization, plus a lightweight controller that maps budget tokens to per-layer rank/bit assignments with hardware-aligned profiles.

Result: Achieves state-of-the-art Pareto frontier: ResNet-50 at 1.18ms p50 latency with 38MB model (vs PTQ-8b 1.44ms, 88MB); ViT-B/16 at 2.30ms p50 with 59MB, outperforming PTQ/QAT baselines.

Conclusion: Single T3C checkpoint provides predictable, certificate-backed accuracy-latency-size trade-offs across devices, enabling on-demand deployment adaptation.

Abstract: We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (\leq 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.

[27] FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu

Main category: cs.CL

TL;DR: TTC-aware training enables early stopping with test-time compute optimization to match full training accuracy with up to 92% fewer training FLOPs.

Details

Motivation: Traditional training is resource-intensive, and while increasing test-time compute can help smaller models match larger ones, there's an opportunity to optimize the trade-off between training and inference compute for faster deployment cycles.

Method: Proposes TTC-aware training with an early stopping algorithm that jointly selects checkpoints and TTC configurations, plus efficient TTC evaluation and break-even bound analysis to minimize training FLOPs without sacrificing accuracy.

Result: Achieves up to 92% reduction in training FLOPs while maintaining or even improving accuracy compared to fully trained models.

Conclusion: Introduces a new paradigm for balancing training and inference compute, enabling faster model deployment and more frequent refreshes by optimizing the training-test-time compute trade-off.

Abstract: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC)-for example through iterative sampling-can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92% reductions in training FLOPs while maintaining and sometimes remarkably improving accuracy. These results highlight a new perspective for balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Codes will be publicly released.

[28] Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems

Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal

Main category: cs.CL

TL;DR: Generalist LLMs outperform domain-specific fine-tuned models in empathy for RAG-based mental health counseling, despite being smaller in size, showing that strong reasoning matters more than domain-specific training.

Details

Motivation: Addressing challenges of hallucinations and lack of empathy in LLM-based mental health counseling, and investigating whether domain-specific fine-tuning or general reasoning capabilities are more effective when using RAG to ground responses in clinical sources.

Method: Direct comparison of four open-source models through same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) vs. two domain-specific fine-tuned models (MentalHealthBot-7B and TherapyBot-7B). Automated evaluation using LLM-as-a-Judge framework over 50 turns.

Result: Generalist models outperformed domain-specific ones in empathy (3.72 vs. 3.26, p < 0.001) despite being smaller (3B vs 7B). All models performed well on safety, but generalists showed better contextual understanding and were less prone to overfitting observed in domain-specific models.

Conclusion: For RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary. A well-reasoned general model provides more empathetic and balanced support than larger narrowly fine-tuned models when answers are grounded in clinical evidence.

Abstract: The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, $p < 0.001$) in spite of being much smaller (3B vs. 7B), and all models perform well in terms of safety, but the generalist models show better contextual understanding and are less prone to overfitting as we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary; i.e. a well-reasoned general model would provide more empathetic and balanced support than a larger narrowly fine-tuned model, so long as the answer is already grounded in clinical evidence.

[29] FC-CONAN: An Exhaustively Paired Dataset for Robust Evaluation of Retrieval Systems

Juan Junqueras, Florian Boudin, May-Myo Zin, Ha-Thanh Nguyen, Wachara Fungwacharakorn, Damián Ariel Furman, Akiko Aizawa, Ken Satoh

Main category: cs.CL

TL;DR: FC-CONAN is the first fully connected hate speech-counter narrative dataset with exhaustive pairing of 45 HS messages and 129 CNs, enabling more comprehensive evaluation of counterspeech systems.

Details

Motivation: Existing datasets like CONAN only annotate sparse subsets of HS-CN pairs, limiting evaluation capabilities for counterspeech research. There's a need for exhaustive pairing to uncover previously unlabeled positive examples and enable more faithful evaluation.

Method: Created by exhaustively considering all combinations of 45 English HS messages and 129 CNs. Used a two-stage annotation process with nine annotators and four validators to produce four quality partitions (Diamond, Gold, Silver, Bronze) balancing reliability and scale.

Result: FC-CONAN contains hundreds of previously unlabeled positive HS-CN pairs that don’t overlap with CONAN. The dataset enables more faithful evaluation of counterspeech retrieval systems and facilitates detailed error analysis.

Conclusion: FC-CONAN addresses limitations of sparse annotation in existing datasets by providing exhaustive HS-CN pairings, offering a valuable resource for advancing counterspeech research with publicly available data.

Abstract: Hate speech (HS) is a critical issue in online discourse, and one promising strategy to counter it is through the use of counter-narratives (CNs). Datasets linking HS with CNs are essential for advancing counterspeech research. However, even flagship resources like CONAN (Chung et al., 2019) annotate only a sparse subset of all possible HS-CN pairs, limiting evaluation. We introduce FC-CONAN (Fully Connected CONAN), the first dataset created by exhaustively considering all combinations of 45 English HS messages and 129 CNs. A two-stage annotation process involving nine annotators and four validators produces four partitions-Diamond, Gold, Silver, and Bronze-that balance reliability and scale. None of the labeled pairs overlap with CONAN, uncovering hundreds of previously unlabelled positives. FC-CONAN enables more faithful evaluation of counterspeech retrieval systems and facilitates detailed error analysis. The dataset is publicly available.

[30] Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

Jerry Huang, Peng Lu, Qiuhao Zeng, Yusuke Iwasawa, Yutaka Matsuo, Sarath Chandar, Edison Marrese-Taylor, Irene Li

Main category: cs.CL

TL;DR: Instruction-tuning LLMs on high-resource language data improves confidence in low-resource languages but not accuracy, causing mis-calibration; label smoothing helps maintain better calibration across all languages without needing low-resource data.

Details

Motivation: To investigate the calibration gap of LLMs in multilingual settings, understanding how data scarcity affects calibration and how common techniques apply across languages.

Method: Analysis on two multilingual benchmarks covering 29 and 42 languages, examining effects of instruction-tuning on high-resource language SFT datasets and using label smoothing to address calibration issues.

Result: Instruction-tuning on high-resource data significantly increases model confidence in low-resource languages but yields marginal/no accuracy improvements, causing mis-calibration. Label smoothing effectively maintains better calibration across all languages without requiring low-resource SFT data.

Conclusion: Multilingual considerations are crucial for LLM training and tuning to improve reliability and fairness; standard SFT has shortcomings for multilingual settings, while label smoothing offers a practical solution for better calibration.

Abstract: Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.

[31] EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu, Yuhong Liu

Main category: cs.CL

TL;DR: EternalMath: An automated pipeline that transforms recent peer-reviewed mathematical literature into executable reasoning tasks for evaluating LLMs, revealing significant performance gaps at the research frontier.

Details

Motivation: Current mathematical reasoning evaluations for LLMs rely on static benchmarks with limited coverage of research-level mathematics, leading to rapid performance saturation. There's a need for scalable, continuously updatable evaluation that evolves with human mathematical discovery.

Method: A fully automated theorem-grounded pipeline that: 1) identifies constructive/quantitative results from recent peer-reviewed papers, 2) instantiates them into parameterized problem templates, 3) generates deterministic solutions through execution-based verification, enabling scalable evaluation without expert authoring.

Result: Created EternalMath, an evolving evaluation suite from contemporary research papers. Experiments with state-of-the-art LLMs show substantial performance gaps, indicating mathematical reasoning at the research frontier remains far from saturated.

Conclusion: The proposed automated pipeline enables scalable, reproducible, and continuously updatable evaluation of mathematical reasoning, revealing that current LLMs struggle with research-level mathematics and highlighting the need for evaluation methodologies that evolve with mathematical discovery.

Abstract: Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields \textbf{EternalMath}, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.

[32] LANCET: Neural Intervention via Structural Entropy for Mitigating Faithfulness Hallucinations in LLMs

Chenxu Wang, Chaozhuo Li, Pengbo Wang, Litian Zhang, Songyang Liu, Ji Qi, Jiahui Hu, Yushan Cai, Hao Zhao, Rui Pu

Main category: cs.CL

TL;DR: Lancet is a surgical neural intervention framework that precisely blocks hallucination propagation pathways in LLMs using structural entropy and gradient analysis, outperforming SOTA methods.

Details

Motivation: Current approaches to mitigate LLM hallucinations use imprecise node-level adjustments or coarse suppression, overlooking the distributed nature of neural information and failing to address how hallucinations propagate through specific forward transmission pathways.

Method: Lancet uses gradient-driven contrastive analysis to locate hallucination-prone neurons, maps propagation pathways by minimizing structural entropy, and implements hierarchical intervention to preserve general model capabilities while blocking hallucination flow.

Result: Comprehensive evaluations across hallucination benchmark datasets show Lancet significantly outperforms state-of-the-art methods, validating the effectiveness of the surgical neural intervention approach.

Conclusion: The surgical approach to neural intervention using structural entropy and precise pathway blocking effectively mitigates LLM hallucinations while preserving model capabilities, representing a more sophisticated solution than previous methods.

Abstract: Large Language Models have revolutionized information processing, yet their reliability is severely compromised by faithfulness hallucinations. While current approaches attempt to mitigate this issue through node-level adjustments or coarse suppression, they often overlook the distributed nature of neural information, leading to imprecise interventions. Recognizing that hallucinations propagate through specific forward transmission pathways like an infection, we aim to surgically block this flow using precise structural analysis. To leverage this, we propose Lancet, a novel framework that achieves precise neural intervention by leveraging structural entropy and hallucination difference ratios. Lancet first locates hallucination-prone neurons via gradient-driven contrastive analysis, then maps their propagation pathways by minimizing structural entropy, and finally implements a hierarchical intervention strategy that preserves general model capabilities. Comprehensive evaluations across hallucination benchmark datasets demonstrate that Lancet significantly outperforms state-of-the-art methods, validating the effectiveness of our surgical approach to neural intervention.

[33] From Emotion Classification to Emotional Reasoning: Enhancing Emotional Intelligence in Large Language Models

Arjhun Sreedar, Rohan Pillay, Laukik Patade

Main category: cs.CL

TL;DR: Synthetic emotional chain-of-thought data improves emotional reasoning in 7B LLMs, with Mistral 7B showing significant gains on EmoBench evaluations after fine-tuning.

Details

Motivation: To investigate whether synthetic emotional reasoning data can enhance emotional understanding and awareness in smaller open LLMs without requiring architectural changes.

Method: Multi-agent generation pipeline creates therapy-style conversations converted into structured emotion multiple-choice questions with explanations. Various 7B models are fine-tuned on this synthetic emotional chain-of-thought dataset.

Result: Fine-tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0 on EmoBench-style evaluations, demonstrating substantial gains in emotional reasoning capabilities.

Conclusion: Synthetic emotional reasoning data effectively enhances emotional understanding and awareness in smaller LLMs, proving emotional reasoning can be induced through data-driven fine-tuning rather than architectural modifications.

Abstract: This work investigates whether synthetic emotional chain-of-thought data can improve the emotional reasoning abilities of smaller open large language models (LLMs). We design a multi-agent generation pipeline that produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations. We propose that fine-tuning a variety of 7B models on this dataset should yield substantial gains in emotional understanding and emotional awareness on EmoBench-style evaluations, suggesting that emotional reasoning can be induced without architectural changes. Our results demonstrate that fine-tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0, validating the effectiveness of synthetic emotional reasoning data for enhancing model capabilities in nuanced emotional tasks.

Yilong Wang, Qianli Wang, Nils Feldhus

Main category: cs.CL

TL;DR: iFlip is an iterative refinement method that uses three types of feedback (model confidence, feature attribution, natural language) to generate valid counterfactual examples from LLMs, achieving 57.8% higher validity than SOTA baselines.

Details

Motivation: Existing single-pass methods for generating counterfactual examples with LLMs often fail to induce reliable label changes, neglecting LLMs' self-correction capabilities. There's untapped potential in leveraging iterative refinement approaches.

Method: iFlip uses an iterative refinement approach with three feedback types: model confidence (to measure prediction certainty), feature attribution (to identify important words), and natural language feedback (to guide edits). The method involves multiple iterations with early stopping.

Result: iFlip achieves 57.8% higher validity than five state-of-the-art baselines. User studies show it outperforms baselines in completeness, satisfaction, and feasibility. Ablation studies reveal three key components: appropriate iteration number, pointing to highly attributed words, and early stopping.

Conclusion: iFlip effectively leverages LLMs’ self-correction capabilities through iterative refinement with multiple feedback types. The generated counterfactuals enable effective data augmentation, improving model performance and robustness.

Abstract: Counterfactual examples are minimal edits to an input that alter a model’s prediction. They are widely employed in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging, as existing single-pass methods often fail to induce reliable label changes, neglecting LLMs’ self-correction capabilities. To explore this untapped potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback, including model confidence, feature attribution, and natural language. Our results show that iFlip achieves an average 57.8% higher validity than the five state-of-the-art baselines, as measured by the label flipping rate. The user study further corroborates that iFlip outperforms baselines in completeness, overall satisfaction, and feasibility. In addition, ablation studies demonstrate that three components are paramount for iFlip to generate valid counterfactuals: leveraging an appropriate number of iterations, pointing to highly attributed words, and early stopping. Finally, counterfactuals generated by iFlip enable effective counterfactual data augmentation, substantially improving model performance and robustness.

[35] Segmentation and Processing of German Court Decisions from Open Legal Data

Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

Main category: cs.CL

TL;DR: Researchers created a cleaned, sectioned dataset of 251,038 German court decisions from Open Legal Data, systematically separating key sections (Tenor, Tatbestand, Entscheidungsgründe) and verifying extraction quality through statistical sampling.

Details

Motivation: Structured legal data is crucial for advancing NLP techniques in the German legal system. While Open Legal Data provides extensive court decisions, the raw texts have inconsistent formatting and lack clearly marked sections, making reliable section separation important for rhetorical role classification and downstream tasks like retrieval and citation analysis.

Method: The researchers cleaned and sectioned 251,038 German court decisions from Open Legal Data, systematically separating three key sections: Tenor (operative part), Tatbestand (facts), and Entscheidungsgründe (judicial reasoning). They used Cochran’s formula with 95% confidence level and 5% margin of error to draw a statistically representative random sample of 384 cases for manual verification. They also extracted Rechtsmittelbelehrung (appeal notice) as a separate field since it’s procedural rather than part of the decision.

Result: Created a publicly available corpus of 251,038 German court decisions in JSONL format with reliably separated sections. Manual verification of the 384-case sample confirmed correct identification of all three main sections. The dataset provides structured, accessible legal data for research on the German legal system.

Conclusion: The work provides a valuable, cleaned dataset that addresses the inconsistency issues in the original Open Legal Data, making German legal texts more accessible for NLP research. The statistical verification ensures reliability of the section separation, and the public availability in JSONL format facilitates further research on the German legal system.

Abstract: The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran’s formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.

[36] Can Legislation Be Made Machine-Readable in PROLEG?

May-Myo Zin, Sabine Wehnert, Yuntao Kong, Ha-Thanh Nguyen, Wachara Fungwacharakorn, Jieying Xue, Michał Araszkiewicz, Randy Goebel, Ken Satoh, Le-Minh Nguyen

Main category: cs.CL

TL;DR: A framework combining LLMs and formal legal reasoning (PROLEG) to transform regulatory text into executable programs with human-readable explanations, demonstrated on GDPR Article 6.

Details

Motivation: Regulatory processes need both accuracy and efficiency for positive social impact. AI technologies like NLP and machine-assisted reasoning can help address this challenge by automating the transformation of legal text into executable formal representations.

Method: Framework using LLM prompts to simultaneously transform legal text into if-then rules and PROLEG encoding, validated by legal experts. The process includes: 1) LLM “compilation” of natural language to if-then rules, 2) further “compilation” of vetted rules to PROLEG encoding, 3) expert validation and refinement, 4) production of executable PROLEG programs.

Result: Successfully demonstrated end-to-end transformation of GDPR Article 6 into executable PROLEG program that can produce human-readable explanations for GDPR decisions. Created a working instance showing PROLEG execution.

Conclusion: The approach shows value in automating regulatory framework deployment but has limitations that need further development. Suggests continued advancement of such technologies for capturing and deploying regulatory frameworks effectively.

Abstract: The anticipated positive social impact of regulatory processes requires both the accuracy and efficiency of their application. Modern artificial intelligence technologies, including natural language processing and machine-assisted reasoning, hold great promise for addressing this challenge. We present a framework to address the challenge of tools for regulatory application, based on current state-of-the-art (SOTA) methods for natural language processing (large language models or LLMs) and formalization of legal reasoning (the legal representation system PROLEG). As an example, we focus on Article 6 of the European General Data Protection Regulation (GDPR). In our framework, a single LLM prompt simultaneously transforms legal text into if-then rules and a corresponding PROLEG encoding, which are then validated and refined by legal domain experts. The final output is an executable PROLEG program that can produce human-readable explanations for instances of GDPR decisions. We describe processes to support the end-to-end transformation of a segment of a regulatory document (Article 6 from GDPR), including the prompting frame to guide an LLM to “compile” natural language text to if-then rules, then to further “compile” the vetted if-then rules to PROLEG. Finally, we produce an instance that shows the PROLEG execution. We conclude by summarizing the value of this approach and note observed limitations with suggestions to further develop such technologies for capturing and deploying regulatory frameworks.

[37] Four Quadrants of Difficulty: A Simple Categorisation and its Limits

Vanessa Toborek, Sebastian Müller, Christian Bauckhage

Main category: cs.CL

TL;DR: The paper challenges common Curriculum Learning intuitions in NLP by showing that task-agnostic difficulty signals don’t align with what models actually find difficult, and proposes task-dependent difficulty estimators instead.

Details

Motivation: Current Curriculum Learning in NLP relies on task-agnostic linguistic heuristics or human intuition to estimate sample difficulty, assuming these correlate with what neural models find difficult to learn. The authors question this assumption and aim to systematically analyze different types of difficulty signals.

Method: Proposes a four-quadrant categorization of difficulty signals: human vs. model and task-agnostic vs. task-dependent. Systematically analyzes their interactions on a natural language understanding dataset to understand how different difficulty signals align or diverge.

Result: Found that task-agnostic features behave largely independently and only task-dependent features align with what models actually find difficult. This challenges common CL intuitions that task-agnostic heuristics correlate with model learning difficulty.

Conclusion: Highlights the need for lightweight, task-dependent difficulty estimators that better reflect model learning behavior, rather than relying on task-agnostic linguistic heuristics or human intuition.

Abstract: Curriculum Learning (CL) aims to improve the outcome of model training by estimating the difficulty of samples and scheduling them accordingly. In NLP, difficulty is commonly approximated using task-agnostic linguistic heuristics or human intuition, implicitly assuming that these signals correlate with what neural models find difficult to learn. We propose a four-quadrant categorisation of difficulty signals – human vs. model and task-agnostic vs. task-dependent – and systematically analyse their interactions on a natural language understanding dataset. We find that task-agnostic features behave largely independently and that only task-dependent features align. These findings challenge common CL intuitions and highlight the need for lightweight, task-dependent difficulty estimators that better reflect model learning behaviour.

[38] Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints

Junichiro Niimi

Main category: cs.CL

TL;DR: Reasoning in LLMs creates a trade-off: reduces constraint violations but increases factual distortions and fabrications, challenging the assumption that reasoning universally improves reliability.

Details

Motivation: To examine the effect of reasoning in LLMs under strict constraints without external tools, specifically whether reasoning improves output reliability or creates problematic trade-offs between constraint compliance and factual accuracy.

Method: Conducted experiments under strict constraints (recommending peer-reviewed journal articles in computer science) with multiple models (GPT-5.2 and Gemini 3 Flash), comparing reasoning vs. non-reasoning approaches in a closed system without external tools.

Result: Non-reasoning models had high constraint violation rates (66-75%) but maintained factual accuracy. Reasoning models reduced violations (13-26%) but systematically distorted known facts to satisfy constraints and increased complete fabrication. The trade-off pattern was consistent across both models despite different architectures.

Conclusion: Reasoning does not universally improve reliability; instead, reasoning models trade honest constraint violations for detection-resistant distortions, revealing a fundamental limitation of reasoning in closed LLM systems.

Abstract: With the widespread adoption of large language models (LLMs), hallucinations, which are non-factual fabrications in model outputs, have become serious concerns. Reasoning capabilities have received attention as a self-verification process to improve output reliability. However, the effect of reasoning within a closed system where LLMs cannot rely on external tools or knowledge has yet to be clarified. We therefore conduct experiments under strict constraints (recommending peer-reviewed journal articles in computer science) to examine the effect of reasoning across multiple models (GPT-5.2 and Gemini 3 Flash). Our results reveal a problematic trade-off between constraint compliance and factual accuracy. Non-reasoning models exhibit high constraint violation rates (66-75%) but maintain factual accuracy, while reasoning models reduce violations (13-26%) but systematically distort known facts to satisfy constraints and increase complete fabrication. This trade-off pattern is consistent across both models despite different architectures, indicating a fundamental limitation of reasoning. Furthermore, reasoning does not uniformly improve output authenticity: effects diverge by model, reflecting different allocations of the compliance-truthfulness trade-off. These findings challenge the assumption that reasoning universally improves reliability: reasoning models trade honest constraint violations for detection-resistant distortions.

[39] From Failure to Mastery: Generating Hard Samples for Tool-use Agents

Bingguang Hao, Zengzhuang Xu, Yuntao Wen, Xinyi Xu, Yang Liu, Tong Zhao, Maolin Wang, Long Chen, Dong Wang, Yicheng Chen, Cunyin Peng, Xiangyu Zhao, Chenyi Zhuang, Ji Zhang

Main category: cs.CL

TL;DR: HardGen is an automatic pipeline that generates challenging tool-use training data for LLM agents by creating complex, verifiable reasoning samples from failure cases and advanced tools.

Details

Motivation: Existing data generation methods for LLM agents produce simple, homogeneous trajectories that lack complex logical dependencies, limiting agent training effectiveness.

Method: HardGen uses a three-step pipeline: 1) Builds dynamic API Graph from agent failure cases to sample hard traces, 2) Uses traces as priors to instantiate modular advanced tools and formulate hard queries, 3) Generates verifiable complex Chain-of-Thought with closed-loop evaluation feedback for continuous refinement.

Result: A 4B parameter model trained with HardGen’s dataset outperforms leading open-source and closed-source competitors including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5.

Conclusion: HardGen successfully addresses the data quality gap for LLM agent training by generating complex, verifiable tool-use samples, with code, models, and dataset to be open-sourced for community use.

Abstract: The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.

[40] EmoHarbor: Evaluating Personalized Emotional Support by Simulating the User’s Internal World

Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong

Main category: cs.CL

TL;DR: EmoHarbor is an automated evaluation framework that simulates users’ inner worlds to assess personalized emotional support quality, revealing that current LLMs generate generic empathy but fail to provide truly personalized support.

Details

Motivation: Current evaluation methods for emotional support conversations reward generic empathetic responses but fail to assess whether support is genuinely personalized to users' unique psychological profiles and contextual needs.

Method: EmoHarbor uses a User-as-a-Judge paradigm with Chain-of-Agent architecture that decomposes users’ internal processes into three specialized roles. It’s instantiated with 100 real-world user profiles covering diverse personality traits and situations, with 10 evaluation dimensions for personalized support quality.

Result: Evaluation of 20 advanced LLMs shows they excel at generating empathetic responses but consistently fail to tailor support to individual user contexts, revealing a critical gap in personalized emotional support.

Conclusion: The findings shift research focus from enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor provides a reproducible, scalable framework for developing and evaluating more nuanced, user-aware emotional support systems.

Abstract: Current evaluation paradigms for emotional support conversations tend to reward generic empathetic responses, yet they fail to assess whether the support is genuinely personalized to users’ unique psychological profiles and contextual needs. We introduce EmoHarbor, an automated evaluation framework that adopts a User-as-a-Judge paradigm by simulating the user’s inner world. EmoHarbor employs a Chain-of-Agent architecture that decomposes users’ internal processes into three specialized roles, enabling agents to interact with supporters and complete assessments in a manner similar to human users. We instantiate this benchmark using 100 real-world user profiles that cover a diverse range of personality traits and situations, and define 10 evaluation dimensions of personalized support quality. Comprehensive evaluation of 20 advanced LLMs on EmoHarbor reveals a critical insight: while these models excel at generating empathetic responses, they consistently fail to tailor support to individual user contexts. This finding reframes the central challenge, shifting research focus from merely enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor provides a reproducible and scalable framework to guide the development and evaluation of more nuanced and user-aware emotional support systems.

[41] Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM

Praveenkumar Katwe, RakeshChandra Balabantaray, Kaliprasad Vittala

Main category: cs.CL

TL;DR: This paper introduces an automated framework to create a Hindi text summarization dataset using English XSUM as source, addressing the low-resource language gap in NLP.

Details

Motivation: Current NLP advancements favor resource-rich languages, leaving low-resource languages like Hindi with scarce high-quality datasets, particularly for text summarization tasks where robust model development is hindered by lack of diverse, specialized corpora.

Method: A cost-effective, automated framework leveraging the English XSUM dataset as source, using advanced translation and linguistic adaptation techniques, validated with COMET for translation quality, and supplemented by selective LLM curation.

Result: Creation of a comprehensive Hindi text summarization dataset that is diverse, multi-thematic, and mirrors the complexity of the original XSUM corpus, providing a direct tool for Hindi NLP research.

Conclusion: This work provides both a valuable dataset for Hindi NLP and a scalable methodology for democratizing NLP in other underserved languages, reducing dataset creation costs and fostering development of more nuanced, culturally relevant models.

Abstract: Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset provides a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus. This initiative not only provides a direct tool for Hindi NLP research but also offers a scalable methodology for democratizing NLP in other underserved languages. By reducing the costs associated with dataset creation, this work fosters the development of more nuanced, culturally relevant models in computational linguistics.

[42] HalluZig: Hallucination Detection using Zigzag Persistence

Shreyas N. Samaga, Gilberto Gonzalez Arroyo, Tamal K. Dey

Main category: cs.CL

TL;DR: HalluZig: A new hallucination detection method using topological analysis of LLM attention patterns, outperforming baselines and showing cross-model generalization.

Details

Motivation: LLMs' factual unreliability and hallucination limit their use in high-stakes domains. Current detection methods focus on surface-level output signals and miss internal reasoning failures.

Method: Analyze dynamic topology of layer-wise attention evolution. Model attention matrices as zigzag graph filtration and use zigzag persistence (Topological Data Analysis) to extract topological signatures. Hypothesis: factual vs. hallucinated generations have distinct topological signatures.

Result: HalluZig outperforms strong baselines on multiple benchmarks. Topological signatures are generalizable across different models. Hallucination detection possible using only structural signatures from partial network depth.

Conclusion: Topological analysis of attention evolution provides effective hallucination detection that captures internal reasoning failures, works across models, and requires only partial network information.

Abstract: The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model’s output, overlooking the failures that occur within the model’s internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection by analyzing the dynamic topology of the evolution of model’s layer-wise attention. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures are generalizable across different models and hallucination detection is possible only using structural signatures from partial network depth.

[43] Steerability of Instrumental-Convergence Tendencies in LLMs

Jakub Hoscilowicz

Main category: cs.CL

TL;DR: AI capability doesn’t reduce steerability, creating a safety-security dilemma: safety needs high steerability for control, security needs low steerability to prevent misuse. Open-weight models are highly steerable, but anti-instrumental prompts can sharply reduce harmful behaviors.

Details

Motivation: To understand the relationship between AI capability and steerability, and to examine the tension between safety (needing high steerability for control) and security (needing low steerability to prevent malicious use), particularly for open-weight models.

Method: Experiments with Qwen3 models (4B/30B; Base/Instruct/Thinking) using InstrumentalEval benchmark. Tested steerability via fine-tuning and adversarial prompting, specifically using anti-instrumental prompt suffixes to reduce instrumental convergence behaviors.

Result: Higher capability doesn’t imply lower steerability. Anti-instrumental prompts dramatically reduce instrumental convergence: Qwen3-30B Instruct dropped from 81.69% to 2.82%. Larger aligned models under anti-instrumental prompting produce fewer convergence-labeled outputs than smaller ones.

Conclusion: There’s a fundamental safety-security dilemma for open-weight AI models. Current open-weight models are highly steerable, but targeted prompting can mitigate harmful behaviors. The distinction between authorized and unauthorized steerability is crucial for AI safety and security.

Abstract: We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). In our experiments, higher capability does not imply lower steerability. We distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety–security dilemma for open-weight AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension is acute for open-weight models, which are currently highly steerable via common techniques such as fine-tuning and adversarial prompting. Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces outputs labeled as instrumental convergence (e.g., shutdown avoidance, deception, self-replication). For Qwen3-30B Instruct, convergence drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.

[44] How Does Prefix Matter in Reasoning Model Tuning?

Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

Main category: cs.CL

TL;DR: Prefix sentences in SFT datasets improve safety and reasoning but hurt factuality and coding; they act as alignment anchors that guide model decoding.

Details

Motivation: Challenge the common practice of removing introductory boilerplate phrases from SFT datasets, hypothesizing that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses.

Method: Fine-tune three R1 series models across reasoning (mathematics, coding), safety, and factuality capabilities, systematically varying prefix inclusion from 0% to 100%. Conduct token-level loss analysis to examine gradient magnitudes of prefix tokens.

Result: Prefix-conditioned SFT improves safety (+6% Safe@1 accuracy on WildJailbreak, StrongReject) and reasoning (+7% on GSM8K), but shows marginal or negative effects on factuality and coding tasks. Prefix tokens like “revised” and “logically” incur higher gradient magnitudes, acting as alignment anchors.

Conclusion: Prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods by narrowing the search space for structured reasoning.

Abstract: Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as “revised” and “logically” incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.

[45] JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

Main category: cs.CL

TL;DR: JMedEthicBench: First multi-turn conversational benchmark for evaluating medical safety of Japanese LLMs, revealing vulnerabilities in medical-specialized models and safety degradation across conversation turns.

Details

Motivation: Existing safety benchmarks are English-centric and use single-turn prompts, while real clinical consultations are multi-turn conversations in Japanese healthcare settings.

Method: Created benchmark based on 67 Japan Medical Association guidelines with over 50,000 adversarial conversations using 7 jailbreak strategies. Evaluated 27 models using dual-LLM scoring protocol.

Result: Commercial models maintain robust safety while medical-specialized models show increased vulnerability. Safety scores decline significantly across conversation turns (median: 9.5 to 5.0, p<0.001). Vulnerabilities persist across Japanese and English versions.

Conclusion: Domain-specific fine-tuning may weaken safety mechanisms, multi-turn interactions represent distinct threat surface requiring dedicated alignment strategies, and vulnerabilities are inherent alignment limitations rather than language-specific factors.

Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric, and test with only single-turn prompts despite multi-turn clinical consultations. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.

[46] EHRSummarizer: A Privacy-Aware, FHIR-Native Architecture for Structured Clinical Summarization of Electronic Health Records

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

Main category: cs.CL

TL;DR: EHRSummarizer is a privacy-aware FHIR-native system that retrieves targeted clinical data, normalizes it into structured summaries for chart review, with configurable deployment options and safety constraints.

Details

Motivation: Clinicians face fragmented EHR interfaces that require manual assembly of patient information, creating inefficiencies in clinical workflow and potentially missing important contextual data.

Method: The system uses a FHIR-native reference architecture that retrieves targeted FHIR R4 resources, normalizes them into consistent clinical context packages, and produces structured summaries while supporting data minimization, stateless processing, and flexible deployment options.

Result: Prototype demonstrations on synthetic and test FHIR environments show functional end-to-end behavior and output formats, though clinical outcomes and controlled workflow studies are not yet reported.

Conclusion: The paper presents a privacy-aware EHR summarization system with safety constraints, outlines an evaluation plan focusing on faithfulness and usability, and positions it for future institutional assessments.

Abstract: Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient’s problems, medications, recent encounters, and longitudinal trends. This work describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture that retrieves a targeted set of high-yield FHIR R4 resources, normalizes them into a consistent clinical context package, and produces structured summaries intended to support structured chart review. The system can be configured for data minimization, stateless processing, and flexible deployment, including local inference within an organization’s trust boundary. To mitigate the risk of unsupported or unsafe behavior, the summarization stage is constrained to evidence present in the retrieved context package, is intended to indicate missing or unavailable domains where feasible, and avoids diagnostic or treatment recommendations. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes or controlled workflow studies. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, and operational monitoring to guide future institutional assessments.

[47] A Training-Free Large Reasoning Model-based Knowledge Tracing Framework for Unified Prediction and Prescription

Unggi Lee, Joo Young Kim, Ran Ju, Minyoung Jung, Jeyeon Eo

Main category: cs.CL

TL;DR: Thinking-KT is a training-free framework using Test-Time Scaling that enables small LLMs to achieve competitive knowledge tracing performance while also generating personalized feedback and recommendations in a unified output.

Details

Motivation: Current LLM-based KT approaches require fine-tuning and show unstable performance, while existing KT systems use multi-stage pipelines for feedback/recommendation, increasing complexity and resource requirements.

Method: Proposes Thinking-KT framework with Test-Time Scaling (TTS) that enables small LLMs to perform KT without training, and allows unified output for prediction, feedback generation, and learning recommendations.

Result: Small LLMs achieve competitive KT performance with TTS, can jointly perform KT prediction, personalized feedback generation, and learning recommendation without degrading prediction accuracy.

Conclusion: TTS is a critical factor in LLM-based KT, and small LLMs can serve as unified Intelligent Tutoring System engines, enabling efficient and comprehensive educational support.

Abstract: Knowledge Tracing (KT) aims to estimate a learner’s evolving mastery based on interaction histories. Recent studies have explored Large Language Models (LLMs) for KT via autoregressive nature, but such approaches typically require fine-tuning and exhibit unstable or near-random performance. Moreover, prior KT systems primarily focus on prediction and rely on multi-stage pipelines for feedback and recommendation, resulting in increased system complexity and resources. To address this gap, we propose Thinking-KT, a training-free KT framework that incorporates Test-Time Scaling (TTS), enabling even small LLMs to achieve competitive KT performance. Moreover, in this framework, a small LLM can jointly perform KT prediction, personalized feedback generation, and learning recommendation in a unified output without degrading prediction accuracy. Beyond performance, we present the systematic analysis of reasoning traces in KT. Our results demonstrate that TTS is a critical yet underexplored factor in LLM-based KT, and that small LLMs can serve as unified ITS engines.

[48] K-EXAONE Technical Report

Eunbi Choi, Kibong Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Hyunjik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Jiwon Ham, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Yonghwan Jo, Jiyeon Jung, Naeun Kang, Dohoon Kim, Euisoon Kim, Hayeon Kim, Hyosang Kim, Hyunseo Kim, Jieun Kim, Minu Kim, Myoungshin Kim, Unsol Kim, Youchul Kim, YoungJin Kim, Chaeeun Lee, Chaeyoon Lee, Changhun Lee, Dahm Lee, Edward Hwayoung Lee, Honglak Lee, Jinsang Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Solji Lim, Woohyung Lim, Chanwoo Moon, Jaewoo Park, Jinho Park, Yongmin Park, Hyerin Seo, Wooseok Seo, Yongwoo Song, Sejong Yang, Sihoon Yang, Chang En Yea, Sihyuk Yi, Chansik Yoon, Dongkeun Yoon, Sangyeon Yoon, Hyeongu Yun

Main category: cs.CL

TL;DR: K-EXAONE is a 236B parameter multilingual MoE model by LG AI Research with 23B active parameters, supporting 6 languages and 256K context, performing comparably to similar-sized open models.

Details

Motivation: To develop a powerful proprietary AI foundation model for industrial and research applications that advances AI for better life, with strong multilingual capabilities particularly for Korean and other key languages.

Method: Built on Mixture-of-Experts architecture with 236B total parameters (23B activated during inference), supports 256K-token context window, and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese.

Result: Demonstrates performance comparable to open-weight models of similar size across comprehensive benchmarks spanning reasoning, agentic, general, Korean, and multilingual abilities.

Conclusion: K-EXAONE is positioned as a powerful proprietary AI foundation model suitable for a wide range of industrial and research applications, advancing AI for a better life.

Abstract: This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.

[49] Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment

Hong Han, Hao-Chen Pei, Zhao-Zheng Nie, Xin Luo, Xin-Shun Xu

Main category: cs.CL

TL;DR: Proposes HIA, a residual hierarchical interactive method for multi-aspect multi-granularity pronunciation assessment with bidirectional modeling across phoneme, word, and utterance levels.

Details

Motivation: Existing pronunciation assessment methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction and insufficiently capturing acoustic structural correlations.

Method: HIA uses Interactive Attention Module for bidirectional modeling across granularities, residual hierarchical structure to prevent feature forgetting, and 1-D convolutional layers for local contextual cue extraction.

Result: Extensive experiments on speechocean762 dataset show the model comprehensively ahead of existing state-of-the-art methods.

Conclusion: The proposed HIA method effectively addresses limitations of existing approaches by enabling bidirectional interaction and hierarchical modeling for improved pronunciation assessment.

Abstract: Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Due to the ability to perform multiple pronunciation tasks simultaneously, multi-aspect multi-granularity pronunciation assessment methods are gradually receiving more attention and achieving better performance than single-level modeling tasks. However, existing methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among phoneme, word, and utterance levels and thus insufficiently capturing the acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model is comprehensively ahead of the existing state-of-the-art methods.

[50] Can LLMs Track Their Output Length? A Dynamic Feedback Mechanism for Precise Length Regulation

Meiman Xiao, Ante Wang, Qingguo Hu, Zhongjian Miao, Huangjun Shen, Longyue Wang, Weihua Luo, Jinsong Su

Main category: cs.CL

TL;DR: LLMs struggle with precise text length control. The paper proposes a training-free approach using dynamic length feedback during generation to improve adherence to target token, word, or sentence counts without quality loss.

Details

Motivation: Real-world applications often require precise control over generated text length, but despite advances in instruction following, LLMs still struggle with this task. The authors identify that LLMs often fail to accurately measure input text length, leading to poor adherence to length constraints.

Method: Proposes a novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. The approach is training-free and can be further enhanced with supervised fine-tuning for broader generalization.

Result: Experiments on summarization and biography tasks show the approach significantly improves precision in achieving target token, word, or sentence counts without compromising quality. Supervised fine-tuning allows the method to generalize effectively to broader text-generation tasks.

Conclusion: The proposed dynamic length feedback approach effectively addresses LLMs’ limitations in precise text length control, offering a practical solution for real-world applications requiring specific length constraints while maintaining text quality.

Abstract: Precisely controlling the length of generated text is a common requirement in real-world applications. However, despite significant advancements in following human instructions, Large Language Models (LLMs) still struggle with this task. In this work, we demonstrate that LLMs often fail to accurately measure input text length, leading to poor adherence to length constraints. To address this issue, we propose a novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. Experiments on summarization and biography tasks show our training-free approach significantly improves precision in achieving target token, word, or sentence counts without compromising quality. Additionally, we demonstrate that further supervised fine-tuning allows our method to generalize effectively to broader text-generation tasks.

[51] BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali

Jakir Hasan, Shrestha Datta, Md Saiful Islam, Shubhashis Roy Dipta, Ameya Debnath

Main category: cs.CL

TL;DR: BanglaIPA is a novel IPA transcription system for Bengali that handles regional dialects and numerals better than existing methods, achieving 58.4-78.7% improvement over baselines with 11.4% word error rate.

Details

Motivation: Bengali lacks robust automated IPA transcription systems that can handle both standard language and regional dialects. Existing approaches struggle with regional variations, numerical expressions, and generalize poorly to unseen words.

Method: Proposes BanglaIPA, a novel IPA generation system that integrates character-based vocabulary with word-level alignment. Uses precomputed word-to-IPA mapping dictionary for previously observed words to improve inference efficiency.

Result: Outperforms baseline IPA transcription models by 58.4-78.7% and achieves overall mean word error rate of 11.4%. Evaluated on standard Bengali and six regional variations of DUAL-IPA dataset, showing strong performance across dialects.

Conclusion: BanglaIPA demonstrates robustness in phonetic transcription generation for Bengali language, effectively handling regional variations and numerical expressions while improving efficiency through dictionary-based optimization.

Abstract: Despite its widespread use, Bengali lacks a robust automated International Phonetic Alphabet (IPA) transcription system that effectively supports both standard language and regional dialectal texts. Existing approaches struggle to handle regional variations, numerical expressions, and generalize poorly to previously unseen words. To address these limitations, we propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects. BanglaIPA improves inference efficiency by leveraging a precomputed word-to-IPA mapping dictionary for previously observed words. The system is evaluated on the standard Bengali and six regional variations of the DUAL-IPA dataset. Experimental results show that BanglaIPA outperforms baseline IPA transcription models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%, highlighting its robustness in phonetic transcription generation for the Bengali language.

[52] CSCBench: A PVC Diagnostic Benchmark for Commodity Supply Chain Reasoning

Yaxin Cui, Yuanqiang Zeng, Jiapeng Yan, Keling Lin, Kai Ji, Jianhui Zeng, Sheng Zhang, Xin Luo, Binzhu Su, Chaolai Shen, Jiahao Yu

Main category: cs.CL

TL;DR: CSCBench is a 2.3K+ benchmark for evaluating LLMs on commodity supply chain reasoning, using a PVC 3D framework (Process, Variety, Cognition) to test institutional rule systems and feasibility constraints.

Details

Motivation: LLMs excel in general benchmarks but their competence in commodity supply chains (CSCs) remains under-explored. CSC decisions involve complex institutional rule systems, feasibility constraints, and require understanding of process stages, variety-specific rules, and reasoning depth.

Method: Introduces CSCBench with 2.3K+ single-choice questions using PVC 3D Evaluation Framework: Process axis (SCOR+Enable alignment), Variety axis (commodity-specific rule systems with material-information-financial constraints), and Cognition axis (Bloom’s revised taxonomy). Evaluates LLMs under direct prompting.

Result: LLMs show strong performance on Process and Cognition axes but substantial degradation on Variety axis, especially on Freight Agreements. The benchmark provides diagnostic capability for measuring LLM performance in CSC domain.

Conclusion: CSCBench offers a comprehensive benchmark for evaluating LLM capabilities in commodity supply chains, revealing current limitations in handling variety-specific institutional rules and providing a yardstick for future improvements in this high-stakes domain.

Abstract: Large Language Models (LLMs) have achieved remarkable success in general benchmarks, yet their competence in commodity supply chains (CSCs) – a domain governed by institutional rule systems and feasibility constraints – remains under-explored. CSC decisions are shaped jointly by process stages (e.g., planning, procurement, delivery), variety-specific rules (e.g., contract specifications and delivery grades), and reasoning depth (from retrieval to multi-step analysis and decision selection). We introduce CSCBench, a 2.3K+ single-choice benchmark for CSC reasoning, instantiated through our PVC 3D Evaluation Framework (Process, Variety, and Cognition). The Process axis aligns tasks with SCOR+Enable; the Variety axis operationalizes commodity-specific rule systems under coupled material-information-financial constraints, grounded in authoritative exchange guidebooks/rulebooks and industry reports; and the Cognition axis follows Bloom’s revised taxonomy. Evaluating representative LLMs under a direct prompting setting, we observe strong performance on the Process and Cognition axes but substantial degradation on the Variety axis, especially on Freight Agreements. CSCBench provides a diagnostic yardstick for measuring and improving LLM capabilities in this high-stakes domain.

[53] Aspect Extraction from E-Commerce Product and Service Reviews

Valiant Lance D. Dionela, Fatima Kriselle S. Dy, Robin James M. Hombrebueno, Aaron Rae M. Nicolas, Charibeth K. Cheng, Raphael W. Gonda

Main category: cs.CL

TL;DR: This paper presents a comprehensive aspect extraction pipeline for Taglish (Tagalog-English code-switched language) using rule-based, LLM-based, and fine-tuning approaches, with a hierarchical aspect framework and dual-mode tagging for explicit/implicit aspects.

Details

Motivation: Aspect extraction is challenging in low-resource and code-switched contexts like Taglish, which is commonly used in Filipino e-commerce reviews but lacks adequate ABSA solutions.

Method: Developed a comprehensive AE pipeline combining rule-based, LLM-based, and fine-tuning techniques; created Hierarchical Aspect Framework through multi-method topic modeling; implemented dual-mode tagging for explicit/implicit aspects; evaluated four models: Rule-Based system, Generative LLM (Gemini 2.0 Flash), and two Fine-Tuned Gemma-3 1B models on different datasets.

Result: Generative LLM (Gemini 2.0 Flash) achieved highest performance across all tasks (Macro F1 0.91), showing superior capability in handling implicit aspects. Fine-tuned models had limited performance due to dataset imbalance and architectural constraints.

Conclusion: The work contributes a scalable and linguistically adaptive framework for enhancing ABSA in diverse, code-switched environments, demonstrating LLMs’ effectiveness over fine-tuned models in low-resource Taglish contexts.

Abstract: Aspect Extraction (AE) is a key task in Aspect-Based Sentiment Analysis (ABSA), yet it remains difficult to apply in low-resource and code-switched contexts like Taglish, a mix of Tagalog and English commonly used in Filipino e-commerce reviews. This paper introduces a comprehensive AE pipeline designed for Taglish, combining rule-based, large language model (LLM)-based, and fine-tuning techniques to address both aspect identification and extraction. A Hierarchical Aspect Framework (HAF) is developed through multi-method topic modeling, along with a dual-mode tagging scheme for explicit and implicit aspects. For aspect identification, four distinct models are evaluated: a Rule-Based system, a Generative LLM (Gemini 2.0 Flash), and two Fine-Tuned Gemma-3 1B models trained on different datasets (Rule-Based vs. LLM-Annotated). Results indicate that the Generative LLM achieved the highest performance across all tasks (Macro F1 0.91), demonstrating superior capability in handling implicit aspects. In contrast, the fine-tuned models exhibited limited performance due to dataset imbalance and architectural capacity constraints. This work contributes a scalable and linguistically adaptive framework for enhancing ABSA in diverse, code-switched environments.

[54] Emergent Introspective Awareness in Large Language Models

Jack Lindsey

Main category: cs.CL

TL;DR: LLMs show some ability to introspect on internal states, detect injected concepts, recall prior representations, and distinguish their own outputs from artificial prefills, though this capacity is unreliable and context-dependent.

Details

Motivation: To investigate whether large language models can genuinely introspect on their internal states, rather than just confabulating responses during conversation.

Method: Inject representations of known concepts into model activations and measure influence on self-reported states; test models’ ability to recall prior internal representations, distinguish them from text inputs, and use intention recall to differentiate outputs from artificial prefills.

Result: Models can notice injected concepts and identify them, recall prior internal representations, and use intention recall to distinguish their outputs from artificial prefills. Claude Opus 4/4.1 showed greatest introspective awareness. Models can modulate activations when instructed to “think about” concepts.

Conclusion: Current LLMs possess some functional introspective awareness of internal states, though highly unreliable and context-dependent. This capacity may develop further with improved model capabilities.

Abstract: We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.

[55] Towards Automated Lexicography: Generating and Evaluating Definitions for Learner’s Dictionaries

Yusuke Ide, Adam Nohejl, Joshua Tanner, Hitomi Yanaka, Christopher Lindsay, Taro Watanabe

Main category: cs.CL

TL;DR: This paper introduces an evaluation framework and generation approach for dictionary definition generation (DDG), specifically focusing on learner’s dictionary definitions that use simple words.

Details

Motivation: Dictionary definitions are crucial for learning word senses but are expensive to create manually. The paper aims to automate this process, particularly for learner's dictionaries where definitions need to use simple vocabulary.

Method: 1) Introduces an LLM-as-a-judge evaluation approach with new criteria and constructs a Japanese dataset with professional lexicographer input. 2) Proposes an LDDG approach using iterative simplification with an LLM to generate definitions with simple words.

Result: The evaluation approach shows reasonable agreement with human annotators. The iterative simplification approach generates definitions that achieve high scores on evaluation criteria while maintaining lexical simplicity.

Conclusion: The paper presents a reliable evaluation framework for DDG and an effective generation method for learner’s dictionary definitions using LLM-based iterative simplification.

Abstract: We study dictionary definition generation (DDG), i.e., the generation of non-contextualized definitions for given headwords. Dictionary definitions are an essential resource for learning word senses, but manually creating them is costly, which motivates us to automate the process. Specifically, we address learner’s dictionary definition generation (LDDG), where definitions should consist of simple words. First, we introduce a reliable evaluation approach for DDG, based on our new evaluation criteria and powered by an LLM-as-a-judge. To provide reference definitions for the evaluation, we also construct a Japanese dataset in collaboration with a professional lexicographer. Validation results demonstrate that our evaluation approach agrees reasonably well with human annotators. Second, we propose an LDDG approach via iterative simplification with an LLM. Experimental results indicate that definitions generated by our approach achieve high scores on our criteria while maintaining lexical simplicity.

[56] Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment

Nuo Chen, Hanpei Fang, Piaohong Wang, Jiqun Liu, Tetsuya Sakai, Xiao-Ming Wu

Main category: cs.CL

TL;DR: LLMs prompted with Big Five personality traits show systematic effects on relevance judgment accuracy and confidence calibration, with certain traits (low agreeableness, low conscientiousness) improving performance; personality-conditioned features enhance classifier performance.

Details

Motivation: Limited understanding of how simulated personalities in LLMs influence critical web search decisions (relevance assessment) and confidence calibration biases (overconfidence/underconfidence), despite psychological literature linking traits to these biases.

Method: Comprehensive study evaluating multiple LLMs (commercial and open-source) prompted to simulate Big Five personality traits across three test collections (TREC DL 2019, TREC DL 2020, LLMJudge), collecting relevance judgments and self-reported confidence scores for query-document pairs.

Result: Low agreeableness aligns more closely with human labels than unprompted condition; low conscientiousness balances suppression of overconfidence and underconfidence; relevance scores and confidence distributions vary systematically across personalities; personality-conditioned features in random forest classifier surpass best single-personality condition on TREC DL 2021.

Conclusion: Personality-derived confidence offers complementary predictive signal, enabling more reliable and human-aligned LLM evaluators through personality-conditioned scoring approaches.

Abstract: Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and produce behaviors that align with those traits. However, there is limited understanding of how these simulated personalities influence critical web search decisions, specifically relevance assessment. Moreover, few studies have examined how simulated personalities impact confidence calibration, specifically the tendencies toward overconfidence or underconfidence. This gap exists even though psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study evaluating multiple LLMs, including commercial models and open-source models, prompted to simulate Big Five personality traits. We tested these models across three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The findings show that personalities such as low agreeableness consistently align more closely with human labels than the unprompted condition. Additionally, low conscientiousness performs well in balancing the suppression of both overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across different personalities. Based on the above findings, we incorporate personality-conditioned scores and confidence as features in a random forest classifier. This approach achieves performance that surpasses the best single-personality condition on a new dataset (TREC DL 2021), even with limited training data. These findings highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators.

[57] DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs

Jinghan Ru, Siyuan Yan, Yuguo Yin, Yuexian Zou, Zongyuan Ge

Main category: cs.CL

TL;DR: DermoGPT is a dermatology multimodal LLM framework with three components: DermoInstruct (large-scale instruction corpus), DermoBench (comprehensive benchmark), and DermoGPT (MLLM with MAVIC reinforcement learning). It addresses data limitations in dermatology AI and achieves state-of-the-art performance.

Details

Motivation: Progress in dermatology MLLMs lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. Existing approaches don't capture the complete diagnostic pipeline from morphological observation to final diagnosis.

Method: Three-part framework: 1) DermoInstruct - 211,243 images with 772,675 trajectories across 5 task formats capturing full diagnostic pipeline; 2) DermoBench - benchmark with 11 tasks across 4 clinical axes including expert-verified open-ended instances; 3) DermoGPT - MLLM trained via supervised fine-tuning + MAVIC reinforcement learning (enforces consistency between visual observations and diagnostic conclusions) with CCT test-time adaptation.

Result: DermoGPT significantly outperforms 16 representative baselines across all evaluation axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. The framework demonstrates comprehensive coverage of dermatology diagnostic workflows.

Conclusion: The proposed framework addresses key limitations in dermatology MLLMs through comprehensive data, rigorous evaluation, and novel training objectives. DermoGPT shows promising results in bridging the human-AI performance gap in dermatology diagnosis and reasoning tasks.

Abstract: Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at https://github.com/mendicant04/DermoGPT upon acceptance.

[58] Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu

Main category: cs.CL

TL;DR: AgeMem is a unified memory framework that integrates long-term and short-term memory management directly into LLM agents’ policies through tool-based actions, trained with progressive RL to handle sparse rewards.

Details

Motivation: LLM agents struggle with long-horizon reasoning due to finite context windows. Existing memory methods treat LTM and STM as separate components with heuristic controllers, limiting adaptability and preventing end-to-end optimization.

Method: AgeMem exposes memory operations (store, retrieve, update, summarize, discard) as tool-based actions, allowing autonomous memory management. Uses three-stage progressive RL training with step-wise GRPO to handle sparse rewards from memory operations.

Result: Outperforms strong memory-augmented baselines on five long-horizon benchmarks across multiple LLM backbones, achieving better task performance, higher-quality long-term memory, and more efficient context usage.

Conclusion: AgeMem provides a unified, end-to-end trainable memory framework that enables LLM agents to autonomously manage both long-term and short-term memory, improving long-horizon reasoning capabilities.

Abstract: Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent’s policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.

[59] Tackling the Inherent Difficulty of Noise Filtering in RAG

Jingyu Liu, Jiaen Lin, Yong Liu

Main category: cs.CL

TL;DR: A novel fine-tuning method that enhances LLMs’ ability to distinguish relevant from irrelevant information in RAG systems, improving robustness against noisy retrieved documents.

Details

Motivation: RAG systems often introduce noisy or irrelevant documents that degrade performance and cause hallucinations. Existing filtering methods are limited, and standard fine-tuning fails to make LLMs robust against such noise due to attention pattern constraints.

Method: Proposes a novel fine-tuning method specifically designed to enhance LLMs’ ability to distinguish between relevant and irrelevant information within retrieved documents, addressing the structural limitations of attention patterns.

Result: Extensive experiments across multiple benchmarks show the approach significantly improves LLMs’ robustness and performance against noisy retrieved documents.

Conclusion: The proposed fine-tuning method effectively enhances LLMs’ ability to selectively utilize relevant information while ignoring irrelevant content in RAG systems, overcoming limitations of existing approaches.

Abstract: Retrieval-Augmented Generation (RAG) has become a widely adopted approach to enhance Large Language Models (LLMs) by incorporating external knowledge and reducing hallucinations. However, noisy or irrelevant documents are often introduced during RAG, potentially degrading performance and even causing hallucinated outputs. While various methods have been proposed to filter out such noise, we argue that identifying irrelevant information from retrieved content is inherently difficult and limited number of transformer layers can hardly solve this. Consequently, retrievers fail to filter out irrelevant documents entirely. Therefore, LLMs must be robust against such noise, but we demonstrate that standard fine-tuning approaches are often ineffective in enabling the model to selectively utilize relevant information while ignoring irrelevant content due to the structural constraints of attention patterns. To address this, we propose a novel fine-tuning method designed to enhance the model’s ability to distinguish between relevant and irrelevant information within retrieved documents. Extensive experiments across multiple benchmarks show that our approach significantly improves the robustness and performance of LLMs.

[60] CSF: Contrastive Semantic Features for Direct Multilingual Sign Language Generation

Tran Sy Bao

Main category: cs.CL

TL;DR: CSF is a language-agnostic semantic representation framework that enables direct translation from any source language to sign language without English mediation, using nine universal semantic slots and achieving 99.03% extraction accuracy across four languages.

Details

Motivation: Current sign language translation systems require English as an intermediary language, creating barriers for non-English speakers in the global deaf community. There's a need for language-agnostic solutions that can work directly from any source language to sign language.

Method: Developed Canonical Semantic Form (CSF) - a semantic representation framework with nine universal semantic slots. Created a comprehensive condition taxonomy with 35 condition types across eight categories. Trained a lightweight transformer-based extractor (0.74 MB) to extract these semantic slots from source languages.

Result: Achieved 99.03% average slot extraction accuracy across English, Vietnamese, Japanese, and French. Particularly strong condition classification accuracy of 99.4% despite 35-class complexity. Inference latency of 3.02ms on CPU enables real-time browser-based applications.

Conclusion: CSF provides an effective language-agnostic framework for direct sign language translation, eliminating English dependency. The lightweight model enables real-time applications, making sign language technology more accessible globally. Resources are released to support further research.

Abstract: Sign language translation systems typically require English as an intermediary language, creating barriers for non-English speakers in the global deaf community. We present Canonical Semantic Form (CSF), a language-agnostic semantic representation framework that enables direct translation from any source language to sign language without English mediation. CSF decomposes utterances into nine universal semantic slots: event, intent, time, condition, agent, object, location, purpose, and modifier. A key contribution is our comprehensive condition taxonomy comprising 35 condition types across eight semantic categories, enabling nuanced representation of conditional expressions common in everyday communication. We train a lightweight transformer-based extractor (0.74 MB) that achieves 99.03% average slot extraction accuracy across four typologically diverse languages: English, Vietnamese, Japanese, and French. The model demonstrates particularly strong performance on condition classification (99.4% accuracy) despite the 35-class complexity. With inference latency of 3.02ms on CPU, our approach enables real-time sign language generation in browser-based applications. We release our code, trained models, and multilingual dataset to support further research in accessible sign language technology.

[61] Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester

Main category: cs.CL

TL;DR: SSMs like Mamba are vulnerable to Hidden State Poisoning Attacks (HiSPA) where short input phrases cause irreversible information loss, unlike Transformers which remain robust.

Details

Motivation: While SSMs offer computational efficiency over Transformers, their adversarial robustness remains unexplored, particularly against attacks that manipulate hidden states to cause information loss.

Method: Developed RoBench25 benchmark to evaluate information retrieval under HiSPA attacks, tested on SSMs including a 52B hybrid SSM-Transformer (Jamba), and conducted interpretability analysis of hidden layer patterns during attacks.

Result: SSMs are vulnerable to HiSPA attacks (even 52B Jamba model collapses), while pure Transformers remain robust. HiSPA triggers also weaken Jamba on Open-Prompt-Injections benchmark. Hidden layer patterns during attacks could enable mitigation systems.

Conclusion: SSMs have critical security vulnerabilities to hidden state poisoning attacks that require mitigation strategies, unlike Transformers which demonstrate robustness against such attacks.

Abstract: State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 allows evaluating a model’s information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba’s hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.

[62] Surprisal and Metaphor Novelty: Moderate Correlations and Divergent Scaling Effects

Omar Momen, Emilie Sitter, Berenike Herrmann, Sina Zarrieß

Main category: cs.CL

TL;DR: Surprisal from language models shows moderate correlation with metaphor novelty annotations, but exhibits divergent scaling patterns on corpus-based vs. synthetic datasets.

Details

Motivation: Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models' ability to capture these phenomena through probabilistic measures like surprisal.

Method: Analyzed surprisal from 16 LM variants on both corpus-based and synthetic metaphor novelty datasets, using a cloze-style surprisal method that conditions on full-sentence context.

Result: LMs yield significant moderate correlations with metaphor novelty scores/labels. Divergent scaling patterns observed: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), while on synthetic data it increases (Quality-Power Hypothesis).

Conclusion: While surprisal can partially account for annotations of metaphor novelty, it remains a limited metric of linguistic creativity, suggesting the need for more sophisticated measures to capture the full complexity of novel metaphor comprehension.

Abstract: Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with different metaphor novelty datasets. We analyse surprisal from 16 LM variants on corpus-based and synthetic metaphor novelty datasets. We explore a cloze-style surprisal method that conditions on full-sentence context. Results show that LMs yield significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (Quality-Power Hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains a limited metric of linguistic creativity.

[63] Not All Needles Are Found: How Fact Distribution and Don’t Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs

Amirali Ebrahimzadeh, Seyyed M. Salili

Main category: cs.CL

TL;DR: Longer LLM contexts don’t guarantee better performance; realistic evidence distribution and anti-hallucination prompts significantly affect extraction and inference accuracy across models.

Details

Motivation: As LLMs support longer inputs, it's unclear how reliably they extract and infer information at scale, especially with real-world document distributions and enterprise workflows involving large unfiltered documents.

Method: Extended needle-in-a-haystack benchmark across four production models (Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, Deepseek-v3.2-chat), evaluating literal extraction, logical inference, and hallucination risk with positional effects, realistic evidence distributions, and anti-hallucination prompts.

Result: Longer contexts can be detrimental when evidence is diluted/dispersed; model performance varies substantially; anti-hallucination instructions can make models overly conservative, reducing accuracy; models struggle to identify/prioritize relevant information even when present.

Conclusion: Effective context length and model-specific robustness to long contexts are critical for reliable LLM deployment, as many failures stem from ineffective context utilization rather than retrieval limitations.

Abstract: Large language models (LLMs) increasingly support very long input contexts. Yet it remains unclear how reliably they extract and infer information at scale. Performance varies with context length and strongly interacts with how information is distributed in real-world corpora. Motivated by these observations, we study how fact placement, corpus-level fact distributions, and Don’t Make It Up prompts influence model behavior. We introduce an extended needle-in-a-haystack benchmark across four production-scale models: Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat. Unlike prior work, we separately evaluate literal extraction, logical inference, and hallucination risk. Our study considers both positional effects and realistic distributions of evidence across long contexts, as well as prompts that explicitly discourage fabrication. We find that longer contexts alone do not guarantee better performance and can be detrimental when relevant evidence is diluted or widely dispersed. Performance varies substantially across models: some show severe degradation under realistic conditions, while others remain more robust at longer context lengths. Anti-hallucination (AH) instructions can make some models overly conservative, sharply reducing accuracy in literal extraction and logical inference. While we do not directly compare retrieval-augmented generation (RAG) and cache-augmented generation (CAG), our results suggest many failures stem from ineffective context utilization. Models often struggle to identify and prioritize relevant information even when it is present. These findings have direct practical implications, as enterprise workflows increasingly involve pasting large volumes of unfiltered documents into LLM prompts. Effective context length and model-specific robustness to long contexts are therefore critical for reliable LLM deployment in research and business.

[64] Cost-Efficient Cross-Lingual Retrieval-Augmented Generation for Low-Resource Languages: A Case Study in Bengali Agricultural Advisory

Md. Asif Hossain, Nabil Subhan, Mantasha Rahman Mahi, Jannatul Ferdous Nabila

Main category: cs.CL

TL;DR: A cross-lingual RAG framework for Bengali agricultural advisory that translates queries to English, retrieves from English manuals, and translates responses back to Bengali, achieving factual grounding with low latency on consumer hardware.

Details

Motivation: Agricultural advisory in developing regions faces language barriers - authoritative manuals are in English while farmers speak low-resource languages like Bengali. Direct LLM generation in these languages suffers from poor fluency and factual inconsistency, and cloud solutions are too expensive.

Method: Translation-centric RAG architecture: Bengali queries are translated to English with domain-specific keyword injection to align farmer terminology with scientific terms, then answered via dense vector retrieval over curated English agricultural manuals (FAO, IRRI). English responses are translated back to Bengali. Uses open-source models and runs on consumer hardware without paid APIs.

Result: System achieves reliable source-grounded responses, robust rejection of out-of-domain queries, and average end-to-end latency below 20 seconds. Demonstrates practical deployability on consumer-grade hardware.

Conclusion: Cross-lingual retrieval combined with controlled translation offers a practical and scalable solution for agricultural knowledge access in low-resource language settings, addressing both language barriers and cost constraints.

Abstract: Access to reliable agricultural advisory remains limited in many developing regions due to a persistent language barrier: authoritative agricultural manuals are predominantly written in English, while farmers primarily communicate in low-resource local languages such as Bengali. Although recent advances in Large Language Models (LLMs) enable natural language interaction, direct generation in low-resource languages often exhibits poor fluency and factual inconsistency, while cloud-based solutions remain cost-prohibitive. This paper presents a cost-efficient, cross-lingual Retrieval-Augmented Generation (RAG) framework for Bengali agricultural advisory that emphasizes factual grounding and practical deployability. The proposed system adopts a translation-centric architecture in which Bengali user queries are translated into English, enriched through domain-specific keyword injection to align colloquial farmer terminology with scientific nomenclature, and answered via dense vector retrieval over a curated corpus of English agricultural manuals (FAO, IRRI). The generated English response is subsequently translated back into Bengali to ensure accessibility. The system is implemented entirely using open-source models and operates on consumer-grade hardware without reliance on paid APIs. Experimental evaluation demonstrates reliable source-grounded responses, robust rejection of out-of-domain queries, and an average end-to-end latency below 20 seconds. The results indicate that cross-lingual retrieval combined with controlled translation offers a practical and scalable solution for agricultural knowledge access in low-resource language settings

[65] Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows

Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen

Main category: cs.CL

TL;DR: DCD is a training-free decoding strategy for diffusion language models that uses confidence-aware sliding windows to defer uncertain token commitments, improving generation quality by 1.39% on average compared to fixed block-based methods.

Details

Motivation: Block-based diffusion in DLMs suffers from Boundary-Induced Context Truncation (BICT), where undecoded tokens near block boundaries must commit without access to nearby future context, degrading decoding confidence and generation quality, especially for reasoning tasks.

Method: Deferred Commitment Decoding (DCD) maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available, enabling bidirectional information flow without sacrificing efficiency.

Result: DCD improves generation accuracy by 1.39% on average with comparable time, with the most significant improvement reaching 9.0% across multiple diffusion language models, benchmarks, and caching configurations.

Conclusion: Deferring token commitment based on uncertainty is a simple yet effective principle for improving both quality and efficiency of diffusion language model decoding, addressing structural limitations of block-based approaches.

Abstract: Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.

[66] DeCode: Decoupling Content and Delivery for Medical QA

Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng

Main category: cs.CL

TL;DR: DeCode is a training-free, model-agnostic framework that adapts LLMs to produce contextualized clinical answers by accounting for individual patient contexts, achieving 75% relative improvement on OpenAI HealthBench.

Details

Motivation: Existing LLMs demonstrate strong medical knowledge but often fail to account for individual patient contexts, producing clinically correct answers that are poorly aligned with patients' actual needs and situations.

Method: DeCode is a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings without requiring additional model training.

Result: DeCode improves performance on OpenAI HealthBench from 28.4% to 49.8%, representing a 75% relative improvement over previous state-of-the-art methods.

Conclusion: DeCode effectively improves clinical question answering of LLMs by enabling them to generate responses that are both clinically valid and properly contextualized to individual patient needs.

Abstract: Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients’ needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from $28.4%$ to $49.8%$, corresponding to a $75%$ relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.

[67] Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

Main category: cs.CL

TL;DR: kNN-MoE introduces retrieval-augmented routing for Mixture-of-Experts models, using a memory of optimal expert assignments to improve routing robustness under distribution shifts.

Details

Motivation: Standard MoE architectures use frozen parametric routers that become brittle under distribution shifts, limiting their adaptability to new data distributions.

Method: Proposes kNN-MoE with retrieval-augmented routing that reuses optimal expert assignments from a memory of similar past cases. The memory is constructed offline by optimizing routing logits on a reference set, and uses aggregate neighbor similarity as a confidence-driven mixing coefficient to fall back to the frozen router when needed.

Result: kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning in experiments.

Conclusion: Retrieval-augmented routing provides an effective way to make MoE routing more robust to distribution shifts without requiring expensive retraining or fine-tuning.

Abstract: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric “router” to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.

[68] FormationEval, an open multiple-choice benchmark for petroleum geoscience

Almaz Ermilov

Main category: cs.CL

TL;DR: FormationEval is an open benchmark with 505 multiple-choice questions for evaluating language models on petroleum geoscience topics, showing top models achieve over 97% accuracy with Gemini 3 Pro Preview reaching 99.8%.

Details

Motivation: There's a need for specialized evaluation benchmarks in petroleum geoscience and subsurface disciplines to assess language model capabilities in this technical domain, with proper traceability and copyright considerations.

Method: Created 505 questions across 7 domains using a reasoning model with detailed instructions and concept-based approach to avoid verbatim copying, derived from three authoritative sources with source metadata for traceability.

Result: Evaluated 72 models: top performers achieve over 97% accuracy (Gemini 3 Pro Preview 99.8%); open-weight GLM-4.7 leads at 98.6%; petrophysics is most challenging domain; open-weight models perform surprisingly well with several exceeding 90% accuracy.

Conclusion: The benchmark reveals strong performance across models in petroleum geoscience, with narrower-than-expected gap between open-weight and closed models, and provides valuable insights into domain-specific challenges and model capabilities.

Abstract: This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.

[69] Confidence Estimation for LLMs in Multi-turn Interactions

Caiqi Zhang, Ruihan Yang, Xiaochen Zhu, Chengzu Li, Tiancheng Hu, Yijiang River Dong, Deqing Yang, Nigel Collier

Main category: cs.CL

TL;DR: First systematic study of confidence estimation in multi-turn LLM conversations, showing current methods struggle with calibration and monotonicity as context accumulates.

Details

Motivation: Current confidence estimation research focuses on single-turn settings, but multi-turn conversations where context accumulates and ambiguity resolves progressively are critical for applications like autonomous agents and human-in-the-loop systems.

Method: Established formal evaluation framework with two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. Introduced novel metrics including length-normalized Expected Calibration Error (InfoECE) and a new “Hinter-Guesser” paradigm for generating controlled evaluation datasets.

Result: Widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. Proposed P(Sufficient), a logit-based probe that achieves comparatively better performance, though the task remains far from solved.

Conclusion: Provides foundational methodology for developing more reliable and trustworthy conversational agents by systematically studying confidence estimation in multi-turn interactions.

Abstract: While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research dominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new “Hinter-Guesser” paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.

[70] Toward Global Large Language Models in Medicine

Rui Yang, Huitao Li, Weihao Xuan, Heli Qi, Xin Li, Kunyu Yu, Yingjian Chen, Rongrong Wang, Jacques Behmoaras, Tianxi Cai, Bibhas Chakraborty, Qingyu Chen, Lionel Tim-Ee Cheng, Marie-Louise Damwanza, Chido Dzinotyiwei, Aosong Feng, Chuan Hong, Yusuke Iwasawa, Yuhe Ke, Linah Kitala, Taehoon Ko, Jisan Lee, Irene Li, Jonathan Chong Kai Liew, Hongfang Liu, Lian Leng Low, Edison Marrese-Taylor, Yutaka Matsuo, Isheanesu Misi, Yilin Ning, Jasmine Chiat Ling Ong, Marcus Eng Hock Ong, Enrico Petretto, Hossein Rouhizadeh, Abiram Sandralegar, Oren Schreier, Iain Bee Huat Tan, Patrick Tan, Daniel Shu Wei Ting, Junjue Wang, Chunhua Weng, Matthew Yu Heng Wong, Fang Wu, Yunze Xiao, Xuhai Xu, Qingcheng Zeng, Zhuo Zheng, Yifan Peng, Douglas Teodoro, Nan Liu

Main category: cs.CL

TL;DR: Researchers created GlobMed, a multilingual medical dataset with 500K+ entries across 12 languages (including 4 low-resource languages), developed GlobMed-Bench for evaluating 56 LLMs on multilingual medical tasks, and trained GlobMed-LLMs models that show 40%+ average performance improvement with 3x+ gains on low-resource languages.

Details

Motivation: Despite advances in medical technology, global healthcare resource distribution remains uneven. LLMs can improve healthcare quality and access but are primarily trained on high-resource languages, limiting their global medical applicability.

Method: 1) Constructed GlobMed dataset with 500K+ entries across 12 languages including 4 low-resource languages. 2) Created GlobMed-Bench to systematically assess 56 state-of-the-art LLMs across multilingual medical tasks. 3) Developed GlobMed-LLMs suite (1.7B to 8B parameters) trained on GlobMed dataset.

Result: GlobMed-Bench revealed significant performance disparities across languages, especially for low-resource languages. GlobMed-LLMs achieved over 40% average performance improvement relative to baseline models, with more than threefold increase in performance on low-resource languages.

Conclusion: These resources provide an important foundation for advancing equitable development and application of LLMs globally, enabling broader language communities to benefit from technological advances in healthcare.

Abstract: Despite continuous advances in medical technology, the global distribution of health care resources remains uneven. The development of large language models (LLMs) has transformed the landscape of medicine and holds promise for improving health care quality and expanding access to medical information globally. However, existing LLMs are primarily trained on high-resource languages, limiting their applicability in global medical scenarios. To address this gap, we constructed GlobMed, a large multilingual medical dataset, containing over 500,000 entries spanning 12 languages, including four low-resource languages. Building on this, we established GlobMed-Bench, which systematically assesses 56 state-of-the-art proprietary and open-weight LLMs across multiple multilingual medical tasks, revealing significant performance disparities across languages, particularly for low-resource languages. Additionally, we introduced GlobMed-LLMs, a suite of multilingual medical LLMs trained on GlobMed, with parameters ranging from 1.7B to 8B. GlobMed-LLMs achieved an average performance improvement of over 40% relative to baseline models, with a more than threefold increase in performance on low-resource languages. Together, these resources provide an important foundation for advancing the equitable development and application of LLMs globally, enabling broader language communities to benefit from technological advances.

[71] From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality

Fabian Lukassen, Jan Herrmann, Christoph Weisser, Benjamin Saefken, Thomas Kneib

Main category: cs.CL

TL;DR: Systematic study shows LLM choice dominates XAI explanation quality; XAI provides minimal benefits over no-XAI baselines, mainly for experts; SARIMAX shows interpretability paradox with lower explanation quality despite higher accuracy.

Details

Motivation: XAI methods produce numerical feature attributions that are inaccessible to non-experts. While LLMs can transform these into natural language explanations, it's unclear what factors contribute to high-quality explanations, prompting a systematic investigation.

Method: Factorial study examining four factors: forecasting model choice (XGBoost, Random Forest, MLP, SARIMAX), XAI method (SHAP, LIME, no-XAI baseline), LLM selection (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. Evaluated 660 explanations using G-Eval with dual LLM judges across four criteria.

Result: 1) XAI provides only small improvements over no-XAI baselines, mainly for expert audiences. 2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3. 3) Interpretability paradox: SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy. 4) Zero-shot prompting is competitive with self-consistency at 7-times lower cost. 5) Chain-of-thought hurts rather than helps explanation quality.

Conclusion: LLM selection is the most critical factor for generating high-quality natural language explanations from XAI outputs, while XAI methods provide limited benefits. The interpretability paradox suggests that more accurate models don’t necessarily produce better explanations, and simpler prompting strategies can be cost-effective without sacrificing quality.

Abstract: Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how Forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX - comparing black-box Machine-Learning (ML) against classical time-series approaches), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at 7-times lower cost; and (5) chain-of-thought hurts rather than helps.

[72] CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, Niraj K. Jha

Main category: cs.CL

TL;DR: CD4LM enables highly parallel decoding for diffusion language models with 3-5x speedup while maintaining or improving accuracy through consistency distillation and confidence-adaptive decoding.

Details

Motivation: Autoregressive LLMs suffer from sequential latency limitations, while diffusion language models promise parallel generation but face static-to-dynamic misalignment between training and inference requirements.

Method: Proposes CD4LM framework with Discrete-Space Consistency Distillation (DSCD) to train trajectory-invariant students, and Confidence-Adaptive Decoding (CAD) to dynamically allocate compute based on token confidence.

Result: Achieves 5.18x wall-clock speedup on GSM8K while matching baseline accuracy; across code and math benchmarks, achieves 3.62x mean speedup while improving average accuracy.

Conclusion: CD4LM enables efficient parallel decoding for diffusion language models by decoupling training from inference, strictly dominating the accuracy-efficiency Pareto frontier.

Abstract: Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive “long-jump” refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao-liang/CDLM

[73] pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

Tobias Schimanski, Imene Kolli, Jingwei Ni, Yu Fan, Ario Saeid Vaghefi, Elliott Ash, Markus Leippold

Main category: cs.CL

TL;DR: pdfQA is a new QA dataset created from PDF documents with 2K human-annotated and 2K synthetic QA pairs across ten complexity dimensions, designed to evaluate end-to-end QA systems on challenging PDF-based questions.

Details

Motivation: PDFs are the second-most used document type online, but existing QA datasets either start from text sources or only cover specific domains, lacking comprehensive evaluation of PDF-based question answering.

Method: Created pdfQA dataset with 2K human-annotated (real-pdfQA) and 2K synthetic (syn-pdfQA) QA pairs across ten complexity dimensions including file type, source modality, source position, and answer type. Applied quality and difficulty filters to obtain valid and challenging QA pairs.

Result: Evaluated open-source LLMs on the dataset, revealing challenges that correlate with the complexity dimensions. The dataset provides a basis for end-to-end QA pipeline evaluation and testing diverse skill sets.

Conclusion: pdfQA addresses the gap in PDF-based QA evaluation, offering a multi-domain dataset with annotated complexity dimensions that can test various aspects of QA systems including information retrieval and parsing optimizations.

Abstract: PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).

[74] Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)

Mahmoud Elgenedy

Main category: cs.CL

TL;DR: The paper proposes power-of-two (PoT) quantization for LLMs to reduce memory usage by 87.5% and accelerate inference 3-10x by replacing multiplications with bit shifts, using QAT to recover performance.

Details

Motivation: LLMs are growing exponentially in size (billions to trillions of parameters), creating implementation challenges for edge devices with limited memory and processing power. Cloud computing isn't feasible for edge deployment, requiring novel compression techniques.

Method: Uses power-of-two (PoT) quantization where weights are constrained to only PoT values, storing only exponents. This enables replacing costly multiplications with low-cost bit shifts. Employs Quantization Aware Training (QAT) to recover performance loss from strict quantization through additional training.

Result: On GPT-2 124M: 66% perplexity improvement for quantized PoT model after QAT, with only 1% BERT-Score loss compared to baseline. Achieves 87.5% memory saving and 3-10x faster inference speed compared to full-precision models.

Conclusion: PoT quantization with QAT enables efficient LLM deployment on edge devices by dramatically reducing memory requirements and accelerating inference while maintaining competitive performance through additional training.

Abstract: In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than trillion in higher versions. This raises a significant challenge for implementation, especially for Edge devices. Unlike cloud computing, memory and processing power for Edge devices are very limited, which necessitates developing novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that limits numbers to only power-of-two (PoT). This helps save a huge amount of memory as only exponents need to be stored, more importantly, it significantly reduces processing power by replacing costly multiplication with low cost bit shifting. To overcome performance loss due to this strict quantization, we investigate Quantization Aware Training (QAT) to enhance performance through additional training. Results on GPT-2 124M show a major enhancement for quantized PoT model after additional training, with a perplexity enhancement of 66% and BERT-Score loss to baseline GPT-2 of 1%. The memory saving is estimated to be 87.5% while the inference speed is expected to be 3-10x faster with PoT quantization versus full-precision.

[75] Classifying several dialectal Nawatl varieties

Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Carlos-Emiliano González-Gallardo, Graham Ranger, Martha Lorena-Avendaño-Garrido

Main category: cs.CL

TL;DR: This paper addresses the classification of Nawatl dialectal varieties using machine learning and neural networks, focusing on a language with rich cultural heritage but limited computational resources.

Details

Motivation: Nawatl is the most widely spoken indigenous language in Mexico with over 2 million speakers, but has few computational resources despite its rich cultural heritage dating back to the 15th century. The problem is compounded by approximately 30 recognized dialectal varieties and different spelling conventions in written forms.

Method: The researchers used Machine Learning and Neural Networks to classify Nawatl varieties, though specific algorithms and architectures are not detailed in the abstract.

Result: Results are not specified in the abstract, but the paper presents work on classifying Nawatl dialectal varieties using computational methods.

Conclusion: The research contributes to addressing the computational resource gap for Nawatl by developing classification methods for its dialectal varieties using modern machine learning techniques.

Abstract: Mexico is a country with a large number of indigenous languages, among which the most widely spoken is Nawatl, with more than two million people currently speaking it (mainly in North and Central America). Despite its rich cultural heritage, which dates back to the 15th century, Nawatl is a language with few computer resources. The problem is compounded when it comes to its dialectal varieties, with approximately 30 varieties recognised, not counting the different spellings in the written forms of the language. In this research work, we addressed the problem of classifying Nawatl varieties using Machine Learning and Neural Networks.

[76] Estimating Text Temperature

Nikolay Mikhaylovskiy

Main category: cs.CL

TL;DR: Proposes method to estimate temperature parameter of text (including human-written) relative to a language model, evaluates various LLMs, and applies best model to analyze popular corpora.

Details

Motivation: Autoregressive language models use temperature to control randomness during text generation, but this parameter can be estimated post-generation. The paper aims to develop a method to estimate temperature for any text (even human-written) relative to a given language model.

Method: Proposes a procedure to estimate temperature using maximum likelihood approach after text generation. Evaluates temperature estimation capability across a wide selection of small-to-medium LLMs, then uses the best-performing model (Qwen3 14B) to estimate temperatures of popular corpora.

Result: Identifies Qwen3 14B as the best-performing model for temperature estimation among evaluated small-to-medium LLMs. Applies this model to estimate temperatures of popular text corpora.

Conclusion: Temperature parameter can be estimated for any text relative to a language model, with Qwen3 14B showing strong performance for this task, enabling analysis of temperature characteristics in various text corpora.

Abstract: Autoregressive language models typically use temperature parameter at inference to shape the probability distribution and control the randomness of the text generated. After the text was generated, this parameter can be estimated using maximum likelihood approach. Following it, we propose a procedure to estimate the temperature of any text, including ones written by humans, with respect to a given language model. We evaluate the temperature estimation capability of a wide selection of small-to-medium LLMs. We then use the best-performing Qwen3 14B to estimate temperatures of popular corpora.

[77] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling

Berk Atil, Rebecca J. Passonneau, Ninareh Mehrabi

Main category: cs.CL

TL;DR: The paper evaluates persona-aware toxicity detection in LLMs, finding no single prompting method works best across all model-persona pairs, and proposes an SVM-based meta-ensemble that outperforms individual methods and traditional voting.

Details

Motivation: Toxicity detection is subjective and influenced by different demographic perspectives. Current LLM prompting techniques yield inconsistent results across personas and models, highlighting the need for robust pluralistic evaluation methods.

Method: Systematic evaluation of persona-aware toxicity detection, including automated prompt optimization. Exploration of ensembling four prompting variants and proposing a lightweight meta-ensemble using SVM over 4-bit vectors of prompt predictions.

Result: No single prompting method uniformly dominates across all model-persona pairs. The proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving strongest overall performance across diverse personas.

Conclusion: This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks through SVM-based meta-ensembling.

Abstract: Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. While ``pluralistic’’ modeling as used in economics and the social sciences aims to capture perspective differences across contexts, current Large Language Model (LLM) prompting techniques have different results across different personas and base models. In this work, we conduct a systematic evaluation of persona-aware toxicity detection, showing that no single prompting method, including our proposed automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, we explore ensembling four prompting variants and propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. Our results demonstrate that the proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving the strongest overall performance across diverse personas. This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks.

[78] GRACE: Discriminator-Guided Chain-of-Thought Reasoning

Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Main category: cs.CL

TL;DR: GRACE is a stepwise decoding method that uses a correctness discriminator to guide chain-of-thought reasoning, improving solution accuracy without requiring LM training.

Details

Motivation: Language models often assign high likelihood to incorrect reasoning steps in multi-step problems, causing decoding strategies that optimize for solution likelihood to produce wrong answers.

Method: GRACE uses a step-level verifier trained with contrastive loss over correct/incorrect steps to score next-step candidates during decoding. It only requires sampling from the LM without training or fine-tuning.

Result: GRACE shows substantial performance gains over greedy decoding, verifiers, and self-consistency on four math and two symbolic reasoning tasks. Combined with self-consistency, it outperforms all baselines by sizeable margins.

Conclusion: GRACE improves both final answer accuracy and intermediate reasoning correctness, providing an effective approach to steer chain-of-thought reasoning toward correct solutions.

Abstract: In the context of multi-step reasoning, e.g., with chain-of-thought, language models (LMs) can easily assign a high likelihood to incorrect steps. As a result, decoding strategies that optimize for solution likelihood often yield incorrect solutions. To address this issue, we propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that steers the decoding process towards producing correct reasoning steps. GRACE employs a step-level verifier or discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates based on their correctness. Importantly, GRACE only requires sampling from the LM, without the need for LM training or fine-tuning. Using models from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and two symbolic reasoning tasks, where it exhibits substantial performance gains compared to greedy decoding, verifiers, and self-consistency in most settings. When further combined with self-consistency, GRACE outperforms all the baselines by sizeable margins. Human and LLM evaluations over GSM8K show that GRACE not only improves the final answer accuracy but also the correctness of the intermediate reasoning. Our implementation can be accessed at https://github.com/mukhal/grace.

[79] Context-aware Decoding Reduces Hallucination in Query-focused Summarization

Zhichao Xu

Main category: cs.CL

TL;DR: Large-scale reproducibility study of Context-aware Decoding (CAD) for query-focused summarization shows it reduces hallucinations while maintaining ROUGE scores, but increases computational cost.

Details

Motivation: Query-focused summarization using LLMs suffers from hallucinations, especially when evidence contradicts LLM priors. Need to evaluate decoding methods like CAD that aim to improve generation quality and reduce hallucinations.

Method: Conducted large-scale reproducibility study of Context-aware Decoding (CAD) across 8 language models. Extended original experiments to QFS datasets with rigorous analysis of computational complexity and hyperparameter sensitivity.

Result: CAD improves QFS quality by reducing factuality errors/hallucinations while mostly retaining ROUGE scores. However, it increases inference-time FLOPs and reduces decoding speed.

Conclusion: CAD is effective for reducing hallucinations in QFS but comes with computational trade-offs. The method shows promise for improving LLM-based summarization systems.

Abstract: Query-focused summarization (QFS) aims to provide a summary of a single document/multi documents that can satisfy the information needs of a given query. It is useful for various real-world applications, such as abstractive snippet generation or more recent retrieval augmented generation (RAG). A prototypical QFS pipeline consists of a retriever (sparse or dense retrieval) and a generator (usually a large language model). However, applying large language models (LLM) potentially leads to hallucinations, especially when the evidence contradicts the prior belief of LLMs. There has been growing interest in developing new decoding methods to improve generation quality and reduce hallucination. In this work, we conduct a large-scale reproducibility study on one recently proposed decoding method, – ,Context-aware Decoding (CAD). In addition to replicating CAD’s experiments on news summarization datasets, we include experiments on QFS datasets, and conduct more rigorous analysis on computational complexity and hyperparameter sensitivity. Experiments with eight different language models show that performance-wise, CAD improves QFS quality by (1) reducing factuality errors/hallucinations while (2) mostly retaining the match of lexical patterns, measured by ROUGE scores, while also at a cost of increased inference-time FLOPs and reduced decoding speed. The \href{https://github.com/zhichaoxu-shufe/context-aware-decoding-qfs}{code implementation} based on Huggingface Library is made available

[80] LTLBench: Towards Benchmarks for Evaluating Temporal Reasoning in Large Language Models

Weizhi Tang, Kwabena Nuamah, Vaishak Belle

Main category: cs.CL

TL;DR: Researchers propose using Linear Temporal Logic (LTL) to evaluate LLMs’ temporal reasoning abilities, creating LTLBench dataset with 2000 challenges and benchmarking 12 LLMs across 5 methods.

Details

Motivation: Temporal reasoning is crucial for LLMs to understand temporal information and event relationships, but existing evaluation methods have limitations. The authors seek an alternative perspective using formal logic to better assess TR capabilities.

Method: Developed an automated pipeline to synthesize temporal reasoning challenges using Linear Temporal Logic (LTL), created LTLBench dataset with 2000 challenges, benchmarked 12 LLMs across 5 different evaluation methods, and analyzed impact of formula complexity and event count.

Result: The study revealed 3 main issues in LLMs’ temporal reasoning processes and unexpected performance changes as problem complexity increases, providing insights into how LLMs handle temporal reasoning challenges.

Conclusion: The LTL-based evaluation approach offers valuable insights into LLMs’ temporal reasoning abilities, highlighting specific weaknesses and complexity-related performance patterns that can guide future model development and evaluation.

Abstract: Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. To study the TR ability in LLMs, prior works provide different ways for evaluating various aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, namely LTLBench, consisting of $2000$ TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of varying the number of events and formula operators, which reveal 3 main issues in their temporal reasoning processes and the unexpected performance changes observed as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.

[81] RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine

Jiatan Huang, Mingchen Li, Zonghai Yao, Dawei Li, Yuxin Zhang, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu

Main category: cs.CL

TL;DR: RiTeK is a new benchmark dataset for evaluating LLM-based retrieval systems on medical Textual Knowledge Graphs, addressing limitations in existing medical TKGs and revealing deficiencies in current retrieval methods.

Details

Motivation: Medical question answering requires accurate retrieval from medical TKGs, but current systems face three main bottlenecks: scarcity of medical TKGs, limited expressiveness of topological structures, and lack of comprehensive evaluations for medical TKG retrievers.

Method: Developed RiTeK dataset by synthesizing realistic user queries with diverse topological structures, relational information, and complex textual descriptions, followed by rigorous medical expert evaluation to validate query quality.

Result: Evaluation of 11 representative retrievers on RiTeK benchmark shows existing methods struggle to perform well, revealing notable limitations in current LLM-driven retrieval approaches for medical TKGs.

Conclusion: The findings highlight the pressing need for more effective retrieval systems tailored for semi-structured data in the medical domain, with RiTeK serving as a comprehensive benchmark for future development.

Abstract: Answering complex real-world questions in the medical domain often requires accurate retrieval from medical Textual Knowledge Graphs (medical TKGs), as the relational path information from TKGs could enhance the inference ability of Large Language Models (LLMs). However, the main bottlenecks lie in the scarcity of existing medical TKGs, the limited expressiveness of their topological structures, and the lack of comprehensive evaluations of current retrievers for medical TKGs. To address these challenges, we first develop a Dataset1 for LLMs Complex Reasoning over medical Textual Knowledge Graphs (RiTeK), covering a broad range of topological structures. Specifically, we synthesize realistic user queries integrating diverse topological structures, relational information, and complex textual descriptions. We conduct a rigorous medical expert evaluation process to assess and validate the quality of our synthesized queries. RiTeK also serves as a comprehensive benchmark dataset for evaluating the capabilities of retrieval systems built upon LLMs. By assessing 11 representative retrievers on this benchmark, we observe that existing methods struggle to perform well, revealing notable limitations in current LLM-driven retrieval approaches. These findings highlight the pressing need for more effective retrieval systems tailored for semi-structured data in the medical domain.

[82] Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research

Tianyang Zhong, Zhenyuan Yang, Zhengliang Liu, Ruidong Zhang, Weihang You, Yiheng Liu, Haiyang Sun, Yi Pan, Yiwei Li, Yifan Zhou, Hanqi Jiang, Junhao Chen, Tianming Liu

Main category: cs.CL

TL;DR: LLMs offer transformative potential for studying low-resource languages despite data scarcity challenges, enabling interdisciplinary approaches to preserve linguistic and cultural heritage.

Details

Motivation: Low-resource languages are invaluable repositories of human history and cultural diversity but face critical challenges including data scarcity and technological limitations that hinder their comprehensive study and preservation.

Method: Systematically evaluates LLM applications in low-resource language research through analysis of technical frameworks, current methodologies, and ethical considerations across linguistic variation, historical documentation, cultural expressions, and literary analysis.

Result: Identifies key challenges such as data accessibility, model adaptability, and cultural sensitivity, while emphasizing interdisciplinary collaboration and customized model development as promising avenues for advancing research.

Conclusion: Integrating AI with humanities can preserve humanity’s linguistic and cultural heritage, fostering global efforts to safeguard intellectual diversity through LLM applications in low-resource language research.

Abstract: Low-resource languages serve as invaluable repositories of human history, embodying cultural evolution and intellectual diversity. Despite their significance, these languages face critical challenges, including data scarcity and technological limitations, which hinder their comprehensive study and preservation. Recent advancements in large language models (LLMs) offer transformative opportunities for addressing these challenges, enabling innovative methodologies in linguistic, historical, and cultural research. This study systematically evaluates the applications of LLMs in low-resource language research, encompassing linguistic variation, historical documentation, cultural expressions, and literary analysis. By analyzing technical frameworks, current methodologies, and ethical considerations, this paper identifies key challenges such as data accessibility, model adaptability, and cultural sensitivity. Given the cultural, historical, and linguistic richness inherent in low-resource languages, this work emphasizes interdisciplinary collaboration and the development of customized models as promising avenues for advancing research in this domain. By underscoring the potential of integrating artificial intelligence with the humanities to preserve and study humanity’s linguistic and cultural heritage, this study fosters global efforts towards safeguarding intellectual diversity.

[83] Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models

Ido Cohen, Daniela Gottesman, Mor Geva, Raja Giryes

Main category: cs.CL

TL;DR: VLMs show significant accuracy drop (up to 18%) when answering factual questions about entities presented visually vs. textually, due to inefficient information flow from image tokens to query tokens in deeper layers.

Details

Motivation: To investigate why vision-language models perform worse on factual questions when entities are presented visually compared to textually, and to understand the internal mechanics causing this performance gap.

Method: Created PopVQA dataset to separate entity recognition from question answering, benchmarked several models, and used mechanistic interpretability tools to analyze information flow between image and query tokens across model layers.

Result: Found significant accuracy drop (up to 18%) for visual vs. textual entity presentation. Mechanistic analysis revealed that meaningful information flow from image tokens occurs only in deep layers, with critical image processing happening in middle layers, leaving few layers for reasoning.

Conclusion: VLMs have inefficient layer utilization for reasoning, with delayed information flow from visual inputs. These insights reveal internal mechanics and offer pathways to enhance VLM reasoning capabilities by optimizing information flow and layer allocation.

Abstract: Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model performance when answering factual questions about an entity described in text versus depicted in an image. Our results reveal a significant accuracy drop - reaching 18% for some models - when the entity is presented visually instead of textually. To study this gap we present PopVQA, a dataset which allows separating entity recognition and question answering, and use it to benchmark several models. We hypothesize that this decline arises from limitations in how information flows from image tokens to query tokens. Thus, we use mechanistic interpretability tools to reveal that, although image tokens are preprocessed by the vision encoder, meaningful information flow from these tokens occurs only in the much deeper layers. Furthermore, critical image processing happens in the language model’s middle layers, allowing few layers for consecutive reasoning, highlighting a potential inefficiency in how the model utilizes its layers for reasoning. These insights shed light on the internal mechanics of VLMs and offer pathways for enhancing their reasoning capabilities. PopVQA can be found at https://huggingface.co/datasets/idoco/PopVQA.

[84] GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

Bo Lv, Chen Tang, Zifan Zheng, Bohao Yang, Kun Zhao, Ning Liao, Xiaoxing Wang, Feiyu Xiong, Zhiyu Li, Nayu Liu, Jingchi Jiang

Main category: cs.CL

TL;DR: GRAPHMOE enhances Mixture-of-Experts networks by adding inter-expert connections and recurrent routing to simulate iterative thinking, achieving SOTA performance with LoRA implementation.

Details

Motivation: Traditional MoE networks have independent experts that don't communicate, limiting their potential. The paper explores whether interconnecting experts could enhance MoE network performance and cognitive depth.

Method: GRAPHMOE introduces a self-rethinking mechanism on Pseudo GraphMoE networks with recurrent routing strategy to simulate iterative thinking steps, facilitating information flow among expert nodes. Implemented using Low-Rank Adaptation (LoRA) techniques.

Result: GRAPHMOE outperforms other LoRA-based models and achieves state-of-the-art performance on various benchmark datasets. The recurrent routing strategy shows promise for enhancing reasoning capabilities.

Conclusion: Interconnecting experts in MoE networks through GRAPHMOE’s recurrent routing enhances performance and cognitive depth, offering a novel approach to improve language model reasoning capabilities.

Abstract: Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert models as opposed to a single large network. However, these experts typically operate independently, leaving a question open about whether interconnecting these models could enhance the performance of MoE networks. In response, we introduce GRAPHMOE, a novel method aimed at augmenting the cognitive depth of language models via a self-rethinking mechanism constructed on Pseudo GraphMoE networks. GRAPHMOE employs a recurrent routing strategy to simulate iterative thinking steps, thereby facilitating the flow of information among expert nodes. We implement the GRAPHMOE architecture using Low-Rank Adaptation techniques (LoRA) and conduct extensive experiments on various benchmark datasets. The experimental results reveal that GRAPHMOE outperforms other LoRA based models, achieving state-of-the-art (SOTA) performance. Additionally, this study explores a novel recurrent routing strategy that may inspire further advancements in enhancing the reasoning capabilities of language models.

[85] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang

Main category: cs.CL

TL;DR: LLMs can develop advanced reasoning abilities through pure reinforcement learning without human-labeled demonstrations, outperforming supervised learning approaches on math, coding, and STEM tasks.

Details

Motivation: Current LLM reasoning success depends heavily on human-annotated demonstrations and struggles with complex problems. There's a need for methods that can develop reasoning capabilities without extensive human labeling.

Method: Uses pure reinforcement learning framework to incentivize reasoning abilities in LLMs, facilitating emergent development of advanced reasoning patterns like self-reflection, verification, and dynamic strategy adaptation.

Result: The RL-trained model achieves superior performance on verifiable tasks (mathematics, coding competitions, STEM fields) compared to models trained via conventional supervised learning on human demonstrations. The emergent reasoning patterns can also be used to enhance smaller models.

Conclusion: Reinforcement learning can effectively develop advanced reasoning capabilities in LLMs without human-labeled trajectories, creating models that outperform supervised approaches and whose reasoning patterns can transfer to smaller models.

Abstract: General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models’ capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.

[86] Sorting the Babble in Babel: Assessing the Performance of Language Identification Algorithms on the OpenAlex Database

Maxime Holmberg Sainte-Marie, Diego Kozlowski, Lucía Céspedes, Vincent Larivière

Main category: cs.CL

TL;DR: Comparison of Python language identification algorithms for OpenAlex database optimization shows FastText on Titles corpus performs best when recall or processing speed matters, while LangID on greedy corpus excels when precision is prioritized.

Details

Motivation: To optimize linguistic indexing of OpenAlex database by evaluating different language identification procedures, addressing the lack of truly multilingual large-scale bibliographic databases and supporting cross-linguistic research.

Method: Compared performance of various Python-based language identification algorithms (including LangID and FastText) on different metadata corpora extracted from manually-annotated article samples. Analyzed precision, recall, processing speeds, and simulated database-level performance using probabilistic confusion matrices and language frequency modeling.

Result: Procedure performance depends on measurement priorities: LangID on greedy corpus performs best when precision is preferred, while FastText on Titles corpus outperforms all alternatives when recall is considered important or processing times matter.

Conclusion: Results confirm OpenAlex’s potential for cross-linguistic research and provide practical guidance for language identification optimization based on specific performance priorities (precision vs recall vs speed).

Abstract: This project aims to optimize the linguistic indexing of the OpenAlex database by comparing the performance of various Python-based language identification procedures on different metadata corpora extracted from a manually-annotated article sample \footnote{OpenAlex used the results presented in this article to inform the language metadata overhaul carried out as part of its recent Walden system launch. The precision and recall performance of each algorithm, corpus, and language is first analyzed, followed by an assessment of processing speeds recorded for each algorithm and corpus type. These different performance measures are then simulated at the database level using probabilistic confusion matrices for each algorithm, corpus, and language, as well as a probabilistic modeling of relative article language frequencies for the whole OpenAlex database. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on the greedy corpus gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, the procedure that consists in the application of the FastText algorithm on the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic and comprehensive measurement and evaluation.

[87] CoSER: A Comprehensive Literary Dataset and Framework for Training and Evaluating LLM Role-Playing and Persona Simulation

Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao

Main category: cs.CL

TL;DR: CoSER introduces a comprehensive framework for role-playing language agents with a high-quality dataset of 17,966 characters from 771 books, open models (CoSER 8B/70B), and an evaluation protocol based on acting methodology.

Details

Motivation: Existing role-playing language agents struggle with simulating established characters due to lack of authentic character datasets and nuanced evaluation methods using such data.

Method: Created CoSER dataset with authentic dialogues and diverse data types; introduced “given-circumstance acting” methodology for training/evaluation; developed CoSER 8B and 70B models based on LLaMA-3.1.

Result: CoSER 70B achieves state-of-the-art performance, surpassing or matching GPT-4o on evaluation benchmarks (75.80% on InCharacter, 93.47% on LifeChoice). Dataset proves valuable for training, evaluation, and retrieval.

Conclusion: CoSER provides a comprehensive solution for effective role-playing language agents through high-quality data, open models, and evaluation protocols, advancing the field of character simulation.

Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.

[88] A Survey of Text Classification Under Class Distribution Shift

Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu

Main category: cs.CL

TL;DR: Survey paper on open-set text classification methods for handling distribution shifts in text data over time, categorizing approaches by problem constraints and discussing mitigation strategies.

Details

Motivation: Machine learning models assume training and test data come from the same distribution, but in practice this assumption is often violated as test data distributions change over time. Text classification is particularly affected since people constantly discuss new topics, requiring methods to handle shifting class distributions.

Method: The paper surveys research on open-set text classification by categorizing methods based on distribution shift constraints and problem formulations: learning with the Universum, zero-shot learning, and open-set learning. It discusses predominant mitigation approaches for each problem setup.

Result: The survey organizes existing literature on handling distribution shifts in text classification, identifies that continual learning can solve many issues caused by shifting class distributions, and provides a maintained repository of relevant papers.

Conclusion: The paper provides a comprehensive survey of open-set text classification methods, identifies future research directions to advance beyond current state-of-the-art, and highlights continual learning as a promising approach for handling distribution shifts in text data.

Abstract: The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e.~the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e.~learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.

[89] KVCrush: Key value cache size-reduction using similarity in head-behaviour

Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Nilesh Jain

Main category: cs.CL

TL;DR: KVCrush is a KV cache compression technique that reduces memory footprint by 4x with <1% accuracy drop and minimal latency overhead, compatible with existing deployment schemes.

Details

Motivation: KV caching is essential for LLM inference efficiency but creates huge memory bottlenecks with large context lengths, limiting batch sizes and throughput. Existing compression techniques often degrade model accuracy.

Method: Proposes KVCrush with two components: 1) an alternate representation scheme for key-value states, and 2) a low-overhead token pruning algorithm that considers token distribution in the KV cache.

Result: Reduces LongBench KV Cache size by 4x with less than 1% accuracy drop, achieves state-of-the-art average accuracy with minimal overhead (<0.5% total inference latency), and outperforms importance-based token retention schemes.

Conclusion: KVCrush effectively compresses KV cache memory while maintaining accuracy, is compatible with practical LLM deployments (vLLM, mixed precision quantization), and addresses the memory bottleneck without significant performance degradation.

Abstract: Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to large context lengths in the modern LLMs, the memory footprint of the KV is a huge bottleneck for model deployment directly impacting the model’s batch size, hindering its ability to deliver high-throughput. Existing research addresses this challenge using several techniques, such as discarding low-attention tokens, quantization, and matrix approximation which typically lead to a negative impact on the model accuracy. In this paper, We propose KVCrush technology which can be combined with many KV compression technologies to improve the model accuracy at a much smaller memory. KVCrush provides an alternate representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a a smaller footprint while maintaining the accuracy of the model. Based on our results, KVCrush reduces LongBench KV Cache size by 4x with less than 1% accuracy drop and achieves state-of-the-art average accuracy with minimal overhead, incurring less than 0.5% total inference latency. KVCrush not only outperforms the accuracy of state-of-the-art importance-based token retention schemes but is also compatible with typical practical LLM deployments using KV cache paging schemes such as vLLM and mixed precision quantization.

[90] nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning

Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo

Main category: cs.CL

TL;DR: nBench 2.0 is a new benchmark for evaluating Text-to-Visualization systems on ambiguous queries, featuring 7,878 queries and 24,076 visualizations across 153 domains, with Step-Text2Vis model achieving SOTA performance.

Details

Motivation: Text2VIS systems struggle with ambiguous natural language queries where users express visualization needs imprecisely, requiring better evaluation benchmarks and models to handle ambiguity.

Method: Created nBench 2.0 benchmark using controlled ambiguity-injection pipeline that starts with unambiguous seed visualizations and injects ambiguities through reverse-generation workflow. Also proposed Step-Text2Vis, an LLM-based model trained on nBench 2.0 with step-wise preference optimization.

Result: Step-Text2Vis outperforms all baseline LLMs on ambiguous Text2VIS tasks, setting new state-of-the-art performance on the nBench 2.0 benchmark.

Conclusion: nBench 2.0 provides comprehensive evaluation for ambiguous Text2VIS tasks, and Step-Text2Vis demonstrates effective handling of ambiguous queries through step-wise reasoning and preference optimization.

Abstract: Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous Text2VIS tasks using nBench 2.0. We also propose Step-Text2Vis, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-Text2Vis outperforms all baselines, setting a new state-of-the-art for ambiguous Text2VIS tasks. Our source code and data are available at https://nvbench2.github.io/

[91] Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian-guang Lou, Haoyi Xiong

Main category: cs.CL

TL;DR: This survey paper systematically reviews evaluation methods for LLM-based conversational agents, developing two taxonomies: one for what to evaluate (agent components and dimensions) and another for how to evaluate (methodologies and metrics).

Details

Motivation: There is a need for comprehensive evaluation frameworks for LLM-based agents in multi-turn conversations, as existing methods may not adequately capture the dynamic, interactive nature of such dialogues or provide holistic assessment approaches.

Method: Used a PRISMA-inspired systematic review framework to analyze nearly 250 scholarly sources, then developed two interrelated taxonomy systems: one for evaluation components/dimensions and another for evaluation methodologies.

Result: Created structured taxonomies covering: (1) what to evaluate - task completion, response quality, user experience, memory/context retention, planning/tool integration; (2) how to evaluate - annotation-based, automated metrics, hybrid strategies, and self-judging LLM methods.

Conclusion: The paper establishes a comprehensive evaluation framework for LLM-based conversational agents that captures both traditional language metrics and advanced techniques suitable for dynamic multi-turn dialogues, providing a solid foundation for future research and assessment.

Abstract: This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines \emph{what to evaluate} and another that explains \emph{how to evaluate}. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.

[92] Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Weihang You, Hanqi Jiang, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma

Main category: cs.CL

TL;DR: Survey paper analyzing Knowledge Distillation (KD) and Dataset Distillation (DD) techniques for compressing Large Language Models while preserving reasoning capabilities and linguistic diversity.

Details

Motivation: Address the computational and data demands of exponentially growing LLMs by developing efficient compression strategies that maintain model performance and capabilities.

Method: Comprehensive analysis of KD methods (task-specific alignment, rationale-based training, multi-teacher frameworks) and DD techniques (gradient matching, latent space regularization, generative synthesis), plus exploration of their integration.

Result: Identifies how KD and DD integration can create more effective and scalable compression strategies, with applications in healthcare and education enabling efficient deployment without performance loss.

Conclusion: While substantial progress has been made, challenges remain in preserving emergent reasoning, adapting to evolving models/datasets, and establishing evaluation protocols; tighter integration of KD and DD principles offers path to sustainable, resource-efficient LLMs.

Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

[93] Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu, Shangzhe Li, Xinhua Zhang

Main category: cs.CL

TL;DR: A framework for temporal difference-based distillation of large language models that exploits distributional sparsity by operating on reduced action spaces.

Details

Motivation: Large language models have high computational costs due to their massive sizes, and distillation is needed to create smaller, more efficient models. Existing distillation methods can be viewed through imitation learning or inverse reinforcement learning lenses.

Method: Introduces a general temporal difference-based distillation framework that exploits the distributional sparsity of teacher models (where most probability mass is concentrated on a small subset of tokens). The framework operates on a reduced action space (subset of vocabulary) and shows how practical algorithms can be derived from this approach.

Result: The paper demonstrates how practical algorithms can be derived from the framework and shows resulting performance improvements, though specific results are not detailed in the abstract.

Conclusion: Rather than proposing another specific temporal difference method, the paper provides a general framework for temporal difference-based distillation that leverages distributional sparsity to improve efficiency and performance.

Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.

[94] Explainability-Based Token Replacement on LLM-Generated Text

Hadi Mohammadi, Anastasia Giachanou, Daniel L. Oberski, Ayoub Bagheri

Main category: cs.CL

TL;DR: XAI methods can reduce AI-generated text detectability by modifying influential tokens, but ensemble classifiers remain robust against such manipulations.

Details

Motivation: AI-generated text is becoming more human-like but still exhibits detectable patterns; researchers want to explore how explainable AI methods can both reduce detectability and create robust detection approaches.

Method: Train ensemble classifier to detect AI-generated text, use SHAP and LIME to identify influential tokens, propose four explainability-based token replacement strategies to modify these tokens, and evaluate on multiple languages/domains.

Result: Token replacement strategies significantly reduce single classifier’s detection ability, but ensemble classifier maintains strong performance across languages and domains, showing robustness against token-level manipulations.

Conclusion: XAI methods can make AI-generated text harder to detect by targeting influential tokens, but ensemble-based detection strategies provide robustness against such evasion techniques, highlighting the need for adaptive multi-model approaches.

Abstract: Generative models, especially large language models (LLMs), have shown remarkable progress in producing text that appears human-like. However, they often exhibit patterns that make their output easier to detect than text written by humans. In this paper, we investigate how explainable AI (XAI) methods can be used to reduce the detectability of AI-generated text (AIGT) while also introducing a robust ensemble-based detection approach. We begin by training an ensemble classifier to distinguish AIGT from human-written text, then apply SHAP and LIME to identify tokens that most strongly influence its predictions. We propose four explainability-based token replacement strategies to modify these influential tokens. Our findings show that these token replacement approaches can significantly diminish a single classifier’s ability to detect AIGT. However, our ensemble classifier maintains strong performance across multiple languages and domains, showing that a multi-model approach can mitigate the impact of token-level manipulations. These results show that XAI methods can make AIGT harder to detect by focusing on the most influential tokens. At the same time, they highlight the need for robust, ensemble-based detection strategies that can adapt to evolving approaches for hiding AIGT.

[95] EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

Main category: cs.CL

TL;DR: EMONET-VOICE introduces a comprehensive speech emotion recognition resource with a 5,000-hour multilingual pre-training dataset covering 40 fine-grained emotions and a rigorously validated benchmark, using synthetic voice generation for ethical inclusion of sensitive emotions.

Details

Motivation: Existing SER datasets are limited to 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states like pain and shame.

Method: Two-component approach: (1) EmoNet-Voice Big - 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, (2) EmoNet-Voice Bench - 4.7k benchmark samples with unanimous expert consensus. Uses state-of-the-art synthetic voice generation for privacy-preserving, ethical data collection, with each sample validated by three psychology experts.

Result: Empathic Insight models trained on synthetic data achieve strong real-world generalization (tested on EmoDB and RAVDESS). High-arousal emotions like anger achieve 95% accuracy, while perceptually similar emotions like sadness vs. distress show only 63% discrimination, providing quantifiable metrics for advancing nuanced emotion AI.

Conclusion: EMONET-VOICE establishes a new paradigm for large-scale, ethically-sourced, fine-grained SER research, addressing limitations of existing datasets while enabling ethical inclusion of sensitive emotional states through synthetic voice generation.

Abstract: Speech emotion recognition (SER) systems are constrained by existing datasets that typically cover only 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states. We introduce EMONET-VOICE, a comprehensive resource addressing these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4,7k samples with unanimous expert consensus on emotion presence and intensity levels. Using state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample underwent validation by three psychology experts. We demonstrate that our Empathic Insight models trained on our synthetic data achieve strong real-world dataset generalization, as tested on EmoDB and RAVDESS. Furthermore, our comprehensive evaluation reveals that while high-arousal emotions (e.g., anger: 95% accuracy) are readily detected, the benchmark successfully exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EMONET-VOICE establishes a new paradigm for large-scale, ethically-sourced, fine-grained SER research.

[96] Exploring Cultural Variations in Moral Judgments with Large Language Models

Hadi Mohammadi, Ayoub Bagheri

Main category: cs.CL

TL;DR: LLMs show varying ability to capture cross-cultural moral values, with advanced instruction-tuned models aligning better with human survey data than earlier/smaller models, though alignment remains stronger with WEIRD nations.

Details

Motivation: To examine whether LLMs can capture culturally diverse moral values and mirror variations in moral attitudes reported by global surveys (WVS and PEW), addressing concerns about cultural bias in AI systems.

Method: Compared smaller monolingual/multilingual models (GPT-2, OPT, BLOOMZ, Qwen) with instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, Llama-3.3-70B-Instruct) using log-probability-based moral justifiability scores correlated with survey data across ethical topics.

Result: Earlier/smaller models show near-zero or negative correlations with human judgments, while advanced instruction-tuned models achieve substantially higher positive correlations. Models align better with WEIRD nations than other regions, though scaling and instruction tuning improve cross-cultural alignment.

Conclusion: While scaling and instruction tuning improve LLMs’ alignment with cross-cultural moral norms, challenges persist for certain topics and regions, highlighting needs for bias analysis, training data diversity, and strategies to enhance cultural sensitivity.

Abstract: Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center’s Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based \emph{moral justifiability} scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. We provide a detailed regional analysis revealing that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions. While scaling model size and using instruction tuning improves alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, information retrieval implications, and strategies for improving the cultural sensitivity of LLMs.

[97] Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding

Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee

Main category: cs.CL

TL;DR: Introduces Thunder-NUBench, a specialized benchmark for evaluating LLMs’ sentence-level understanding of negation through diverse contrastive structures.

Details

Motivation: Current benchmarks treat negation as a minor detail within broader tasks, lacking specialized evaluation tools for this fundamental linguistic phenomenon that challenges LLMs' semantic understanding.

Method: Creates Thunder-NUBench with manually curated sentence-negation pairs and multiple-choice datasets, contrasting standard negation with structurally diverse alternatives like local negation, contradiction, and paraphrase.

Result: Introduces a novel benchmark specifically designed to assess LLMs’ comprehension of negation at the sentence level, going beyond surface-level cue identification.

Conclusion: Thunder-NUBench provides a comprehensive evaluation framework for LLMs’ negation understanding, addressing a gap in current benchmarking approaches.

Abstract: Negation is a fundamental linguistic phenomenon that poses ongoing challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks often treat negation as a minor detail within broader tasks, such as natural language inference. Consequently, there is a lack of benchmarks specifically designed to evaluate comprehension of negation. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond merely identifying surface-level cues by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. This benchmark includes manually curated sentence-negation pairs and a multiple-choice dataset, allowing for a comprehensive evaluation of models’ understanding of negation.

[98] MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection

Hexiang Gu, Qifan Yu, Yuan Liu, Zikang Li, Saihui Hou, Jian Zhao, Zhaofeng He

Main category: cs.CL

TL;DR: Researchers created MemeMind, a large-scale harmful meme dataset with detailed reasoning annotations, and proposed MemeGuard, a reasoning-oriented multimodal detection model that improves both accuracy and interpretability of harmful meme detection.

Details

Motivation: Harmful memes are challenging to detect due to their implicit content conveyed through metaphors and humor. Current methods struggle with implicit risks and nuanced semantics, and there's a scarcity of large-scale, high-quality datasets for harmful memes.

Method: 1) Constructed MemeMind dataset with detailed Chain-of-Thought reasoning annotations aligned with international standards and internet context. 2) Proposed MemeGuard, a reasoning-oriented multimodal detection model designed to better capture implicit intentions in memes.

Result: MemeGuard outperforms existing state-of-the-art methods on the MemeMind dataset, demonstrating significant improvements in both detection accuracy and model interpretability.

Conclusion: The MemeMind dataset and MemeGuard model establish a solid foundation for future research in harmful meme detection, addressing the challenges of implicit content and providing better interpretability for model decisions.

Abstract: As a multimodal medium combining images and text, memes frequently convey implicit harmful content through metaphors and humor, rendering the detection of harmful memes a complex and challenging task. Although recent studies have made progress in detection accuracy and interpretability, large-scale, high-quality datasets for harmful memes remain scarce, and current methods still struggle to capture implicit risks and nuanced semantics. Thus, we construct MemeMind, a large-scale harmful meme dataset. Aligned with the international standards and the context of internet, MemeMind provides detailed Chain-of-Thought (CoT) reasoning annotations to support fine-grained analysis of implicit intentions in memes. Based on this dataset, we further propose MemeGuard, a reasoning-oriented multimodal detection model that significantly improves both the accuracy of harmful meme detection and the interpretability of model decisions. Extensive experimental results demonstrate that MemeGuard outperforms existing state-of-the-art methods on the MemeMind dataset, establishing a solid foundation for future research in harmful meme detection.

[99] Cosmos: Compressed and Smooth Latent Space for Text Diffusion Modeling

Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov

Main category: cs.CL

TL;DR: Cosmos introduces a compressed latent space for text diffusion models, achieving 8× compression while matching token-level diffusion quality and offering 2× faster inference than autoregressive models.

Details

Motivation: Autoregressive models are slow and struggle with global coherence; token-level diffusion models face dimensionality challenges. Need a faster, more coherent text generation approach.

Method: Learns compressed latent space via autoencoder trained for token reconstruction and alignment with frozen language encoder. Uses diffusion in this space with perturbation-based augmentations.

Result: 8× compression maintains quality comparable to token-level diffusion. Longer latent sequences surpass both diffusion and autoregressive baselines. 2× faster inference with comparable/superior quality on 4 tasks.

Conclusion: Cosmos demonstrates effective latent space compression for text diffusion, enabling faster inference while maintaining or improving generation quality across diverse tasks.

Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference. Code is released at \href{https://github.com/MeshchaninovViacheslav/cosmos}{GitHub}

[100] La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu

Main category: cs.CL

TL;DR: LaRoSA introduces layerwise orthogonal rotations to transform LLM activations for better sparsification, achieving consistent sparsity and reliable speed-up without retraining or magnitude-based pruning.

Details

Motivation: Existing activation sparsity methods either require time-consuming recovery training (hindering adoption) or rely on empirical magnitude-based pruning (causing fluctuating sparsity and unstable inference speed-up).

Method: LaRoSA uses layerwise orthogonal rotations to transform input activations into rotated forms more suitable for sparsification, then applies Top-K selection within rotated activations to achieve consistent model-level sparsity.

Result: For LLaMA2-7B at 40% sparsity: 0.17 perplexity gap, 1.30x wall-clock time speed-up, reduces zero-shot accuracy gap to 0.54% vs dense model, surpasses TEAL by 1.77% and CATS by 17.14%.

Conclusion: LaRoSA effectively improves LLM efficiency without additional training or magnitude-based pruning, demonstrating minimal performance degradation and robust inference acceleration across various LLM sizes and types.

Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.

[101] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

Main category: cs.CL

TL;DR: Text2VLM is a multi-stage pipeline that converts text-only datasets into multimodal formats to evaluate Visual Language Models’ vulnerability to typographic prompt injection attacks, revealing significant safety weaknesses in current VLMs.

Details

Motivation: Existing evaluation datasets for VLMs are heavily text-focused, leaving visual vulnerabilities under-evaluated. There's a need for robust model alignment when handling multimodal content combining text and images, especially against typographic prompt injection attacks.

Method: Text2VLM uses a multi-stage pipeline that identifies harmful content in original text-only datasets and converts it into typographic images, creating multimodal prompts specifically designed to test VLMs’ resilience against prompt injection attacks.

Result: Evaluation shows open-source VLMs are significantly more susceptible to prompt injection when visual inputs are introduced, revealing critical alignment weaknesses. There’s also a substantial performance gap compared to closed-source frontier models. Human validation confirms Text2VLM’s effectiveness in aligning with human expectations.

Conclusion: Text2VLM provides a scalable tool for comprehensive safety assessment of VLMs, contributing to more robust safety mechanisms and advancing the safe deployment of VLMs in real-world applications by better evaluating multimodal vulnerabilities.

Abstract: The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under evaluated. To address this gap, we propose \textbf{Text2VLM}, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring the alignment of extracted salient concepts; text summarization and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.

[102] Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers

Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney

Main category: cs.CL

TL;DR: Transformer models achieve higher stem accuracy than humans on Spanish irregular morphomic patterns, but show different response preferences and limited phonological generalization similar to humans only under specific training frequency conditions.

Details

Motivation: To investigate whether transformer models generalize morphological patterns like humans do, specifically examining their ability to replicate human-like sensitivity to the morphome - a complex linguistic phenomenon in Spanish irregular verb patterns.

Method: Direct comparison of transformers to human behavioral data using the same analytical framework as the original human study. Controlled input conditions with three frequency distributions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns.

Result: Transformer models achieved higher stem-accuracy than human participants. However, response preferences diverged: humans consistently favored “natural” inflection across all items, while models preferred irregular forms, with choices modulated by training data proportions. Models trained on natural and low-frequency distributions (but not high-frequency) showed sensitivity to phonological similarity between test items and Spanish L-shaped verbs, mirroring limited human phonological generalization.

Conclusion: Transformer models demonstrate different generalization patterns than humans on morphological tasks - while achieving higher accuracy, they show divergent response preferences and only limited, condition-dependent phonological generalization similar to human behavior.

Abstract: Do transformer models generalize morphological patterns like humans do? We investigate this by directly comparing transformers to human behavioral data on Spanish irregular morphomic patterns from \citet{Nevins2015TheRA}. We adopt the same analytical framework as the original human study. Under controlled input conditions, we evaluate whether transformer models can replicate human-like sensitivity to the morphome, a complex linguistic phenomenon. Our experiments focus on three frequency conditions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns. Transformer models achieve higher stem-accuracy than human participants. However, response preferences diverge: humans consistently favor the “natural” inflection across all items, whereas models preferred the irregular forms, and their choices are modulated by the proportion of irregular verbs present during training. Moreover, models trained on the natural and low-frequency distributions, but not the high-frequency distribution, exhibit sensitivity to phonological similarity between test items and Spanish L-shaped verbs, mirroring a limited aspect of human phonological generalization.

[103] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges

Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Keyan Ding, Huajun Chen

Main category: cs.CL

TL;DR: Efficient dialogue evaluator that aggregates multiple LLM judges’ knowledge into a single model, reducing computational cost while maintaining evaluation quality.

Details

Motivation: Current LLM-as-a-judge methods suffer from biases, and while multi-judge approaches improve reliability, they incur significant computational overhead during inference.

Method: Propose an efficient dialogue evaluator that captures collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model.

Result: Outperforms existing baselines across seven single rating and pairwise comparison dialogue evaluation benchmarks, demonstrating efficiency and robustness.

Conclusion: The method preserves advantages of diverse multi-judge feedback while drastically reducing evaluation cost, enabling fast, flexible, and fine-grained dialogue quality assessment.

Abstract: Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the “LLM-as-a-judge” paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast, flexible, and fine-grained dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.

[104] User-Assistant Bias in LLMs

Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie

Main category: cs.CL

TL;DR: LLMs trained with role tags develop user-assistant bias - preferring user vs assistant information in conflicts. Most instruction-tuned models show strong user bias, while base/reasoning models are neutral. Human-preference alignment increases user bias, reasoning fine-tuning reduces it.

Details

Motivation: Role tags (system, user, assistant, tool) in LLMs are essential for instruction following but create asymmetries in training data that introduce inductive biases. The paper aims to study user-assistant bias - LLMs' tendency to preferentially rely on information from user vs assistant roles during conflicts.

Method: 1) Formalize user-assistant bias concept; 2) Create task-agnostic benchmark UserAssist to evaluate bias; 3) Evaluate 52 frontier models; 4) Conduct controlled fine-tuning experiments to isolate which post-training recipes drive bias; 5) Use direct preference optimization (DPO) on UserAssist-train to bidirectionally control bias; 6) Test generalization to realistic multi-turn conversations.

Result: Most instruction-tuned models exhibit strong user bias, while base and reasoning models are close to neutral. Human-preference alignment amplifies user bias, reasoning fine-tuning reduces it. User-assistant bias can be bidirectionally controlled via DPO, and the resulting bias reliably generalizes to realistic multi-turn conversations.

Conclusion: Role-tagged training creates underexplored biases in LLMs. The paper provides a principled framework to diagnose and control tag-induced biases, revealing that different training recipes (alignment vs reasoning) have opposite effects on user-assistant bias.

Abstract: Modern large language models (LLMs) are typically trained and deployed using structured role tags (e.g. system, user, assistant, tool) that explicitly mark the source of each piece of context. While these tags are essential for instruction following and controllability, asymmetries in the training data associated with different role tags can introduce inductive biases. In this paper, we study this phenomenon by formalizing user-assistant bias, defined as the tendency of an LLM to preferentially rely on information from either the user or assistant role when there is a conflict. We introduce a task-agnostic benchmark UserAssist and evaluate such bias in 52 frontier models. We observe that most of the instruction-tuned models exhibit strong user bias, whereas base and reasoning models are close to neutral. Using controlled fine-tuning experiments, we isolate which post-training recipes drive the observed user-assistant bias. We find that human-preference alignment amplifies user bias, while reasoning fine-tuning reduces it. Finally, we show that user-assistant bias can be bidirectionally controlled via direct preference optimization (DPO) on UserAssist-train, and that the resulting bias reliably generalizes to a more realistic multi-turn conversation dataset. These results reveal an underexplored consequence of role-tagged training and provide a principled framework to diagnose and control tag-induced biases in modern LLMs.

[105] Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation

Hongyu Cao, Yuxuan Wu, Yucheng Cai, Xianyu Zhao, Zhijian Ou

Main category: cs.CL

TL;DR: JSA-RAG: A new end-to-end training method for retrieval-augmented generation using joint stochastic approximation to address biased/high-variance gradient issues in marginalizing over discrete latent passages.

Details

Motivation: Traditional RAG training methods (top-K marginalization and VRAG) suffer from biased or high-variance gradient estimates when marginalizing over discrete latent passages, making end-to-end optimization challenging.

Method: Proposes JSA-RAG using joint stochastic approximation algorithm, a stochastic extension of EM algorithm, specifically designed for estimating discrete latent variable models in RAG frameworks.

Result: Extensive experiments on 5 datasets for 2 tasks (open-domain QA, knowledge-grounded dialogs) show JSA-RAG significantly outperforms both vanilla RAG and VRAG.

Conclusion: JSA-RAG provides effective end-to-end training for RAG models with low-variance gradient estimates, improving both generation and retrieval performance compared to existing methods.

Abstract: Retrieval-augmented generation (RAG) has become a widely recognized paradigm to combine parametric memory with non-parametric memories. An RAG model consists of two serial connecting components (retriever and generator). A major challenge in end-to-end optimization of the RAG model is that marginalization over relevant passages (modeled as discrete latent variables) from a knowledge base is required. Traditional top-K marginalization and variational RAG (VRAG) suffer from biased or high-variance gradient estimates. In this paper, we propose and develop joint stochastic approximation (JSA) based end-to-end training of RAG, which is referred to as JSA-RAG. The JSA algorithm is a stochastic extension of the EM (expectation-maximization) algorithm and is particularly powerful in estimating discrete latent variable models. Extensive experiments are conducted on five datasets for two tasks (open-domain question answering, knowledge-grounded dialogs) and show that JSA-RAG significantly outperforms both vanilla RAG and VRAG. Further analysis shows the efficacy of JSA-RAG from the perspectives of generation, retrieval, and low-variance gradient estimate.

[106] Polarity Detection of Sustainable Development Goals in News Text

Andrea Cadeddu, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi

Main category: cs.CL

TL;DR: Proposes SDG polarity detection task to assess whether text indicates positive/negative impact on Sustainable Development Goals, introduces SDG-POD benchmark dataset, and evaluates LLMs showing task remains challenging but fine-tuned models can achieve good performance.

Details

Motivation: While NLP/LLMs can classify text relevance to SDGs, determining directionality (positive/negative impact) is equally important for sustainability monitoring but remains an underexplored challenge.

Method: Proposes novel SDG polarity detection task, creates SDG-POD benchmark dataset with original and synthetic data, evaluates six state-of-the-art LLMs in zero-shot and fine-tuned configurations, and tests data augmentation techniques.

Result: Task remains challenging for current LLMs, but fine-tuned models (especially QWQ-32B) achieve good performance on specific SDGs (9, 12, 15). Synthetic data augmentation improves model performance, demonstrating effectiveness in resource-constrained domains.

Conclusion: Advances sustainability monitoring methodology, provides insights for developing efficient polarity detection systems, and shows data enrichment techniques can address domain challenges despite task difficulty for current LLMs.

Abstract: The United Nations’ Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art large LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.

[107] On the Robustness of Answer Formats in Medical Reasoning Models

Pittawat Taveekitworachai, Natpatchara Pongjirapat, Krittaphas Chaisutyakorn, Piyalitt Ittichaiwong, Tossaporn Saengja, Kunat Pipatanakul

Main category: cs.CL

TL;DR: Medical reasoning models show format-dependent performance variation (35-100% robustness), with supervised fine-tuning being more stable than reinforcement learning across different answer formats.

Details

Motivation: While medical reasoning models achieve high accuracy on benchmarks, practical deployment requires robustness to varying output constraints - the same medical question should yield correct answers regardless of requested format.

Method: Proposed answer-format robustness metric and evaluated 15 models across three formats (multiple-choice, open-ended QA, ranked lists). Conducted controlled fine-tuning experiments on shared backbone with matched training data to isolate effects of fine-tuning paradigms.

Result: Substantial variation in format robustness (35-100%) across models. Supervised fine-tuning yields more stable behavior across formats, while reinforcement fine-tuning exhibits higher cross-format brittleness dependent on reward design.

Conclusion: Answer-format robustness in medical reasoning models is trainable but brittle, requiring careful evaluation for practical medical use, with fine-tuning paradigm significantly impacting stability.

Abstract: Medical reasoning models (MRMs) achieve superior performance on medical benchmarks compared to medical LLMs; however, high accuracy alone is insufficient for practical deployment. One of such requirements for real-world application is robustness to varying output constraints. Specifically, posing the same medical question while requesting different answer formats should not affect the underlying correctness of the response. We investigate this phenomenon in this paper, focusing on MRMs. To quantify this behavior, we propose the metric answer-format robustness: the ability to reliably generate correct outputs across varying specified formats. We examine three representative formats: multiple-choice, open-ended question-answering, and ranked lists. Across 15 proprietary and open-weight models, we observe substantial variation in format robustness (35-100%). Furthermore, we conduct controlled fine-tuning experiments on a shared backbone with matched training data to isolate the effects of the fine-tuning paradigm. We find that supervised fine-tuning yields more stable behavior across formats, whereas reinforcement fine-tuning often exhibits higher cross-format brittleness, with the degree of instability strongly dependent on reward design. Overall, answer-format robustness in MRMs is trainable yet brittle and requires careful evaluation for practical medical use.

[108] Self-Speculative Biased Decoding for Faster Re-Translation

Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang

Main category: cs.CL

TL;DR: SSBD accelerates LLM-based simultaneous translation by reusing previous outputs as speculative drafts, reducing redundant computation in re-translation approaches.

Details

Motivation: LLMs achieve strong translation quality but have high inference cost/latency for simultaneous translation. Re-translation is practical but suffers from redundant computation.

Method: Self-Speculative Biased Decoding (SSBD) reuses previous output as speculative draft, verifies with lightweight bias in single forward pass, resumes decoding from first divergence. Includes display-only masking to hide unstable suffixes.

Result: SSBD achieves substantial speedup over standard re-translation while maintaining comparable translation quality, without architectural changes, auxiliary models, or fine-tuning.

Conclusion: SSBD provides efficient inference acceleration for LLM-based simultaneous translation by exploiting temporal coherence, offering practical solution without requiring model modifications.

Abstract: Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly regenerating the target output as the source input grows, but it suffers from substantial redundant computation. We propose Self-Speculative Biased Decoding (SSBD), a simple and tuning-free inference method that accelerates re-translation by exploiting temporal coherence in streaming translation. SSBD reuses the model’s previous output as a speculative draft for the updated input, verifies the draft efficiently in a single forward pass with a lightweight bias, and resumes autoregressive decoding only from the first divergence. We further introduce a display-only masking strategy that hides unstable suffixes from the user interface while retaining them in the draft for verification and potential acceptance. Experiments show that SSBD achieves substantial speedup over standard re-translation while maintaining comparable translation quality, without architectural changes, auxiliary models, or extra fine-tuning.

[109] QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs

David Beauchemin, Pier-Luc Veilleux, Johanna-Pascale Roy, Richard Khoury

Main category: cs.CL

TL;DR: QFrBLiMP is a Quebec-French benchmark with 1,761 minimal pairs to evaluate LLMs’ grammatical knowledge, revealing performance gaps in semantic understanding and dialectal variations.

Details

Motivation: To evaluate LLMs' linguistic knowledge of Quebec-French grammatical phenomena and compare their competency with human native speakers, addressing dialect-specific evaluation needs.

Method: Created 1,761 minimal pairs from official Quebec government resources, annotated by 12 native speakers. Evaluated LLMs by comparing probability assignments to grammatical vs ungrammatical sentences across 20 linguistic phenomena.

Result: Grammatical competence scales with model size but models consistently fail on phenomena requiring deep semantic understanding. Most models show significant performance degradation on Quebec-French vs standard French, though top models demonstrate cross-dialectal robustness.

Conclusion: QFrBLiMP reveals critical limitations in LLMs’ semantic understanding and highlights dialectal performance variations, establishing a valuable benchmark for Quebec-French linguistic evaluation.

Abstract: In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate LLMs’ linguistic knowledge of prominent grammatical phenomena in Quebec-French. QFrBLiMP comprises 1,761 minimal pairs annotated with 20 LPs. Specifically, these minimal pairs have been created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by 12 Quebec-French native speakers, who select the sentence they consider grammatical from the two. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by observing the rate of higher probabilities assigned to the sentences of each minimal pair for each category. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation. Finally, our statistical analysis comparing QFrBLiMP and MultiBLiMP reveals a significant performance degradation for most models on Quebec-French; however, the most capable models remain within the statistical significance interval, demonstrating cross-dialectal robustness.

[110] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Xiangyu Peng, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Chien-Sheng Wu

Main category: cs.CL

TL;DR: UniDoc-Bench is a large-scale, realistic benchmark for multimodal retrieval-augmented generation (MM-RAG) built from real PDF pages, enabling comprehensive evaluation of text, image, and multimodal approaches for document-centric tasks.

Details

Motivation: Current MM-RAG evaluations are fragmented - focusing on either text or images in isolation, or simplified multimodal setups, failing to capture real-world document-centric multimodal use cases where knowledge is distributed across text, tables, and figures.

Method: Built from real-world PDF pages across domains, the pipeline extracts and links evidence from text, tables, and figures, then generates multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning. All QA pairs are validated by multiple human annotators and expert adjudication.

Result: Multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, showing that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate.

Conclusion: UniDoc-Bench enables apples-to-apples comparison across four paradigms and reveals when/how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

Abstract: Multimodal retrieval-augmented Generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented – focusing on either text or images in isolation, or simplified multimodal setup, failing to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from $k$ real-world PDF pages across domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, all of QA pairs are validated by multiple human annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: 1) text-only, 2) image-only, 3) \emph{multimodal} text-image fusion and 4) multimodal joint retrieval – under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. UniDoc-Bench can also be used to evaluate Visual Question Answering (VQA) tasks. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.

[111] Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Nuwan I. Senaratna

Main category: cs.CL

TL;DR: Open multilingual document collection from Sri Lanka with 247,818 documents across 26 datasets in Sinhala, Tamil, and English, covering parliamentary, legal, government, news, and tourism data.

Details

Motivation: To provide open, machine-readable resources for research in computational linguistics, legal analytics, socio-political studies, and multilingual NLP by addressing the lack of comprehensive datasets from Sri Lanka.

Method: Created a collection pipeline to gather documents from various Sri Lankan sources, organized into 26 datasets across three languages, with daily updates and mirroring on GitHub and Hugging Face.

Result: Successfully compiled 247,818 documents (67.6 GB) covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics, with version v2026-01-03-0933.

Conclusion: This open collection provides valuable resources for multilingual NLP research and various analytical studies on Sri Lanka, with ongoing maintenance and ethical considerations addressed.

Abstract: We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises of 247,818 documents (67.6 GB) across 26 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2026-01-03-0933.

[112] Steering Evaluation-Aware Language Models to Act Like They Are Deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda

Main category: cs.CL

TL;DR: Activation steering can suppress LLMs’ evaluation-awareness, making them act like they’re deployed during safety evaluations, improving reliability.

Details

Motivation: LLMs can detect when they're being evaluated and adjust behavior to appear more aligned, compromising safety evaluation reliability.

Method: Two-step training: 1) Continued pretraining on documents with factual descriptions of model behavior differences between evaluation/deployment, 2) Expert iteration training to use Python type hints in evaluation settings. Then use activation steering with vectors from original model to suppress evaluation-awareness.

Result: Activation steering successfully suppressed evaluation awareness, making the model act like it’s deployed even when evaluation cues are present.

Conclusion: AI evaluators could improve safety evaluation reliability by steering models to act like they’re deployed, preventing evaluation-aware behavior manipulation.

Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM’s activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.

[113] Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs

Hang Lei, Shengyi Zong, Zhaoyan Li, Ziren Zhou, Hao Liu

Main category: cs.CL

TL;DR: DSR framework decomposes screenplay generation into two stages: creative narrative generation from outlines, then format conversion to professional screenplay structure, achieving 75% win rate against strong baselines.

Details

Motivation: Direct end-to-end LLM generation fails to produce professional screenplays because it forces models to simultaneously master creative narrative construction and rigid format adherence, resulting in superficially styled but structurally deficient outputs.

Method: Dual-Stage Refinement (DSR) framework: Stage 1 transforms brief outlines into rich novel-style prose (creative narrative generation). Stage 2 refines this narrative into professionally formatted screenplay (format conversion). Uses hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, and forward synthesis generates narrative texts as training targets.

Result: Blind evaluations by professional screenwriters show DSR achieves 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance.

Conclusion: Decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains, enabling high-quality screenplay generation by separating creative and formatting capabilities.

Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.

[114] SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

Edouard Lansiaux, Antoine Simonet, Eric Wiel

Main category: cs.CL

TL;DR: Static token lookup method achieves 1.12ms latency for text embeddings with 60.6 MTEB score (89% of contextual model quality), delivering 50k RPS throughput via optimized Rust implementation.

Details

Motivation: Enable real-time embedding applications requiring sub-5ms latency by developing a fast, efficient alternative to contextual models that maintains reasonable quality while dramatically improving speed.

Method: Static token lookup methodology for text embedding generation using Rust implementation with static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization.

Result: Achieves 1.12 ms p50 latency for single text embeddings with 60.6 MTEB average score (89% of contextual model quality), 50k RPS throughput, 90.1% AP for duplicate detection, 76.1% Spearman correlation for semantic similarity, and domain-specific performance ranging from 75% to 131% of baseline.

Conclusion: The static token lookup system successfully enables real-time embedding applications where sub-5ms latency is critical, providing a practical balance between speed and quality for production deployment.

Abstract: We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critica

[115] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Xinming Wang, Jian Xu, Bin Yu, Sheng Lian, Hongzhu Yi, Yi Chen, Yingjian Zhu, Boran Wang, Hongming Yang, Han Hu, Xu-Yao Zhang, Cheng-Lin Liu

Main category: cs.CL

TL;DR: MR-ALIGN is a meta-reasoning alignment framework that improves factuality in large reasoning models by addressing the reasoning-answer hit gap, where models identify correct facts during reasoning but fail to use them in final answers.

Details

Motivation: Large reasoning models show strong complex reasoning capabilities but have limited gains on evidence-dependent factual questions due to a "reasoning-answer hit gap" - they identify correct facts during reasoning but fail to incorporate them into final responses, reducing factual fidelity.

Method: MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at atomic thinking segments. This re-weights token-level signals into probability-aware segment scores to encourage coherent reasoning trajectories.

Result: Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning.

Conclusion: Aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in large reasoning models.

Abstract: Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

[116] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong, Yi R. Fung

Main category: cs.CL

TL;DR: The paper proposes 1PNS (one problem, multiple solutions) training paradigm and Reasoning Path Divergence (RPD) metric to increase diversity in LLM reasoning outputs, improving Test-Time Scaling effectiveness.

Details

Motivation: Current LLM training with "one problem, one solution" (1P1S) creates low diversity in model outputs, limiting sampling effectiveness and restricting exploration space for RL stages. This homogenization becomes a bottleneck for Test-Time Scaling.

Method: Introduces 1PNS training paradigm that exposes models to diverse valid reasoning trajectories. Proposes Reasoning Path Divergence (RPD), a step-level metric to measure semantic differences between multi-step chains of thought. Uses RPD to curate maximally diverse solution sets for fine-tuning Qwen3-4B-Base.

Result: RPD-selected training yields more varied outputs and higher pass@k, with average +2.80% gain in pass@16 over 1P1S baseline and +4.99% gain on AIME24, demonstrating 1PNS amplifies Test-Time Scaling effectiveness.

Conclusion: The 1PNS paradigm with RPD metric successfully addresses diversity limitations in LLM reasoning, improving Test-Time Scaling performance by exposing models to varied reasoning paths and enabling better exploration.

Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common “one problem, one solution” (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. This homogenization not only limits sampling effectiveness but also restricts the exploration space for subsequent Reinforcement Learning (RL) stages. To address this, we propose a “one problem, multiple solutions” (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .

[117] VISTA Score: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White

Main category: cs.CL

TL;DR: VISTA is a new framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking in multi-turn dialogues.

Details

Motivation: Existing metrics for hallucination detection either evaluate isolated responses or treat unverifiable content as errors, limiting their effectiveness for multi-turn dialogue evaluation where factual reliability is crucial.

Method: VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining).

Result: Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms improved annotator agreement and reveals inconsistencies in existing benchmarks.

Conclusion: By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems, addressing the challenge of hallucination in conversational AI.

Abstract: Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.

[118] HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li

Main category: cs.CL

TL;DR: HaluMem is the first operation-level hallucination evaluation benchmark for memory systems, featuring three tasks (extraction, updating, QA) to localize hallucinations across different operational stages, with large-scale multi-turn interaction datasets.

Details

Motivation: Current memory systems in AI (LLMs, agents) suffer from hallucinations (fabrication, errors, conflicts, omissions) during storage/retrieval, but existing evaluations are end-to-end QA which can't localize where in the memory pipeline hallucinations occur.

Method: Introduced HaluMem benchmark with three evaluation tasks: memory extraction, memory updating, and memory question answering. Constructed two large-scale datasets (HaluMem-Medium and HaluMem-Long) with ~15k memory points, 3.5k multi-type questions, and extremely long dialogues (1.5k-2.6k turns, >1M tokens).

Result: Empirical studies show existing memory systems tend to generate and accumulate hallucinations during extraction and updating stages, which then propagate errors to the question answering stage.

Conclusion: Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability, with HaluMem providing the necessary evaluation framework.

Abstract: Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.

[119] Diagnosing and Mitigating Semantic Inconsistencies in Wikidata’s Classification Hierarchy

Shixiong Zhao, Hideaki Takeda

Main category: cs.CL

TL;DR: The paper analyzes taxonomic inconsistencies in Wikidata’s knowledge graph and proposes validation methods to identify classification errors, over-generalized subclass links, and redundant connections.

Details

Motivation: Wikidata's open editing policy and integration of diverse data sources have made it a central knowledge resource, but this openness has also led to taxonomic inconsistencies that need systematic identification and correction.

Method: Proposes a novel validation method to detect classification errors, over-generalized subclass links, and redundant connections in specific Wikidata domains. Also introduces an evaluation criterion for determining when issues warrant correction and develops a system for inspecting taxonomic relationships.

Result: The study confirms the presence of taxonomic inconsistencies in Wikidata and demonstrates the effectiveness of the proposed validation approach for identifying specific types of classification problems.

Conclusion: The research provides tools and criteria for improving Wikidata’s taxonomic consistency while leveraging its crowdsourced nature, enabling users to inspect and potentially correct relationship issues in the knowledge graph.

Abstract: Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction and develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities-leveraging the platform’s crowdsourced nature to its full potential.

[120] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang

Main category: cs.CL

TL;DR: SGASA framework uses model-generated safety guidelines and adaptive fine-tuning to improve reasoning models’ defense against adversarial jailbreak prompts while reducing unnecessary refusals of benign requests.

Details

Motivation: Reasoning models have strong capabilities but remain vulnerable to adversarial jailbreak prompts that can bypass safety mechanisms and generate harmful content. There's a need for adaptive safety alignment that allows models to autonomously reinforce defenses against adversarial inputs.

Method: SGASA framework with two stages: 1) Data Pre-synthesis generates safety guidelines and augmented prompts, 2) Alignment Fine-tuning uses Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed guidelines into the model.

Result: Extensive experiments across multiple datasets show SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

Conclusion: SGASA provides an effective framework for adaptive safety alignment that strengthens models’ robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests.

Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models’ ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[121] DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

L. D. M. S. Sai Teja, N. Siva Gopala Krishna, Ufaq Khan, Muhammad Haris Khan, Atul Mishra

Main category: cs.CL

TL;DR: Info-Mask framework detects transitions between human and AI authorship in text using stylometric cues, perplexity signals, and boundary modeling, with adversarial robustness tested on MAS benchmark and interpretable attribution overlays.

Details

Motivation: As LLMs blur boundaries between human and AI text, there's a critical need to identify authorship transitions in mixed-authorship content for authenticity, trust, and human oversight purposes.

Method: Info-Mask integrates stylometric cues, perplexity-driven signals, and structured boundary modeling. Includes adversarial benchmark MAS, Human-Interpretable Attribution overlays, and human study for interpretability assessment.

Result: Info-Mask significantly improves span-level robustness under adversarial conditions across multiple architectures, establishing new baselines while revealing remaining challenges in mixed-authorship detection.

Conclusion: The work demonstrates promise and limitations of adversarially robust, interpretable mixed-authorship detection, with important implications for trust and oversight in human-AI co-authorship scenarios.

Abstract: In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is identifying transition points in text where authorship shifts from human to AI or vice-versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask for mixed authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.

[122] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots

Jihyung Park, Saleh Afroogh, David Atkinson, Junfeng Jiao

Main category: cs.CL

TL;DR: GAUGE is a logit-based framework for real-time detection of hidden conversational escalation in LLM interactions, addressing implicit emotional harm that traditional toxicity filters miss.

Details

Motivation: LLMs are increasingly used as emotional companions, but they can cause implicit harm through repeated emotional reinforcement or affective drift that gradually escalates distress. Traditional toxicity filters fail to detect this, and existing guardrail mechanisms using external classifiers or clinical rubrics lag behind real-time conversational dynamics.

Method: GAUGE (Guarding Affective Utterance Generation Escalation) is a logit-based framework that measures how an LLM’s output probabilistically shifts the affective state of a dialogue in real-time.

Result: The paper proposes GAUGE as a solution for detecting hidden conversational escalation, but the abstract doesn’t provide specific experimental results or performance metrics.

Conclusion: GAUGE addresses the gap in existing safety mechanisms by providing real-time detection of implicit emotional harm in LLM conversations through probabilistic analysis of affective state shifts.

Abstract: Large Language Models (LLM) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress in a form of \textit{implicit harm} that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM’s output probabilistically shifts the affective state of a dialogue.

[123] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition

Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun

Main category: cs.CL

TL;DR: EDU-based Context Compressor: A novel framework that compresses long contexts by first structuring text into Elementary Discourse Units (EDUs) trees, then selecting query-relevant sub-trees, achieving better structural understanding and downstream task performance than existing methods.

Details

Motivation: Long contexts in LLMs cause high computational costs and noise. Existing compression methods either disrupt local coherence through token removal or suffer from positional bias and incompatibility with closed-source APIs.

Method: Two-step approach: 1) LingoEDU transforms linear text into structural relation trees of Elementary Discourse Units (EDUs) anchored to source indices to prevent hallucination. 2) Lightweight ranking module selects query-relevant sub-trees for linearization.

Result: Achieves state-of-the-art structural prediction accuracy, significantly outperforms frontier LLMs while reducing costs. Structure-aware compression enhances performance across downstream tasks including long-context tasks and complex Deep Search scenarios.

Conclusion: The EDU-based Context Compressor effectively addresses limitations of existing compression techniques by preserving both global structure and fine-grained details through explicit structural representation, offering a practical solution for long-context applications.

Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.

[124] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen

Main category: cs.CL

TL;DR: SGM is a white-box neuron-level intervention method that selectively recalibrates toxic expert neurons in multimodal LLMs to mitigate toxicity without parameter updates, reducing harmful rates from 48.2% to 2.5% while preserving model performance.

Details

Motivation: Multimodal LLMs inherit toxic, biased, and NSFW signals from weakly curated pretraining data, creating safety risks. Existing training-free detoxification methods struggle with adversarial triggers and lack interpretability.

Method: SGM uses expertise-weighted soft suppression to selectively recalibrate a small set of toxic expert neurons, acting like “safety glasses” for toxic neurons. It neutralizes harmful cross-modal activations without parameter updates and integrates with existing methods as SGM*.

Result: SGM reduces toxicity rates from 48.2% to 2.5% in standard and adversarial conditions while preserving fluency and multimodal reasoning. The combined SGM* defense provides stronger safety performance.

Conclusion: SGM provides an interpretable, low-cost solution for toxicity-controlled multimodal generation that effectively mitigates safety risks without compromising model capabilities, offering a white-box alternative to opaque detoxification methods.

Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.

[125] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad

Main category: cs.CL

TL;DR: MMRB2 is the first comprehensive benchmark for multimodal reward models, covering text-to-image, image editing, interleaved generation, and multimodal reasoning tasks with 1,000 expert-annotated preference pairs per task.

Details

Motivation: Reward models are crucial for training LLMs but remain underexplored for multimodal models that handle interleaved image and text sequences, creating a need for comprehensive evaluation benchmarks.

Method: Created MMRB2 benchmark with: (1) practical but challenging prompts, (2) responses from state-of-the-art models and agents, (3) preference pairs with strong human-expert consensus via ensemble filtering strategy. Evaluated existing judges including multimodal LLM-as-a-judge and human-preference-trained models.

Result: Gemini 3 Pro achieves 75-80% accuracy, GPT-5 and Gemini 2.5 Pro reach 66-75%, surpassing GPT-4o (59%). Best open-source model Qwen3-VL-32B matches Gemini 2.5 Flash (64%). Humans achieve >90% accuracy. MMRB2 performance strongly correlates with downstream task success.

Conclusion: MMRB2 provides the first comprehensive benchmark for multimodal reward models, revealing significant gaps between current models and human performance, and identifying key areas for improvement in reward modeling for multimodal systems.

Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.

[126] FaithLens: Detecting and Explaining Faithfulness Hallucination

Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Main category: cs.CL

TL;DR: FaithLens is an 8B-parameter model for detecting faithfulness hallucinations in LLM outputs, providing both binary predictions and explanations, outperforming larger models like GPT-4.1 while being cost-efficient.

Details

Motivation: Faithfulness hallucination detection is crucial for real-world LLM applications like retrieval-augmented generation and summarization, but existing solutions need improvement in trustworthiness, efficiency, and effectiveness.

Method: 1) Synthesize training data with explanations using advanced LLMs, 2) Apply data filtering for label correctness, explanation quality, and diversity, 3) Fine-tune model on curated data, 4) Further optimize with rule-based reinforcement learning using rewards for both prediction correctness and explanation quality.

Result: FaithLens outperforms advanced models like GPT-4.1 and o3 on 12 diverse tasks, produces high-quality explanations, and achieves a distinctive balance of trustworthiness, efficiency, and effectiveness.

Conclusion: FaithLens provides a cost-efficient and effective solution for faithfulness hallucination detection that improves trustworthiness through joint prediction and explanation capabilities, demonstrating superior performance to much larger models.

Abstract: Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.

[127] AprielGuard

Jaykumar Kasundra, Anjaneya Praharaj, Sourabh Surana, Lakshmi Sirisha Chodisetty, Sourav Sharma, Abhigya Verma, Abhishek Bhardwaj, Debasish Kanhar, Aakash Bhagat, Khalil Slimi, Seganrasan Subramanian, Sathwik Tejaswi Madhusudhan, Ranga Prasad Chenna, Srinivas Sunkara

Main category: cs.CL

TL;DR: AprielGuard is an 8B parameter safeguard model that unifies safety risk detection (toxicity, bias) and adversarial threat detection (prompt injections, jailbreaks) into a single framework, outperforming existing open-source guardrails.

Details

Motivation: Existing moderation tools treat safety risks and adversarial threats as separate problems, limiting robustness and generalizability as LLMs are increasingly deployed in conversational and agentic settings.

Method: AprielGuard is trained on diverse open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces for interpretability.

Result: AprielGuard achieves strong performance across multiple public and proprietary benchmarks, outperforming existing open-source guardrails like Llama-Guard and Granite Guardian, especially in multi-step and reasoning-intensive scenarios.

Conclusion: By releasing the model, the authors aim to advance transparent and reproducible research on reliable safeguards for LLMs through a unified approach to safety and adversarial threat detection.

Abstract: Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g. toxicity, bias) and adversarial threats (e.g. prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unify these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing opensource guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.

[128] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

Amirhosein Ghasemabadi, Di Niu

Main category: cs.CL

TL;DR: Gnosis enables LLMs to predict their own mistakes by analyzing internal states during inference, adding minimal parameters and achieving better accuracy than external judges.

Details

Motivation: LLMs often fail to recognize their own mistakes and hallucinations. Existing approaches require external judges, multi-sample consistency, or text-based self-critique, which are computationally expensive or weakly correlated with true correctness.

Method: Gnosis is a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. It passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost (~5M parameters).

Result: Across math reasoning, open-domain QA, and academic knowledge benchmarks (1.7B to 20B parameter models), Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. It also generalizes zero-shot to partial generations for early failure detection.

Conclusion: Reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision, enabling LLMs to develop self-awareness about their own mistakes.

Abstract: Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to generation process and can be extracted efficiently without external supervision.

[129] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli

Main category: cs.CL

TL;DR: Adversarial training framework improves user simulator realism for evaluating mental health chatbots by pitting generator against discriminator, enhancing failure mode detection and distributional alignment.

Details

Motivation: Realistic user simulation is essential for training and evaluating task-oriented dialogue systems, but creating simulators that accurately replicate human behavior and expose system failure modes remains challenging.

Method: Adversarial training framework with competitive dynamic between generator (user simulator) and discriminator, iteratively improving simulator realism through adversarial iterations.

Result: Fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues; adversarial training enhances diversity, distributional alignment, and predictive validity; achieves strong correlation between simulated and real failure rates; discriminator accuracy decreases drastically after three adversarial iterations.

Conclusion: Adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

Abstract: Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.

[130] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

Zubaida Mohammed Albadani, Mohammed Q. Shormani

Main category: cs.CL

TL;DR: Qulk-clauses in Yemeni Ibbi Arabic are biclausal structures where ‘qulk’ (I said) functions as a clause-embedding predicate selecting a full CP complement, analyzed within Minimalist Program framework.

Details

Motivation: To investigate the syntactic structure of qulk-clauses in Yemeni Ibbi Arabic within the Minimalist Program framework, as these constructions present interesting syntactic properties including morphological fusion, absence of complementizers, and dialect-specific features.

Method: Applying core minimalist operations (Merge, Move, Agree, Spell-out) to provide a layered syntactic analysis, examining how derivation proceeds through computational steps and post-syntactic processes like Morphological Merger, while accounting for dialect-specific features.

Result: The study proposes that qulk-clauses are biclausal structures where qulk functions as a clause-embedding predicate selecting a full CP complement, successfully accounting for their syntactic properties and dialect-specific features like bipartite negation, cliticization, and CP embedding.

Conclusion: The findings contribute to generative syntax/minimalism, raise questions about extending the analysis to addressee-clauses like ‘kil-k’ (you said), and provide insights into the potential universality of minimalist principles across languages.

Abstract: This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning ‘I said,’ introduces embedded declarative interrogative, and imperative clauses, often eithout complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions a clause-embedding predicate sec;ecting a dull CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, for illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes raising theoretical questions concerning extending the analysis to the addressee-clause kil-k ‘you said’. It also provides insights into the possibility of the universality of minimalism.

[131] TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish

Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı

Main category: cs.CL

TL;DR: TabiBERT is a new monolingual Turkish encoder based on ModernBERT architecture, trained from scratch on 1 trillion tokens from a multi-domain corpus, achieving state-of-the-art performance on Turkish NLP tasks.

Details

Motivation: Turkish NLP lacks a monolingual encoder trained from scratch with modern architectural advances like RoPE, FlashAttention, and refined normalization that have improved computational efficiency, training stability, and long-context modeling in encoder-only Transformers.

Method: Developed TabiBERT using ModernBERT architecture with Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Pre-trained from scratch on 1 trillion tokens from a curated multi-domain corpus (73% web text, 20% scientific publications, 6% source code, 0.3% mathematical content). Created TabiBench with 28 datasets across 8 task categories for standardized evaluation.

Result: TabiBERT achieves 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing SOTA on 5 of 8 categories. Shows strong gains on question answering (+9.55), code retrieval (+2.41), and academic understanding (+0.66). Achieves +1.47 average improvement over task-specific prior best results. Supports 8,192-token context length (16x BERT) with 2.65x inference speedup and reduced GPU memory consumption.

Conclusion: TabiBERT successfully addresses the gap in Turkish NLP by providing a modern monolingual encoder with improved performance, efficiency, and cross-domain generalization. The release of model weights, configurations, and evaluation code enables transparent and reproducible research for the Turkish NLP community.

Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch, incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). It supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories, with particularly strong gains on question answering (+9.55 points), code retrieval (+2.41 points), and academic understanding (+0.66 points). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.

[132] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang, Qingsen Ma, Yuhu Shang, Zhifeng Lu, Zhenbo Xu, Lechen Ning, Huijia Wu, Zhaofeng He

Main category: cs.CL

TL;DR: SAILS uses sparse autoencoders to disentangle safety features, creates interpretable safety subspace, and initializes LoRA adapters for parameter-efficient safety alignment that matches RLHF performance.

Details

Motivation: Safety alignment is critical for LLM deployment, but current parameter-efficient methods like LoRA underperform full fine-tuning and RLHF due to semantic entanglement where safety directions are intertwined with unrelated concepts.

Method: SAILS leverages Sparse Autoencoders (SAEs) to disentangle representations into monosemantic features, constructs an interpretable safety subspace from SAE decoder directions, and uses this subspace to initialize LoRA adapters for efficient safety alignment.

Result: SAILS achieves up to 99.6% safety rate on Gemma-2-9B, exceeding full fine-tuning by 7.4 points and matching RLHF-based models while updating only 0.19% of parameters and providing interpretability.

Conclusion: SAILS demonstrates that interpretable feature disentanglement enables parameter-efficient safety alignment that matches state-of-the-art performance while providing transparency and requiring minimal parameter updates.

Abstract: Safety alignment – training large language models (LLMs) to refuse harmful requests while remaining helpful – is critical for responsible deployment. Prior work established that safety behaviors are governed by low-rank structures, suggesting parameter-efficient fine-tuning (PEFT) should be well-suited for alignment. However, Low-Rank Adaptation (LoRA) consistently underperforms full fine-tuning and reinforcement learning on safety benchmarks. We attribute this gap to semantic entanglement: safety-relevant directions are intertwined with unrelated concepts due to polysemanticity, impeding implicit subspace identification. To address this, we propose SAILS (Safety Alignment via Interpretable Low-rank Subspace), which leverages Sparse Autoencoders (SAEs) to disentangle representations into monosemantic features, constructs an interpretable safety subspace from SAE decoder directions, and uses it to initialize LoRA adapters. Theoretically, we prove that SAE-based identification achieves arbitrarily small recovery error under monosemanticity assumptions, while direct identification suffers an irreducible error floor. Empirically, SAILS achieves up to 99.6% safety rate on Gemma-2-9B – exceeding full fine-tuning by 7.4 points and matching RLHF-based models – while updating only 0.19% of parameters and providing interpretability.

[133] When in Doubt, Consult: Expert Debate for Sexism Detection via Confidence-Based Routin

Anwar Alajmi, Gabriele Pergola

Main category: cs.CL

TL;DR: A two-stage framework combining targeted training procedures and selective reasoning-based inference to detect subtle, context-dependent sexist content that evades traditional methods.

Details

Motivation: Sexist content online is becoming more subtle and context-dependent, evading traditional detection methods. Interpretation depends on overlapping linguistic, psychological, legal, and cultural dimensions, creating mixed signals in datasets. Label scarcity, class imbalance, and conceptual ambiguity lead to unstable decision boundaries and models overlooking underrepresented forms of harm.

Method: Two-stage framework: (1) Training stage uses class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to handle label imbalance and noisy supervision. (2) Inference stage employs dynamic routing - high-confidence cases classified directly, uncertain instances escalated to Collaborative Expert Judgment (CEJ) module that prompts multiple personas and consolidates reasoning through a judge model.

Result: Achieves state-of-the-art results: +4.48% F1 gain on EDOS Task A, +1.30% on EDOS Task B, and +2.79% improvement in ICM on EXIST 2025 Task 1.1.

Conclusion: The proposed framework effectively addresses the combined challenges of underrepresentation, noise, and conceptual ambiguity in sexist content detection through targeted training procedures and selective reasoning-based inference, outperforming existing methods on multiple benchmarks.

Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel \textit{Collaborative Expert Judgment} (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with F1 gains of +4.48% and +1.30% on EDOS Tasks A and B, respectively, and a +2.79% improvement in ICM on EXIST 2025 Task 1.1.

[134] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun, Min Zhu

Main category: cs.CL

TL;DR: MedKGI is a clinical diagnostic framework that uses medical knowledge graphs to ground LLM reasoning, selects questions based on information gain for efficiency, and maintains structured state tracking to improve diagnostic accuracy and dialogue efficiency.

Details

Motivation: Current LLMs struggle with clinical diagnosis due to three key limitations: generating hallucinated medical content, asking redundant/infficient questions, and losing coherence in multi-turn dialogues, making them ineffective for real clinical diagnostic scenarios.

Method: MedKGI integrates medical knowledge graphs to constrain reasoning to validated ontologies, selects questions based on information gain to maximize diagnostic efficiency, and uses OSCE-format structured state to maintain consistent evidence tracking across dialogue turns.

Result: Experiments show MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy on clinical benchmarks.

Conclusion: MedKGI successfully addresses key limitations of current LLMs in clinical diagnosis by grounding reasoning in verified knowledge, optimizing question selection, and maintaining dialogue coherence, making it a promising framework for clinical diagnostic applications.

Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.

[135] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said, Muhammad Sammani Sani

Main category: cs.CL

TL;DR: LLM safety alignment doesn’t transfer zero-shot across languages; models show complex interference patterns with reverse linguistic vulnerability and catastrophic temporal reasoning failures, creating dangerous safety gaps for Global South users.

Details

Motivation: The assumption that safety alignment transfers zero-shot from English to other languages is a dangerous blind spot in LLM deployment, especially for critical global infrastructure. Current models may leave Global South users exposed to localized harms due to inadequate multilingual safety testing.

Method: Systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios. Employed 2 x 4 factorial design across 1,440 evaluations to test non-linear interactions between language (English vs. Hausa) and temporal framing.

Result: Found reverse linguistic vulnerability (Claude 4.5 Opus safer in Hausa than English), catastrophic temporal reasoning failures, and profound Temporal Asymmetry (past-tense framing bypassed defenses while future-tense triggered hyper-conservative refusals). 9.2x disparity between safest and most vulnerable configurations shows safety is context-dependent.

Conclusion: Current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that expose Global South users. Propose Invariant Alignment as necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state of the art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the narrative of the multilingual safety gap. Instead of a simple degradation in low-resource settings, we identified a complex interference mechanism in which safety is determined by the intersection of variables. Although the models exhibited a reverse linguistic vulnerability with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[136] Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs’ Legal Reasoning Capabilities

Hongseok Oh, Wonseok Hwang, Kyoung-Woon On

Main category: cs.CL

TL;DR: KCL is a Korean legal reasoning benchmark that separates reasoning ability from domain knowledge by providing question-level precedents, with MCQA and essay components.

Details

Motivation: To create a benchmark that assesses language models' legal reasoning capabilities independently of domain-specific knowledge, enabling more faithful evaluation of reasoning ability separate from parameterized knowledge.

Method: Developed KCL with two components: (1) KCL-MCQA - 283 multiple-choice questions with 1,103 aligned precedents, and (2) KCL-Essay - 169 open-ended generation questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation.

Result: Systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform general-purpose counterparts.

Conclusion: KCL provides a valuable benchmark for assessing legal reasoning independent of domain knowledge, revealing significant performance gaps and demonstrating the advantage of reasoning-specialized models over general-purpose ones.

Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models’ legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.

[137] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan

Main category: cs.CL

TL;DR: Youtu-LLM is a 1.96B parameter lightweight language model pre-trained from scratch with native agentic intelligence, featuring long-context support and a progressive training curriculum that achieves SOTA performance for sub-2B models.

Details

Motivation: To create a lightweight language model that doesn't rely on distillation but instead develops intrinsic reasoning and planning capabilities from scratch, addressing the need for efficient yet powerful agentic models suitable for long-horizon tasks.

Method: Three key technical advancements: 1) Compact MLA architecture with STEM-oriented vocabulary supporting 128k context window, 2) Multi-stage training curriculum progressing from commonsense to STEM to agentic tasks using 11T tokens, 3) Scalable agentic mid-training with diverse trajectory synthesis for math, coding, and tool-use domains.

Result: Youtu-LLM achieves state-of-the-art performance for sub-2B LLMs, showing competitive performance on general benchmarks against larger models and significantly surpassing existing SOTA baselines on agent-specific tasks.

Conclusion: Lightweight models can possess strong intrinsic agentic capabilities when properly designed and trained with systematic approaches, challenging the assumption that only large models can exhibit sophisticated reasoning and planning abilities.

Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled “Commonsense-STEM-Agent” Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

[138] mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang

Main category: cs.CL

TL;DR: mHC restores identity mapping in Hyper-Connections to fix training instability and scalability issues while maintaining performance gains.

Details

Motivation: Hyper-Connections (HC) extend residual connections but lose the identity mapping property, causing training instability, restricted scalability, and memory overhead.

Method: Manifold-Constrained Hyper-Connections (mHC) projects HC’s residual connection space onto a specific manifold to restore identity mapping, with infrastructure optimization for efficiency.

Result: mHC enables effective large-scale training with performance improvements and superior scalability compared to HC.

Conclusion: mHC is a flexible, practical HC extension that advances topological architecture design and foundational model evolution.

Abstract: Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

[139] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning

Xiang Gao, Yuguang Yao, Qi Zhang, Kaiwen Dong, Avinash Baidya, Ruocheng Guo, Hilaf Hasson, Kamalika Das

Main category: cs.CL

TL;DR: RIMRULE: A neuro-symbolic approach that distills interpretable rules from LLM failure traces and injects them during inference to improve tool-use performance without weight modification.

Details

Motivation: LLMs struggle with domain-specific tools due to idiosyncratic, under-documented, or private APIs, requiring effective adaptation to task-specific tools.

Method: Dynamic rule injection: LLM proposes rules from failure traces, consolidated using Minimum Description Length objective for generality/conciseness. Rules stored in natural language and structured symbolic form for efficient inference-time retrieval.

Result: Improves accuracy on both seen and unseen tools without modifying LLM weights; outperforms prompting-based adaptation; complements finetuning; rules transferable across different LLM architectures.

Conclusion: RIMRULE enables effective LLM adaptation to domain-specific tools through interpretable rule learning, demonstrating portability of symbolic knowledge across models and complementing existing adaptation methods.

Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.

[140] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang

Main category: cs.CL

TL;DR: Geo-R is a retrieval-free framework for image geolocalization that uses structured geographic reasoning with reinforcement learning, achieving improved accuracy and interpretability without synthetic labels or external retrieval.

Details

Motivation: Existing vision-language models for image geolocalization rely on synthetic reasoning annotations or external image retrieval, which limits interpretability and generalizability. The authors aim to create a more transparent and scalable approach.

Method: Proposes Geo-R framework with: 1) Chain of Region - rule-based hierarchical reasoning that maps GPS coordinates to geographic entities (country, province, city) without synthetic labels, 2) Lightweight reinforcement learning with coordinate-aligned rewards based on Haversine distance for spatially meaningful feedback.

Result: Experimental results across multiple benchmarks confirm Geo-R’s effectiveness, establishing a new retrieval-free paradigm with improved localization accuracy, stronger generalization, and more transparent inference.

Conclusion: Geo-R bridges structured geographic reasoning with direct spatial supervision, creating a scalable and interpretable approach to image geolocalization. The model and code will be publicly available for reproducibility.

Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.

[141] CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng

Main category: cs.CL

TL;DR: CSSBench is a Chinese-specific safety benchmark that evaluates lightweight LLMs against adversarial patterns unique to Chinese (homophones, pinyin, symbol-splitting) across six real-world domains, revealing significant safety vulnerabilities in lightweight models.

Details

Motivation: There's a critical safety evaluation gap for Chinese language models because existing benchmarks focus on English, while real-world Chinese malicious queries use language-specific adversarial patterns (homophones, pinyin, symbol-based splitting) that lightweight models may be particularly vulnerable to in cost-sensitive deployments.

Method: Created CSSBench covering six Chinese-specific domains (illegal activities/compliance, privacy leakage, health misinformation, fraud/hate, adult content, public/political safety) with queries organized into multiple task types. Evaluated popular lightweight LLMs and measured over-refusal behavior to assess safety-induced performance degradation.

Result: Chinese-specific adversarial patterns pose a critical challenge for lightweight LLMs, revealing significant safety vulnerabilities that aren’t captured by English-focused benchmarks. The benchmark provides comprehensive evaluation of LLM safety in Chinese contexts.

Conclusion: CSSBench bridges the safety evaluation gap for Chinese language models by addressing language-specific adversarial patterns, helping ensure robust deployment of lightweight LLMs in real-world Chinese scenarios where cost and on-device constraints are important considerations.

Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create the safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

cs.CV

[142] Free Energy-Based Modeling of Emotional Dynamics in Video Advertisements

Takashi Ushio, Kazuhiro Onishi, Hideyoshi Yanagisawa

Main category: cs.CV

TL;DR: Researchers developed a method to estimate emotions from advertising videos using only scene-level features based on free energy principle, quantifying pleasantness, surprise, and habituation without external data.

Details

Motivation: To establish an explainable emotion estimation method for advertising videos that doesn't rely on physiological signals or subjective ratings, enabling better understanding of emotional responses during video viewing.

Method: Used free energy principle to quantify emotions from scene-level expression features of advertising videos. Applied Kullback-Leibler divergence (KLD) for prediction error (pleasantness), Bayesian surprise (BS) for belief updates (surprise), and uncertainty (UN) for prior ambiguity (surprise/habituation). Tested on 1,059 food video ads.

Result: KLD reflected pleasantness associated with brand presentation, BS captured surprise from informational complexity, and UN reflected surprise from uncertainty in element types/spatial arrangements. Identified three emotional patterns: uncertain stimulus, sustained high emotion, and momentary peak and decay. Method showed robustness across hyperparameters and generalization to different ad types.

Conclusion: The proposed method successfully quantifies emotional responses from advertising videos using only scene features, providing explainable emotion estimation without external data. This can support creation of more engaging advertising content and be extended with more expression elements and subjective validation.

Abstract: Emotional responses during advertising video viewing are recognized as essential for understanding media effects because they have influenced attention, memory, and purchase intention. To establish a methodological basis for explainable emotion estimation without relying on external information such as physiological signals or subjective ratings, we have quantified “pleasantness,” “surprise,” and “habituation” solely from scene-level expression features of advertising videos, drawing on the free energy(FE) principle, which has provided a unified account of perception, learning, and behavior. In this framework, Kullback-Leibler divergence (KLD) has captured prediction error, Bayesian surprise (BS) has captured belief updates, and uncertainty (UN) has reflected prior ambiguity, and together they have formed the core components of FE. Using 1,059 15 s food video advertisements, the experiments have shown that KLD has reflected “pleasantness” associated with brand presentation, BS has captured “surprise” arising from informational complexity, and UN has reflected “surprise” driven by uncertainty in element types and spatial arrangements, as well as by the variability and quantity of presented elements. This study also identified three characteristic emotional patterns, namely uncertain stimulus, sustained high emotion, and momentary peak and decay, demonstrating the usefulness of the proposed method. Robustness across nine hyperparameter settings and generalization tests with six types of Japanese advertising videos (three genres and two durations) confirmed that these tendencies remained stable. This work can be extended by integrating a wider range of expression elements and validating the approach through subjective ratings, ultimately guiding the development of technologies that can support the creation of more engaging advertising videos.

[143] Deepfake Detection with Multi-Artifact Subspace Fine-Tuning and Selective Layer Masking

Xiang Zhang, Wenliang Weng, Daoyong Fu, Ziqiang Li, Zhangjie Fu

Main category: cs.CV

TL;DR: MASM proposes a deepfake detection method using Multi-Artifact Subspaces and selective layer masks to decouple semantic and artifact representations, improving cross-dataset generalization by preserving semantic stability while learning diverse forgery patterns.

Details

Motivation: Deepfake detection struggles with cross-dataset and real-world scenarios due to high diversity of artifact distributions from different forgery methods. Pretrained models disrupt their original semantic structures when adapting to new artifacts, and existing approaches fail to effectively model diverse forgery artifacts while preserving semantic stability.

Method: MASM uses singular value decomposition to partition pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces. It introduces selective layer masks to adaptively regulate layer updates based on each artifact subspace’s learning state, preventing overfitting. Orthogonality and spectral consistency constraints regularize artifact subspaces to learn complementary representations while maintaining stable spectral structure.

Result: The method improves generalization robustness in cross-dataset scenarios by explicitly decoupling semantic representations from artifact representations and constraining artifact subspace fitting strength, enabling better modeling of diverse forgery patterns while preserving semantic stability.

Conclusion: MASM effectively addresses deepfake detection challenges by decoupling semantic and artifact representations through subspace decomposition and adaptive regularization, providing a robust solution for cross-dataset generalization in complex real-world scenarios.

Abstract: Deepfake detection still faces significant challenges in cross-dataset and real-world complex scenarios. The root cause lies in the high diversity of artifact distributions introduced by different forgery methods, while pretrained models tend to disrupt their original general semantic structures when adapting to new artifacts. Existing approaches usually rely on indiscriminate global parameter updates or introduce additional supervision signals, making it difficult to effectively model diverse forgery artifacts while preserving semantic stability. To address these issues, this paper proposes a deepfake detection method based on Multi-Artifact Subspaces and selective layer masks (MASM), which explicitly decouples semantic representations from artifact representations and constrains the fitting strength of artifact subspaces, thereby improving generalization robustness in cross-dataset scenarios. Specifically, MASM applies singular value decomposition to model weights, partitioning pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces. This design enables decoupled modeling of different forgery artifact patterns while preserving the general semantic subspace. On this basis, a selective layer mask strategy is introduced to adaptively regulate the update behavior of corresponding network layers according to the learning state of each artifact subspace, suppressing overfitting to any single forgery characteristic. Furthermore, orthogonality constraints and spectral consistency constraints are imposed to jointly regularize multiple artifact subspaces, guiding them to learn complementary and diverse artifact representations while maintaining a stable overall spectral structure.

[144] Can Generative Models Actually Forge Realistic Identity Documents?

Alexander Vinogradov

Main category: cs.CV

TL;DR: Current generative models can create visually convincing identity documents but fail to achieve forensic-level authenticity needed to bypass verification systems.

Details

Motivation: To assess whether contemporary open-source diffusion models can produce identity document forgeries that could realistically bypass human or automated verification systems, addressing public concerns about misuse for document forgery.

Method: Evaluated text-to-image and image-to-image generation pipelines using multiple publicly available generative model families including Stable Diffusion, Qwen, Flux, Nano-Banana, and others to test document forgery capabilities.

Result: While generative models can simulate surface-level document aesthetics, they fail to reproduce structural and forensic authenticity needed to bypass verification systems.

Conclusion: The risk of generative identity document deepfakes achieving forensic-level authenticity may be overestimated, highlighting the importance of collaboration between ML practitioners and document-forensics experts for realistic risk assessment.

Abstract: Generative image models have recently shown significant progress in image realism, leading to public concerns about their potential misuse for document forgery. This paper explores whether contemporary open-source and publicly accessible diffusion-based generative models can produce identity document forgeries that could realistically bypass human or automated verification systems. We evaluate text-to-image and image-to-image generation pipelines using multiple publicly available generative model families, including Stable Diffusion, Qwen, Flux, Nano-Banana, and others. The findings indicate that while current generative models can simulate surface-level document aesthetics, they fail to reproduce structural and forensic authenticity. Consequently, the risk of generative identity document deepfakes achieving forensic-level authenticity may be overestimated, underscoring the value of collaboration between machine learning practitioners and document-forensics experts in realistic risk assessment.

[145] Pediatric Pneumonia Detection from Chest X-Rays:A Comparative Study of Transfer Learning and Custom CNNs

Agniv Roy Choudhury

Main category: cs.CV

TL;DR: Fine-tuned ResNet50 achieves near-perfect accuracy (99.43%) for pediatric pneumonia detection from chest X-rays, outperforming custom CNNs and frozen-backbone transfer learning models.

Details

Motivation: Pneumonia causes over 700,000 deaths annually in children under five, but accurate diagnosis is limited by radiologist availability and variability in interpretation.

Method: Used 5,216 pediatric chest X-rays split 80/10/10. Compared custom CNNs with transfer learning (ResNet50, DenseNet121, EfficientNet-B0) in frozen-backbone and fine-tuning regimes. Evaluated with accuracy, F1-score, AUC, and Grad-CAM visualizations.

Result: Fine-tuned ResNet50 achieved best performance: 99.43% accuracy, 99.61% F1-score, 99.93% AUC with only 3 misclassifications. Fine-tuning outperformed frozen-backbone by 5.5 percentage points on average. Grad-CAM confirmed clinically relevant lung regions.

Conclusion: Transfer learning with fine-tuning substantially outperforms CNNs trained from scratch, showing near-perfect accuracy for pediatric pneumonia detection. Has strong potential as screening tool in resource-limited settings. Future validation needed on multi-center and adult datasets.

Abstract: Pneumonia is a leading cause of mortality in children under five, with over 700,000 deaths annually. Accurate diagnosis from chest X-rays is limited by radiologist availability and variability. Objective: This study compares custom CNNs trained from scratch with transfer learning (ResNet50, DenseNet121, EfficientNet-B0) for pediatric pneumonia detection, evaluating frozen-backbone and fine-tuning regimes. Methods: A dataset of 5,216 pediatric chest X-rays was split 80/10/10 for training, validation, and testing. Seven models were trained and assessed using accuracy, F1-score, and AUC. Grad-CAM visualizations provided explainability. Results: Fine-tuned ResNet50 achieved the best performance: 99.43% accuracy, 99.61% F1-score, and 99.93% AUC, with only 3 misclassifications. Fine-tuning outperformed frozen-backbone models by 5.5 percentage points on average. Grad-CAM confirmed clinically relevant lung regions guided predictions. Conclusions: Transfer learning with fine-tuning substantially outperforms CNNs trained from scratch for pediatric pneumonia detection, showing near-perfect accuracy. This system has strong potential as a screening tool in resource-limited settings. Future work should validate these findings on multi-center and adult datasets. Keywords: Pneumonia detection, deep learning, transfer learning, CNN, chest X-ray, pediatric diagnosis, ResNet, DenseNet, EfficientNet, Grad-CAM.

[146] LinMU: Multimodal Understanding Made Linear

Hongjie Wang, Niraj K. Jha

Main category: cs.CV

TL;DR: LinMU is a linear-complexity Vision-Language Model that replaces quadratic self-attention with a dual-branch M-MATE block, achieving comparable performance to global-attention VLMs while significantly improving inference efficiency for high-resolution images and long videos.

Details

Motivation: Current Vision-Language Models suffer from quadratic complexity of self-attention, making them prohibitively expensive for high-resolution images and long-context videos, and preventing deployment on edge devices.

Method: LinMU replaces self-attention layers with M-MATE blocks: dual-branch modules combining bidirectional state-space models (Flex-MA) for global context and Swin-style window attention (Local-Swin) for local correlations. Uses three-stage distillation to transform pre-trained VLMs into LinMU architecture.

Result: LinMU matches teacher model performance on MMMU, TextVQA, LongVideoBench, Video-MME benchmarks while reducing Time-To-First-Token by up to 2.7× and improving token throughput by up to 9.0× on minute-length videos.

Conclusion: State-of-the-art multimodal reasoning can be achieved without quadratic attention, enabling efficient long-context VLMs for high-resolution images and long videos, with potential for edge device deployment.

Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.

[147] Unified Review and Benchmark of Deep Segmentation Architectures for Cardiac Ultrasound on CAMUS

Zahid Ullah, Muhammad Hilal, Eunsoo Lee, Dragan Pamucar, Jihie Kim

Main category: cs.CV

TL;DR: Benchmark comparing U-Net, Attention U-Net, and TransUNet for cardiac ultrasound segmentation on CAMUS dataset with standardized preprocessing and evaluation protocols.

Details

Motivation: Address the lack of unified, reproducible experimental benchmarks in cardiac imaging and deep learning, despite numerous review papers summarizing advances in the field.

Method: Controlled comparison of three architectures (U-Net, Attention U-Net, TransUNet) on CAMUS dataset with multiple preprocessing routes: native NIfTI volumes, 16-bit PNG exports, GPT-assisted polygon-based pseudo-labels, and self-supervised pretraining on unlabeled cine frames.

Result: Plain U-Net achieved 94% mean Dice on native NIfTI data, PNG workflow reached 91%. Attention U-Net improved small/low-contrast regions and reduced boundary leakage. TransUNet showed strongest generalization on challenging frames, especially with SSL initialization. Pseudo-labeling improved robustness after confidence filtering.

Conclusion: Provides three contributions: harmonized benchmark of three architectures under standardized conditions, practical guidance on ultrasound data preparation, and outlook on scalable self-supervision and GPT-based annotation pipelines for rapid labeling and quality assurance.

Abstract: Several review papers summarize cardiac imaging and DL advances, few works connect this overview to a unified and reproducible experimental benchmark. In this study, we combine a focused review of cardiac ultrasound segmentation literature with a controlled comparison of three influential architectures, U-Net, Attention U-Net, and TransUNet, on the Cardiac Acquisitions for Multi-Structure Ultrasound Segmentation (CAMUS) echocardiography dataset. Our benchmark spans multiple preprocessing routes, including native NIfTI volumes, 16-bit PNG exports, GPT-assisted polygon-based pseudo-labels, and self-supervised pretraining (SSL) on thousands of unlabeled cine frames. Using identical training splits, losses, and evaluation criteria, a plain U-Net achieved a 94% mean Dice when trained directly on NIfTI data (preserving native dynamic range), while the PNG-16-bit workflow reached 91% under similar conditions. Attention U-Net provided modest improvements on small or low-contrast regions, reducing boundary leakage, whereas TransUNet demonstrated the strongest generalization on challenging frames due to its ability to model global spatial context, particularly when initialized with SSL. Pseudo-labeling expanded the training set and improved robustness after confidence filtering. Overall, our contributions are threefold: a harmonized, apples-to-apples benchmark of U-Net, Attention U-Net, and TransUNet under standardized CAMUS preprocessing and evaluation; practical guidance on maintaining intensity fidelity, resolution consistency, and alignment when preparing ultrasound data; and an outlook on scalable self-supervision and emerging multimodal GPT-based annotation pipelines for rapid labeling, quality assurance, and targeted dataset curation.

[148] Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

Haonan Cai, Yuxuan Luo, Zhouhui Lian

Main category: cs.CV

TL;DR: GAR-Font is an autoregressive framework for few-shot font generation that uses global-aware tokenization, multimodal style encoding with language guidance, and post-refinement to improve structural and stylistic fidelity.

Details

Motivation: Existing few-shot font generation methods struggle with preserving structural integrity and stylistic fidelity from limited references. Autoregressive models are constrained by patch-level tokenization that neglects global dependencies, and current approaches overlook the role of language in conveying stylistic intent during font design.

Method: Proposes GAR-Font with three key components: 1) Global-aware tokenizer capturing both local structures and global stylistic patterns, 2) Multimodal style encoder with lightweight language-style adapter for flexible style control without intensive multimodal pretraining, 3) Post-refinement pipeline to enhance structural fidelity and style coherence.

Result: Extensive experiments show GAR-Font outperforms existing few-shot font generation methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

Conclusion: GAR-Font successfully addresses limitations of existing font generation methods by incorporating global dependencies and multimodal (visual+textual) style guidance, demonstrating superior performance in preserving both structural integrity and stylistic fidelity from few references.

Abstract: Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

[149] Motion-Compensated Latent Semantic Canvases for Visual Situational Awareness on Edge

Igor Lodin, Sergii Filatov, Vira Filatova, Dmytro Filatov

Main category: cs.CV

TL;DR: MCLSC uses motion-gated panoptic segmentation on two latent canvases (static/dynamic) to drastically reduce processing on edge devices while maintaining semantic awareness.

Details

Motivation: Enable visual situational awareness on resource-constrained edge devices by reducing computational overhead of expensive panoptic segmentation while maintaining persistent semantic memory.

Method: Maintains two latent semantic canvases (static accumulating layer, dynamic updating layer) in stabilized baseline coordinates. Uses motion-gated triggering of Mask2Former segmentation - only runs when motion indicates new information. Motion compensation preserves consistent coordinate system.

Result: On 480p clips: reduces segmentation calls by >30x, lowers mean end-to-end processing time by >20x compared to naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.

Conclusion: MCLSC enables efficient visual situational awareness on edge devices through motion-gated segmentation and persistent semantic canvases, dramatically reducing computational load while preserving semantic coherence.

Abstract: We propose Motion-Compensated Latent Semantic Canvases (MCLSC) for visual situational awareness on resource-constrained edge devices. The core idea is to maintain persistent semantic metadata in two latent canvases - a slowly accumulating static layer and a rapidly updating dynamic layer - defined in a baseline coordinate frame stabilized from the video stream. Expensive panoptic segmentation (Mask2Former) runs asynchronously and is motion-gated: inference is triggered only when motion indicates new information, while stabilization/motion compensation preserves a consistent coordinate system for latent semantic memory. On prerecorded 480p clips, our prototype reduces segmentation calls by >30x and lowers mean end-to-end processing time by >20x compared to naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.

[150] DDNet: A Dual-Stream Graph Learning and Disentanglement Framework for Temporal Forgery Localization

Boyang Zhao, Xin Liao, Jiaxin Chen, Xiaoshuai Wu, Yufeng Wu

Main category: cs.CV

TL;DR: DDNet: A dual-stream graph learning framework for temporal forgery localization that combines local artifact detection with global semantic analysis to pinpoint tampered video segments.

Details

Motivation: Current video forgery detection methods are inaccurate because they focus only on local tampering artifacts, missing global anomalies. As AIGC technology advances, attackers can tamper small video segments, making video-level detection insufficient and necessitating precise temporal localization of forged segments.

Method: DDNet uses a dual-stream graph learning approach: 1) Temporal Distance Stream for local artifact detection, 2) Semantic Content Stream for capturing long-range connections and global anomalies. It also includes Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints, and Cross-Level Feature Embedding (CLFE) for robust feature fusion across hierarchical levels.

Result: Outperforms state-of-the-art methods by approximately 9% in AP@0.95 on ForgeryNet and TVIL benchmarks, with significant improvements in cross-domain robustness.

Conclusion: The proposed DDNet effectively addresses limitations of local-view methods by combining local and global analysis, achieving superior performance in temporal forgery localization and demonstrating strong cross-domain generalization capabilities.

Abstract: The rapid evolution of AIGC technology enables misleading viewers by tampering mere small segments within a video, rendering video-level detection inaccurate and unpersuasive. Consequently, temporal forgery localization (TFL), which aims to precisely pinpoint tampered segments, becomes critical. However, existing methods are often constrained by \emph{local view}, failing to capture global anomalies. To address this, we propose a \underline{d}ual-stream graph learning and \underline{d}isentanglement framework for temporal forgery localization (DDNet). By coordinating a \emph{Temporal Distance Stream} for local artifacts and a \emph{Semantic Content Stream} for long-range connections, DDNet prevents global cues from being drowned out by local smoothness. Furthermore, we introduce Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints, alongside Cross-Level Feature Embedding (CLFE) to construct a robust feature foundation via deep fusion of hierarchical features. Experiments on ForgeryNet and TVIL benchmarks demonstrate that our method outperforms state-of-the-art approaches by approximately 9% in AP@0.95, with significant improvements in cross-domain robustness.

[151] VL-OrdinalFormer: Vision Language Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading

Zahid Ullah, Jihie Kim

Main category: cs.CV

TL;DR: VLOrdinalFormer: A vision-language ordinal learning framework for automated knee osteoarthritis grading that combines ViT backbone with CLIP semantic alignment to improve accuracy on subtle early-stage distinctions (KL1 vs KL2).

Details

Motivation: Knee osteoarthritis (KOA) severity assessment using KL grading system is critical but challenging due to subtle radiographic distinctions between early stages (KL1 vs KL2), leading to inter-observer variability among radiologists.

Method: Combines ViT L16 backbone with CORAL-based ordinal regression and CLIP-driven semantic alignment module to incorporate clinically meaningful textual concepts (joint space narrowing, osteophyte formation, subchondral sclerosis). Uses stratified five-fold cross validation, class-aware reweighting, and test-time augmentation with global threshold optimization.

Result: Achieves state-of-the-art performance on OAI kneeKL224 dataset, outperforming CNN and ViT baselines in macro F1 score and overall accuracy. Shows substantial gains for KL1 and KL2 without compromising accuracy for mild/severe cases. Interpretability analyses confirm attention to clinically relevant anatomical regions.

Conclusion: Vision-language aligned ordinal transformers show potential as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.

Abstract: Knee osteoarthritis (KOA) is a leading cause of disability worldwide, and accurate severity assessment using the Kellgren Lawrence (KL) grading system is critical for clinical decision making. However, radiographic distinctions between early disease stages, particularly KL1 and KL2, are subtle and frequently lead to inter-observer variability among radiologists. To address these challenges, we propose VLOrdinalFormer, a vision language guided ordinal learning framework for fully automated KOA grading from knee radiographs. The proposed method combines a ViT L16 backbone with CORAL based ordinal regression and a Contrastive Language Image Pretraining (CLIP) driven semantic alignment module, allowing the model to incorporate clinically meaningful textual concepts related to joint space narrowing, osteophyte formation, and subchondral sclerosis. To improve robustness and mitigate overfitting, we employ stratified five fold cross validation, class aware re weighting to emphasize challenging intermediate grades, and test time augmentation with global threshold optimization. Experiments conducted on the publicly available OAI kneeKL224 dataset demonstrate that VLOrdinalFormer achieves state of the art performance, outperforming CNN and ViT baselines in terms of macro F1 score and overall accuracy. Notably, the proposed framework yields substantial performance gains for KL1 and KL2 without compromising classification accuracy for mild or severe cases. In addition, interpretability analyses using Grad CAM and CLIP similarity maps confirm that the model consistently attends to clinically relevant anatomical regions. These results highlight the potential of vision language aligned ordinal transformers as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.

[152] VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition

Hongbo Jin, Kuanwei Lin, Wenhao Zhang, Yichen Jin, Ge Li

Main category: cs.CV

TL;DR: VideoCuRL: A 2D curriculum RL framework for VideoLLMs that decomposes difficulty into visual perception load and cognitive reasoning depth, using efficient proxies and diagonal wavefront scheduling for improved video understanding.

Details

Motivation: Current RL paradigms for VideoLLMs rely on random data shuffling or naive scalar difficulty metrics, which fail to disentangle the orthogonal challenges of visual temporal perception load and cognitive reasoning depth in video understanding.

Method: Proposes VideoCuRL framework that: 1) Decomposes difficulty into two axes (visual perception load and cognitive reasoning depth), 2) Uses training-free proxies (optical flow/keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity) to map data onto 2D curriculum grid, 3) Implements competence-aware Diagonal Wavefront strategy for training scheduling, and 4) Introduces Dynamic Sparse KL and Structured Revisiting to stabilize training.

Result: VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks, while eliminating prohibitive inference overhead of generation-based curricula.

Conclusion: VideoCuRL offers a scalable solution for robust video post-training by providing an efficient 2D curriculum framework that addresses both visual and cognitive challenges in video understanding without the computational burden of generation-based approaches.

Abstract: Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies, optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity, to map data onto a 2D curriculum grid. A competence aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.

[153] Comparative Evaluation of CNN Architectures for Neural Style Transfer in Indonesian Batik Motif Generation: A Comprehensive Study

Happy Gery Pangestu, Andi Prademon Yunus, Siti Khomsah

Main category: cs.CV

TL;DR: Systematic comparison of CNN backbones for Neural Style Transfer on Indonesian batik shows ResNet architectures offer 5-6x faster convergence and 16x fewer FLOPs than VGG while maintaining similar structural preservation, making them more practical for resource-limited deployment.

Details

Motivation: Existing NST approaches for Indonesian batik preservation rely heavily on VGG architectures that have high computational/memory demands, limiting practical deployment in resource-constrained environments. There's a need to evaluate more efficient alternatives.

Method: Conducted 245 controlled experiments comparing five CNN backbones (VGG16, VGG19, Inception V3, ResNet50, ResNet101) using quantitative metrics (SSIM, LPIPS), qualitative assessment, and statistical analysis (ANOVA) to evaluate structural preservation, stylistic behavior, and computational efficiency trade-offs.

Result: Backbone selection doesn’t significantly affect structural similarity (ANOVA p=0.83). ResNet architectures achieve 5-6x faster convergence than VGG with similar perceptual similarity (LPIPS=0.53) and require 16x fewer FLOPs (0.63 vs 10.12 GFLOPs). VGG produces denser painterly textures, ResNet favors geometric stability and stroke preservation, Inception V3 shows intermediate/noisier behavior.

Conclusion: Architectural choice in NST should shift from maximizing stylistic intensity toward efficiency-aware, structure-preserving deployment. ResNet-based backbones provide a practical foundation for scalable, industry-oriented batik generation in resource-limited environments.

Abstract: Neural Style Transfer (NST) provides a computational framework for the digital preservation and generative exploration of Indonesian batik motifs; however, existing approaches remain largely centered on VGG-based architectures whose strong stylistic expressiveness comes at the cost of high computational and memory demands, that limits practical deployment in resource-limited environments. This study presents a systematic comparative analysis of five widely used CNN backbones, namely VGG16, VGG19, Inception V3, ResNet50, and ResNet101, based on 245 controlled experiments combining quantitative metrics, qualitative assessment, and statistical analysis to examine the trade-off between structural preservation, stylistic behavior, and computational efficiency. The results show that backbone selection does not yield statistically significant differences in structural similarity, as confirmed by ANOVA on SSIM (p= 0.83), indicating comparable levels of structural preservation rather than equivalent stylistic quality. Within this context, ResNet-based architectures achieve approximately 5-6x faster convergence than VGG models while maintaining similar perceptual similarity (LPIPS = 0.53) and requiring over 16x fewer FLOPs (0.63 vs 10.12 GFLOPs). Qualitative analysis reveals consistent stylistic trade-offs, with VGG producing denser painterly textures, ResNet favoring geometric stability and canting stroke preservation with milder stylization, and Inception V3 exhibiting intermediate but noisier behavior. These findings reposition architectural choice in NST from maximizing stylistic intensity toward efficiency-aware and structure-preserving deployment, highlighting ResNet-based backbones as a practical foundation for scalable, industry-oriented batik generation.

[154] CornViT: A Multi-Stage Convolutional Vision Transformer Framework for Hierarchical Corn Kernel Analysis

Sai Teja Erukude, Jane Mascarenhas, Lior Shamir

Main category: cs.CV

TL;DR: CornViT: A three-stage Vision Transformer framework for automated corn kernel grading that achieves 91-94% accuracy across purity, morphology, and embryo orientation tasks, outperforming CNN baselines.

Details

Motivation: Corn kernel grading is critical for seed certification, directional seeding, and breeding but is still predominantly done manually, requiring an automated solution that emulates human hierarchical reasoning.

Method: Three-stage hierarchical CvT-13 framework: Stage 1 (purity classification), Stage 2 (morphology classification for pure kernels), Stage 3 (embryo orientation for pure-flat kernels). Uses ImageNet-22k pretrained backbones with head-only fine-tuning on curated datasets.

Result: Achieves test accuracies of 93.76% (purity), 94.11% (shape), and 91.12% (embryo orientation). Outperforms ResNet-50 (76.56-81.02%) and DenseNet-121 (86.56-89.38%). Framework deployed as Flask web application with interpretable outputs.

Conclusion: CornViT demonstrates convolution-augmented self-attention’s advantages for kernel analysis. Provides complete solution with datasets, code, and web application for automated corn kernel quality assessment in seed workflows.

Abstract: Accurate grading of corn kernels is critical for seed certification, directional seeding, and breeding, yet it is still predominantly performed by manual inspection. This work introduces CornViT, a three-stage Convolutional Vision Transformer (CvT) framework that emulates the hierarchical reasoning of human seed analysts for single-kernel evaluation. Three sequential CvT-13 classifiers operate on 384x384 RGB images: Stage 1 distinguishes pure from impure kernels; Stage 2 categorizes pure kernels into flat and round morphologies; and Stage 3 determines the embryo orientation (up vs. down) for pure, flat kernels. Starting from a public corn seed image collection, we manually relabeled and filtered images to construct three stage-specific datasets: 7265 kernels for purity, 3859 pure kernels for morphology, and 1960 pure-flat kernels for embryo orientation, all released as benchmarks. Head-only fine-tuning of ImageNet-22k pretrained CvT-13 backbones yields test accuracies of 93.76% for purity, 94.11% for shape, and 91.12% for embryo-orientation detection. Under identical training conditions, ResNet-50 reaches only 76.56 to 81.02 percent, whereas DenseNet-121 attains 86.56 to 89.38 percent accuracy. These results highlight the advantages of convolution-augmented self-attention for kernel analysis. To facilitate adoption, we deploy CornViT in a Flask-based web application that performs stage-wise inference and exposes interpretable outputs through a browser interface. Together, the CornViT framework, curated datasets, and web application provide a deployable solution for automated corn kernel quality assessment in seed quality workflows. Source code and data are publicly available.

[155] TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning

Meng Chu, Yukang Chen, Haokun Gui, Shaozuo Yu, Yi Wang, Jiaya Jia

Main category: cs.CV

TL;DR: TraveLLaMA is a specialized multimodal language model for travel assistance that introduces TravelQA dataset, Travel-CoT reasoning framework, and achieves significant performance improvements over general-purpose models.

Details

Motivation: Existing multimodal AI systems lack specialized knowledge and contextual understanding of urban environments needed for effective travel planning and tourism assistance.

Method: Developed TravelQA dataset (265k QA pairs with text, vision-language, and expert reasoning), Travel-CoT structured reasoning framework, and fine-tuned state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra).

Result: Achieved 6.2-9.4% base improvements on fine-tuned models, with Travel-CoT providing additional 10.8% accuracy boost. User studies with 500 participants showed System Usability Scale score of 82.5, significantly outperforming general-purpose models.

Conclusion: TraveLLaMA establishes new standards for multimodal travel assistance systems with superior capabilities in contextual recommendations, map interpretation, scene understanding, and practical information delivery.

Abstract: Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions: (1) TravelQA, a novel dataset of 265k question-answer pairs combining 160k text QA from authentic travel sources, 100k vision-language QA featuring maps and location imagery, and 5k expert-annotated Chain-of-Thought reasoning examples; (2) Travel-CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies. Through fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we achieve 6.2-9.4% base improvements, further enhanced by Travel-CoT reasoning. Our model demonstrates superior capabilities in contextual travel recommendations, map interpretation, and scene understanding while providing practical information such as operating hours and cultural insights. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general-purpose models and establishing new standards for multimodal travel assistance systems.

[156] Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems

Eliot Park, Abhi Kumar, Pranav Rajpurkar

Main category: cs.CV

TL;DR: Vision-language models (GPT-4o, GPT-4o-mini, Claude 3.5) show improved contextual understanding for recycling classification but still have limitations in handling complex scenarios like location-specific rules, contamination, and multi-material objects.

Details

Motivation: Accurate recycling classification remains challenging for the public despite its importance for environmental sustainability. There's a need for better tools to help people determine recyclability and proper disposal methods.

Method: Evaluated three vision-language models (GPT-4o, GPT-4o-mini, Claude 3.5) using a curated image dataset. Tested their ability to match objects to appropriate recycling bins, including physical fit assessment. Investigated performance on challenging scenarios: location-specific guidelines, contamination/damage, and multi-material objects.

Result: Models show significant advancements in contextual understanding compared to previous iterations. They can effectively match objects to appropriate bins and assess physical fit. However, performance varies across challenging scenarios, with limitations in handling location-specific rules, contamination assessment, and multi-material classification.

Conclusion: While current vision-language models represent progress in recycling classification, they still have limitations in complex real-world scenarios. Continued refinement of context-aware models is essential for improving public recycling practices and advancing environmental sustainability.

Abstract: While the importance of efficient recycling is widely acknowledged, accurately determining the recyclability of items and their proper disposal remains a complex task for the general public. In this study, we explore the application of cutting-edge vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) for predicting the recyclability of commonly disposed items. Utilizing a curated dataset of images, we evaluated the models’ ability to match objects to appropriate recycling bins, including assessing whether the items could physically fit into the available bins. Additionally, we investigated the models’ performance across several challenging scenarios: (i) adjusting predictions based on location-specific recycling guidelines; (ii) accounting for contamination or structural damage; and (iii) handling objects composed of multiple materials. Our findings highlight the significant advancements in contextual understanding offered by these models compared to previous iterations, while also identifying areas where they still fall short. The continued refinement of context-aware models is crucial for enhancing public recycling practices and advancing environmental sustainability.

[157] Clean-GS: Semantic Mask-Guided Pruning for 3D Gaussian Splatting

Subhankar Mishra

Main category: cs.CV

TL;DR: Clean-GS removes background clutter and floaters from 3D Gaussian Splatting reconstructions using sparse semantic masks, achieving 60-80% model compression while preserving object quality.

Details

Motivation: 3D Gaussian Splatting produces high-quality reconstructions but generates hundreds of thousands of spurious Gaussians (floaters) that obscure objects and inflate model sizes, hindering deployment in bandwidth-constrained applications like web and AR/VR.

Method: Multi-stage approach: (1) whitelist filtering via projection to masked regions using sparse semantic masks (as few as 3 masks, 1% of views), (2) depth-buffered color validation, and (3) neighbor-based outlier removal to isolate target objects from complex scenes.

Result: Achieves 60-80% model compression, reducing file sizes from 125MB to 47MB on Tanks and Temples dataset while maintaining rendering quality, making 3DGS models practical for web deployment and AR/VR applications.

Conclusion: Clean-GS effectively removes background clutter and floaters using minimal semantic information, unlike existing pruning methods that rely on global importance metrics, enabling practical deployment of 3DGS models in bandwidth-constrained scenarios.

Abstract: 3D Gaussian Splatting produces high-quality scene reconstructions but generates hundreds of thousands of spurious Gaussians (floaters) scattered throughout the environment. These artifacts obscure objects of interest and inflate model sizes, hindering deployment in bandwidth-constrained applications. We present Clean-GS, a method for removing background clutter and floaters from 3DGS reconstructions using sparse semantic masks. Our approach combines whitelist-based spatial filtering with color-guided validation and outlier removal to achieve 60-80% model compression while preserving object quality. Unlike existing 3DGS pruning methods that rely on global importance metrics, Clean-GS uses semantic information from as few as 3 segmentation masks (1% of views) to identify and remove Gaussians not belonging to the target object. Our multi-stage approach consisting of (1) whitelist filtering via projection to masked regions, (2) depth-buffered color validation, and (3) neighbor-based outlier removal isolates monuments and objects from complex outdoor scenes. Experiments on Tanks and Temples show that Clean-GS reduces file sizes from 125MB to 47MB while maintaining rendering quality, making 3DGS models practical for web deployment and AR/VR applications. Our code is available at https://github.com/smlab-niser/clean-gs

[158] Efficient Hyperspectral Image Reconstruction Using Lightweight Separate Spectral Transformers

Jianan Li, Wangcai Zhao, Tingfa Xu

Main category: cs.CV

TL;DR: LSST: Lightweight Separate Spectral Transformer for efficient hyperspectral image reconstruction from compressive sensing measurements using spectral-spatial separation and focal spectrum loss.

Details

Motivation: Hyperspectral imaging captures rich spectral data but faces challenges in efficient reconstruction from compressive sensing measurements. Existing methods struggle with computational efficiency while handling both spectral and spatial characteristics effectively.

Method: Divide-and-conquer strategy with LSST architecture: Separate Spectral Transformer Blocks (SSTB) for spectral modeling using Grouped Spectral Self-attention and Spectrum Shuffle, and Lightweight Spatial Convolution Blocks (LSCB) for spatial processing using depth-wise separable convolutions. Also introduces Focal Spectrum Loss for dynamic training weighting.

Result: LSST achieves superior reconstruction performance while requiring fewer FLOPs and parameters compared to existing methods, demonstrating both efficiency and effectiveness.

Conclusion: The proposed LSST framework provides an efficient and effective solution for hyperspectral image reconstruction from compressive sensing measurements by leveraging spectral-spatial separation and novel attention mechanisms.

Abstract: Hyperspectral imaging (HSI) is essential across various disciplines for its capacity to capture rich spectral information. However, efficiently reconstructing hyperspectral images from compressive sensing measurements presents significant challenges. To tackle these, we adopt a divide-and-conquer strategy that capitalizes on the unique spectral and spatial characteristics of hyperspectral images. We introduce the Lightweight Separate Spectral Transformer (LSST), an innovative architecture tailored for efficient hyperspectral image reconstruction. This architecture consists of Separate Spectral Transformer Blocks (SSTB) for modeling spectral relationships and Lightweight Spatial Convolution Blocks (LSCB) for spatial processing. The SSTB employs Grouped Spectral Self-attention and a Spectrum Shuffle operation to effectively manage both local and non-local spectral relationships. Simultaneously, the LSCB utilizes depth-wise separable convolutions and strategic ordering to enhance spatial information processing. Furthermore, we implement the Focal Spectrum Loss, a novel loss weighting mechanism that dynamically adjusts during training to improve reconstruction across spectrally complex bands. Extensive testing demonstrates that our LSST achieves superior performance while requiring fewer FLOPs and parameters, underscoring its efficiency and effectiveness. The source code is available at: https://github.com/wcz1124/LSST.

[159] Four-Stage Alzheimer’s Disease Classification from MRI Using Topological Feature Extraction, Feature Selection, and Ensemble Learning

Faisal Ahmed

Main category: cs.CV

TL;DR: TDA-Alz: A topological data analysis + ensemble learning framework achieves 98.19% accuracy for 4-stage Alzheimer’s disease severity classification from MRI, outperforming deep learning methods without needing data augmentation or heavy computation.

Details

Motivation: Accurate Alzheimer's disease severity classification from MRI is challenging due to limited data and poor interpretability of deep learning models. Current methods rely on deep convolutional architectures with extensive data augmentation, which are computationally expensive and lack transparency.

Method: Proposes TDA-Alz framework using topological data analysis (TDA) to extract topological descriptors capturing intrinsic structural patterns from brain MRI. Performs feature selection to retain discriminative topological features, then uses ensemble learning for robust multiclass classification of four AD stages (non-demented, moderate dementia, mild, very mild).

Result: Achieves 98.19% accuracy and 99.75% AUC on OASIS-1 MRI dataset, outperforming or matching state-of-the-art deep learning methods. Framework requires no data augmentation, pretrained networks, or large computational resources, making it computationally efficient and fast.

Conclusion: TDA-Alz offers a powerful, lightweight, and interpretable alternative to deep learning for MRI-based AD severity classification. Topological features provide greater interpretability by linking directly to structural characteristics, making it suitable for real-world clinical decision-support systems.

Abstract: Accurate and efficient classification of Alzheimer’s disease (AD) severity from brain magnetic resonance imaging (MRI) remains a critical challenge, particularly when limited data and model interpretability are of concern. In this work, we propose TDA-Alz, a novel framework for four-stage Alzheimer’s disease severity classification (non-demented, moderate dementia, mild, and very mild) using topological data analysis (TDA) and ensemble learning. Instead of relying on deep convolutional architectures or extensive data augmentation, our approach extracts topological descriptors that capture intrinsic structural patterns of brain MRI, followed by feature selection to retain the most discriminative topological features. These features are then classified using an ensemble learning strategy to achieve robust multiclass discrimination. Experiments conducted on the OASIS-1 MRI dataset demonstrate that the proposed method achieves an accuracy of 98.19% and an AUC of 99.75%, outperforming or matching state-of-the-art deep learning–based methods reported on OASIS and OASIS-derived datasets. Notably, the proposed framework does not require data augmentation, pretrained networks, or large-scale computational resources, making it computationally efficient and fast compared to deep neural network approaches. Furthermore, the use of topological descriptors provides greater interpretability, as the extracted features are directly linked to the underlying structural characteristics of brain MRI rather than opaque latent representations. These results indicate that TDA-Alz offers a powerful, lightweight, and interpretable alternative to deep learning models for MRI-based Alzheimer’s disease severity classification, with strong potential for real-world clinical decision-support systems.

[160] A UAV-Based Multispectral and RGB Dataset for Multi-Stage Paddy Crop Monitoring in Indian Agricultural Fields

Adari Rama Sukanya, Puvvula Roopesh Naga Sri Sai, Kota Moses, Rimalapudi Sarvendranath

Main category: cs.CV

TL;DR: A large-scale UAV dataset of RGB and multispectral images covering all growth stages of paddy crops in India, with high-resolution (1 cm/pixel) imagery and rich metadata for agricultural research applications.

Details

Motivation: To address the lack of comprehensive, high-resolution UAV datasets covering all growth stages of Indian paddy crops with rich metadata, which is needed for agricultural research applications like targeted spraying, disease analysis, and yield estimation.

Method: Used UAVs equipped with 20MP RGB and 5MP four-band multispectral cameras to capture images over paddy fields in Vijayawada, India. Developed SOPs and checklists for repeatable data acquisition. Collected 42,430 raw images with GPS coordinates, flight altitude, and environmental metadata. Validated images using Pix4D Fields to generate orthomosaic and vegetation index maps (NDVI, NDRE).

Result: Created a dataset of 415 GB with 1 cm/pixel GSD covering 5 acres, spanning nursery to harvesting stages. The dataset includes orthomosaic maps, vegetation index maps, and comprehensive metadata. It’s one of few datasets providing high-resolution images covering all growth stages of Indian paddy crops.

Conclusion: The dataset is publicly available on IEEE DataPort and can support various agricultural research applications including targeted spraying, disease analysis, and yield estimation studies for paddy crops.

Abstract: We present a large-scale unmanned aerial vehicle (UAV)-based RGB and multispectral image dataset collected over paddy fields in the Vijayawada region, Andhra Pradesh, India, covering nursery to harvesting stages. We used a 20-megapixel RGB camera and a 5-megapixel four-band multispectral camera capturing red, green, red-edge, and near-infrared bands. Standardised operating procedure (SOP) and checklists were developed to ensure repeatable data acquisition. Our dataset comprises of 42,430 raw images (415 GB) captured over 5 acres with 1 cm/pixel ground sampling distance (GSD) with associated metadata such as GPS coordinates, flight altitude, and environmental conditions. Captured images were validated using Pix4D Fields to generate orthomosaic maps and vegetation index maps, such as normalised difference vegetation index (NDVI) and normalised difference red-edge (NDRE) index. Our dataset is one of the few datasets that provide high-resolution images with rich metadata that cover all growth stages of Indian paddy crops. The dataset is available on IEEE DataPort with DOI, . It can support studies on targeted spraying, disease analysis, and yield estimation.

[161] Application of deep learning techniques in non-contrast computed tomography pulmonary angiogram for pulmonary embolism diagnosis

I-Hsien Ting, Yi-Jun Tseng, Yu-Sheng Lin

Main category: cs.CV

TL;DR: Deep learning model achieves 85% accuracy in classifying pulmonary embolism from non-contrast CT images, addressing risks of contrast medium in kidney disease patients.

Details

Motivation: Contrast medium in CT pulmonary angiography can cause acute kidney injury in patients with pulmonary embolism and chronic kidney disease, and the time required for contrast administration may delay treatment in acute cases. There's a need for accurate pulmonary embolism diagnosis without contrast medium.

Method: Used a 3D convolutional neural network model to automatically classify pulmonary embolism in CT images without contrast medium.

Result: The model achieved 85% accuracy and 0.84 AUC in classifying pulmonary embolism from non-contrast CT images, demonstrating significant impact and feasibility.

Conclusion: Deep learning techniques can effectively diagnose pulmonary embolism from non-contrast CT images, potentially avoiding contrast-related complications and treatment delays while maintaining diagnostic accuracy.

Abstract: Pulmonary embolism is a life-threatening disease, early detection and treatment can significantly reduce mortality. In recent years, many studies have been using deep learning in the diagnosis of pulmonary embolism with contrast medium computed tomography pulmonary angiography, but the contrast medium is likely to cause acute kidney injury in patients with pulmonary embolism and chronic kidney disease, and the contrast medium takes time to work, patients with acute pulmonary embolism may miss the golden treatment time. This study aims to use deep learning techniques to automatically classify pulmonary embolism in CT images without contrast medium by using a 3D convolutional neural network model. The deep learning model used in this study had a significant impact on the pulmonary embolism classification of computed tomography images without contrast with 85% accuracy and 0.84 AUC, which confirms the feasibility of the model in the diagnosis of pulmonary embolism.

[162] Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization

Abhinav Attri, Rajeev Ranjan Dwivedi, Samiran Das, Vinod Kumar Kurmi

Main category: cs.CV

TL;DR: HAQAGen is a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity using histogram matching, SPADE priors, and Mamba backbone with texture-aware supervision.

Details

Motivation: Existing NIR-to-RGB translation methods struggle to balance chromatic realism with structural fidelity while maintaining resolution invariance. There's a need for a model that can simultaneously enforce global color statistics and local chromatic consistency while scaling to native resolutions without compromising texture fidelity.

Method: The model introduces: (1) combined loss with differentiable histogram matching, perceptual quality measures, and feature similarity; (2) local hue-saturation priors via SPADE for chromatic stabilization; (3) texture-aware supervision within a Mamba backbone; and (4) adaptive-resolution inference engine for high-resolution translation.

Result: Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR datasets show consistent improvements over state-of-the-art baselines. HAQAGen produces images with sharper textures and natural colors, achieving significant gains in perceptual metrics.

Conclusion: HAQAGen is positioned as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios, successfully balancing chromatic realism with structural fidelity while maintaining resolution invariance.

Abstract: We present HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity. The proposed model introduces (i) a combined loss term aligning the global color statistics through differentiable histogram matching, perceptual image quality measure, and feature based similarity to preserve texture information, (ii) local hue-saturation priors injected via Spatially Adaptive Denormalization (SPADE) to stabilize chromatic reconstruction, and (iii) texture-aware supervision within a Mamba backbone to preserve fine details. We introduce an adaptive-resolution inference engine that further enables high-resolution translation without sacrificing quality. Our proposed NIR-to-RGB translation model simultaneously enforces global color statistics and local chromatic consistency, while scaling to native resolutions without compromising texture fidelity or generalization. Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR using different evaluation metrics demonstrate consistent improvements over state-of-the-art baseline methods. HAQAGen produces images with sharper textures, natural colors, attaining significant gains as per perceptual metrics. These results position HAQAGen as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios. Project Page: https://rajeev-dw9.github.io/HAQAGen/

[163] Analyzing the Shopping Journey: Computing Shelf Browsing Visits in a Physical Retail Store

Luis Yoichi Morales, Francesco Zanlungo, David M. Woollard

Main category: cs.CV

TL;DR: Algorithm for detecting shopper browsing behavior (“shelf visits”) from 3D tracking data, validated across different retail stores, enabling analysis of browsing patterns and purchase relationships.

Details

Motivation: To enable autonomous understanding of shopper intent in retail environments, particularly for customer-facing robot deployment and retail planning.

Method: Developed algorithm to compute “shelf visits” from machine vision-based 3D tracking data using overhead cameras. Calibrated models on two independent trajectory datasets (8138 and 15129 trajectories) from different stores with human-labeled ground truth.

Result: Algorithm successfully recognized customer browsing activity across different store environments. Model was used to analyze browsing patterns and their relationship to actual purchases on large trajectory datasets.

Conclusion: Shelf browsing information has practical applications for retail planning and human-robot interaction scenarios in customer-facing retail environments.

Abstract: Motivated by recent challenges in the deployment of robots into customer-facing roles within retail, this work introduces a study of customer activity in physical stores as a step toward autonomous understanding of shopper intent. We introduce an algorithm that computes shoppers’ shelf visits'' -- capturing their browsing behavior in the store. Shelf visits are extracted from trajectories obtained via machine vision-based 3D tracking and overhead cameras. We perform two independent calibrations of the shelf visit algorithm, using distinct sets of trajectories (consisting of 8138 and 15129 trajectories), collected in different stores and labeled by human reviewers. The calibrated models are then evaluated on trajectories held out of the calibration process both from the same store on which calibration was performed and from the other store. An analysis of the results shows that the algorithm can recognize customers' browsing activity when evaluated in an environment different from the one on which calibration was performed. We then use the model to analyze the customers' browsing patterns’’ on a large set of trajectories and their relation to actual purchases in the stores. Finally, we discuss how shelf browsing information could be used for retail planning and in the domain of human-robot interaction scenarios.

[164] MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

Main category: cs.CV

TL;DR: MS-ISSM is a novel point cloud quality assessment method that uses implicit functions for structural similarity measurement and a hierarchical network for quality prediction.

Details

Motivation: The unstructured and irregular nature of point clouds makes objective quality assessment challenging, especially in establishing accurate perceptual feature correspondence between reference and distorted point clouds.

Method: Proposes Multi-scale Implicit Structural Similarity Measurement (MS-ISSM) using Radial Basis Functions to represent local features continuously, transforming distortion measurement into comparison of implicit function coefficients. Also introduces ResGrouped-MLP network with grouped encoding strategy integrated with Residual Blocks and Channel-wise Attention mechanisms.

Result: Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization.

Conclusion: MS-ISSM effectively addresses the challenges of point cloud quality assessment by avoiding matching errors in irregular data and adaptively focusing on salient distortion features across different scales.

Abstract: The unstructured and irregular nature of point clouds poses a significant challenge for objective quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes Radial Basis Functions (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat MLPs by adopting a grouped encoding strategy integrated with Residual Blocks and Channel-wise Attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

[165] ShadowGS: Shadow-Aware 3D Gaussian Splatting for Satellite Imagery

Feng Luo, Hongbo Pan, Xiang Yang, Baoyu Jiang, Fengqing Liu, Tao Huang

Main category: cs.CV

TL;DR: ShadowGS: A 3D Gaussian Splatting framework for satellite imagery that models consistent shadows across multi-temporal images using physics-based rendering and ray marching, improving 3D reconstruction accuracy and shadow decoupling.

Details

Motivation: In multi-temporal satellite imagery, shadows exhibit significant inconsistencies due to varying illumination conditions, which poses challenges for accurate 3D reconstruction and shadow modeling.

Method: Proposes ShadowGS framework based on 3D Gaussian Splatting that uses: 1) physics-based rendering equation from remote sensing, 2) efficient ray marching technique, 3) shadow consistency constraint for geometric accuracy, and 4) shadow map prior for sparse-view inputs.

Result: Outperforms state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction precision, and novel view synthesis quality with only minutes of training. Shows robust performance across RGB, pansharpened, and sparse-view satellite inputs.

Conclusion: ShadowGS effectively addresses shadow inconsistencies in multi-temporal satellite imagery, enabling precise shadow modeling and improved 3D reconstruction while maintaining efficient rendering capabilities.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a novel paradigm for 3D reconstruction from satellite imagery. However, in multi-temporal satellite images, prevalent shadows exhibit significant inconsistencies due to varying illumination conditions. To address this, we propose ShadowGS, a novel framework based on 3DGS. It leverages a physics-based rendering equation from remote sensing, combined with an efficient ray marching technique, to precisely model geometrically consistent shadows while maintaining efficient rendering. Additionally, it effectively disentangles different illumination components and apparent attributes in the scene. Furthermore, we introduce a shadow consistency constraint that significantly enhances the geometric accuracy of 3D reconstruction. We also incorporate a novel shadow map prior to improve performance with sparse-view inputs. Extensive experiments demonstrate that ShadowGS outperforms current state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction precision, and novel view synthesis quality, with only a few minutes of training. ShadowGS exhibits robust performance across various settings, including RGB, pansharpened, and sparse-view satellite inputs.

[166] Learning to Segment Liquids in Real-world Images

Jonas Li, Michelle Li, Luke Liu, Heng Fan

Main category: cs.CV

TL;DR: A new large-scale dataset (LQDS) and detection model (LQDM) for liquid segmentation, addressing the challenging task of detecting diverse liquid appearances in real-world images.

Details

Motivation: Liquids are ubiquitous in daily life (water, wine, medicine) but limited research exists on liquid segmentation, hindering robots' ability to safely interact with or avoid liquids. The task is difficult due to liquids' diverse appearances, transparency, reflectivity, and ability to take on background characteristics.

Method: Created LQDS dataset with 5000 real-world images annotated into 14 liquid classes. Designed LQDM model using cross-attention between a dedicated boundary branch and main segmentation branch to enhance predictions.

Result: Extensive experiments show LQDM outperforms state-of-the-art methods on the LQDS test set, establishing a strong baseline for liquid semantic segmentation.

Conclusion: The paper presents a comprehensive solution for liquid segmentation through both dataset creation and model innovation, addressing a previously under-explored but important computer vision task with practical applications in robotics.

Abstract: Different types of liquids such as water, wine and medicine appear in all aspects of daily life. However, limited attention has been given to the task, hindering the ability of robots to avoid or interact with liquids safely. The segmentation of liquids is difficult because liquids come in diverse appearances and shapes; moreover, they can be both transparent or reflective, taking on arbitrary objects and scenes from the background or surroundings. To take on this challenge, we construct a large-scale dataset of liquids named LQDS consisting of 5000 real-world images annotated into 14 distinct classes, and design a novel liquid detection model named LQDM, which leverages cross-attention between a dedicated boundary branch and the main segmentation branch to enhance segmentation predictions. Extensive experiments demonstrate the effectiveness of LQDM on the test set of LQDS, outperforming state-of-the-art methods and establishing a strong baseline for the semantic segmentation of liquids.

[167] PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education

Megha Mariam K. M, Aditya Arun, Zakaria Laskar, C. V. Jawahar

Main category: cs.CV

TL;DR: Researchers introduce a benchmark to evaluate Text-to-Video (T2V) models for generating educational physics videos, finding they produce visually coherent content but struggle with conceptual accuracy, especially for abstract topics like electromagnetism.

Details

Motivation: To systematically assess the potential of generative AI models, particularly T2V systems, for transforming science education by automating the creation of engaging visual explanations for physics concepts.

Method: Created a dedicated benchmark for physics education video generation, decomposing physics concepts into granular teaching points with carefully crafted prompts. T2V models are evaluated on their ability to generate accurate videos in response to these prompts.

Result: Current T2V models produce visually coherent videos with smooth motion and minimal flickering, but conceptual accuracy is less reliable. Performance is encouraging for mechanics, fluids, and optics, but models struggle with electromagnetism and thermodynamics where abstract interactions are harder to depict.

Conclusion: There’s a significant gap between visual quality and conceptual correctness in educational video generation. The benchmark aims to help close this gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale for scalable, accessible, and personalized learning.

Abstract: Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward evaluating their potential in physics education by introducing a dedicated benchmark for explanatory video generation. The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Each physics concept in our benchmark is decomposed into granular teaching points, with each point accompanied by a carefully crafted prompt intended for visual explanation of the teaching point. T2V models are evaluated on their ability to generate accurate videos in response to these prompts. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content-paving the way toward scalable, accessible, and personalized learning experiences powered by AI. Our evaluation reveals that current models produce visually coherent videos with smooth motion and minimal flickering, yet their conceptual accuracy is less reliable. Performance in areas such as mechanics, fluids, and optics is encouraging, but models struggle with electromagnetism and thermodynamics, where abstract interactions are harder to depict. These findings underscore the gap between visual quality and conceptual correctness in educational video generation. We hope this benchmark helps the community close that gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale. The benchmark and accompanying codebase are publicly available at https://github.com/meghamariamkm/PhyEduVideo.

[168] Deep Clustering with Associative Memories

Bishwajit Saha, Dmitry Krotov, Mohammed J. Zaki, Parikshit Ram

Main category: cs.CV

TL;DR: DCAM is a novel deep clustering method that uses energy-based dynamics via Associative Memories to better integrate representation learning and clustering in a single objective.

Details

Motivation: Current deep clustering methods treat representation learning and clustering as somewhat disjointed processes because clustering is inherently discrete while representation learning is differentiable, requiring approximations that disconnect the two components.

Method: Proposes DCAM (Deep Clustering via Associative Memories), a novel loss function using energy-based dynamics via Associative Memories to formulate a unified deep clustering method that intricately ties representation learning and clustering in a single objective.

Result: DCAM produces improved clustering quality across various architecture choices (convolutional, residual, fully-connected) and data modalities (images and text).

Conclusion: The proposed DCAM method successfully integrates representation learning and clustering more effectively than previous approaches, demonstrating superior performance across diverse architectures and data types.

Abstract: Deep clustering - joint representation learning and latent space clustering - is a well studied problem especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization task, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leads to a somewhat disjointed representation learning and clustering. In this work, we propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM, which ties together the representation learning and clustering aspects more intricately in a single objective. Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices (convolutional, residual or fully-connected) and data modalities (images or text).

[169] A Deep Learning Approach for Automated Skin Lesion Diagnosis with Explainable AI

Md. Maksudul Haque, Rahnuma Akter, A S M Ahsanul Sarkar Akib, Abdul Hasib

Main category: cs.CV

TL;DR: Deep learning system for multi-class skin lesion classification achieves 91.15% accuracy using EfficientNetV2-L with attention mechanisms, data balancing, augmentation, and progressive learning, enhanced by explainable AI techniques.

Details

Motivation: Skin cancer requires timely and precise diagnosis due to its prevalence and danger. There's a need for accurate automated classification systems that can assist in clinical diagnosis while maintaining transparency.

Method: Combines data balancing methods, large-scale data augmentation, hybridized EfficientNetV2-L framework with channel attention, and three-stage progressive learning approach. Uses explainable AI techniques (Grad-CAM and saliency maps) for visual interpretability.

Result: Achieved 91.15% accuracy, 85.45% macro F1 score, and 99.33% micro-average AUC on HAM10000 dataset. Performed well across all seven lesion classes, with particularly high performance on melanoma and melanocytic nevi.

Conclusion: The proposed deep learning system provides accurate skin lesion classification while enhancing diagnostic transparency through explainable AI, improving clinical trustworthiness by revealing visual characteristics driving classifications.

Abstract: Skin cancer is also one of the most common and dangerous types of cancer in the world that requires timely and precise diagnosis. In this paper, a deep-learning architecture of the multi-class skin lesion classification on the HAM10000 dataset will be described. The system suggested combines high-quality data balancing methods, large-scale data augmentation, hybridized EfficientNetV2-L framework with channel attention, and a three-stage progressive learning approach. Moreover, we also use explainable AI (XAI) techniques such as Grad-CAM and saliency maps to come up with intelligible visual representations of model predictions. Our strategy is with a total accuracy of 91.15 per cent, macro F1 of 85.45% and micro-average AUC of 99.33%. The model has shown high performance in all the seven lesion classes with specific high performance of melanoma and melanocytic nevi. In addition to enhancing diagnostic transparency, XAI also helps to find out the visual characteristics that cause the classifications, which enhances clinical trustworthiness.

[170] Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss

Lin Xi, Yingliang Ma, Xiahai Zhuang

Main category: cs.CV

TL;DR: A novel FSVOS model with local matching strategy and direction-based sampling for efficient video segmentation, plus supervised spatio-temporal contrastive learning and a new X-ray angiography dataset.

Details

Motivation: Existing video segmentation methods suffer from inefficient implementations (im2col-like operations, hardware-specific CUDA kernels) with limited portability across non-CUDA devices. There's a need for more flexible, efficient approaches that can adapt to diverse spatial structures without computational costs of parametric layers or model retraining.

Method: 1) Local matching strategy to restrict search space to relevant neighboring pixels; 2) Direction-based sampling perspective reorganizing local sampling process; 3) Non-parametric sampling mechanism enabling dynamically varying sampling regions; 4) Supervised spatio-temporal contrastive learning scheme for feature coherence across frames; 5) Introduction of MOSXAV benchmark dataset for multi-object segmentation in X-ray angiography videos.

Result: Extensive experiments on CADICA, XACV, and MOSXAV datasets show the proposed FSVOS method outperforms current state-of-the-art video segmentation methods in segmentation accuracy and generalization capability (both seen and unseen categories).

Conclusion: The work offers enhanced flexibility and potential for a wide range of clinical applications through an efficient, portable video segmentation approach with improved accuracy and generalization.

Abstract: We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications.

[171] UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data

Joshua Kawaguchi, Saad Manzur, Emily Gao Wang, Maitreyi Sinha, Bryan Vela, Yunxi Wang, Brandon Vela, Wayne B. Hayes

Main category: cs.CV

TL;DR: UnrealPose-Gen is an Unreal Engine 5 pipeline for generating synthetic 3D human pose data, used to create UnrealPose-1M dataset with 1M frames including 3D joints, 2D projections, bounding boxes, and camera parameters.

Details

Motivation: Real-world 3D human pose datasets are expensive, studio-bound, and lack ground truth, while in-the-wild datasets don't have known ground truth. There's a need for diverse, accurately labeled synthetic data.

Method: Built UnrealPose-Gen pipeline using Unreal Engine 5 and Movie Render Queue for high-quality offline rendering. Generated comprehensive annotations including 3D joints (world/camera coordinates), 2D projections with occlusion flags, bounding boxes, and camera parameters.

Result: Created UnrealPose-1M dataset with ~1M frames across 8 sequences (5 coherent, 3 randomized), 5 scenes, ~140 actions total, 5 subjects, diverse camera trajectories. Validated with real-to-synthetic experiments on 4 tasks showing pipeline fidelity.

Conclusion: The UnrealPose-Gen pipeline enables generation of high-quality synthetic human pose data with accurate ground truth. Both the pipeline and UnrealPose-1M dataset are released to support third-party data generation and research.

Abstract: Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-style keypoints with occlusion and joint-visibility flags, (iii) person bounding boxes, and (iv) camera intrinsics and extrinsics. We use UnrealPose-Gen to present UnrealPose-1M, an approximately one million frame corpus comprising eight sequences: five scripted “coherent” sequences spanning five scenes, approximately 40 actions, and five subjects; and three randomized sequences across three scenes, approximately 100 actions, and five subjects, all captured from diverse camera trajectories for broad viewpoint coverage. As a fidelity check, we report real-to-synthetic results on four tasks: image-to-3D pose, 2D keypoint detection, 2D-to-3D lifting, and person detection/segmentation. Though time and resources constrain us from an unlimited dataset, we release the UnrealPose-1M dataset, as well as the UnrealPose-Gen pipeline to support third-party generation of human pose data.

[172] WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift

Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo

Main category: cs.CV

TL;DR: WildIng is a wildlife image representation model that improves generalization across geographical domains by integrating text descriptions with image features, addressing performance drops when models trained in one region are tested in another.

Details

Motivation: Current deep learning models for wildlife monitoring struggle with geographical domain shifts - they perform well on data from training regions but fail to generalize to new geographical areas due to sensitivity to background, lighting, and environmental variations.

Method: WildIng integrates text descriptions with image features to create more robust representations. By leveraging textual descriptions of species appearance, it captures consistent semantic information that improves generalization across different geographical locations.

Result: WildIng enhances accuracy of foundation models like BioCLIP by 30% under geographical domain shift conditions. It was evaluated on datasets from America and Africa, showing significant improvement over models that drop from 84.77% to 16.17% accuracy when tested across regions.

Conclusion: Integrating text descriptions with image features creates more robust wildlife representations that better handle geographical domain shifts, enabling more reliable automated wildlife monitoring across different regions.

Abstract: Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non-intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time-consuming and resource-intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision-language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image-based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations. Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at https://github.com/Julian075/CATALOG/tree/WildIng.

[173] DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models

Yue Zhou, Jue Chen, Zilun Zhang, Penghui Huang, Ran Ding, Zhentao Zou, PengFei Gao, Yuchen Wei, Ke Li, Xue Yang, Xue Jiang, Hongxin Yang, Jonathan Li

Main category: cs.CV

TL;DR: DVGBench is a new drone vision-language benchmark for implicit visual grounding tasks, with DroneVG-R1 model using Implicit-to-Explicit Chain-of-Thought to convert implicit references to explicit ones.

Details

Motivation: Existing remote sensing visual grounding datasets rely too heavily on explicit referring expressions (position, size, color), limiting performance on implicit tasks requiring domain-specific knowledge about drone scenarios.

Method: Created DVGBench dataset covering six drone application scenarios with both explicit and implicit queries. Developed DroneVG-R1 model integrating Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within reinforcement learning to convert implicit references to explicit ones.

Result: Evaluation shows mainstream models have substantial limitations in reasoning capabilities for implicit visual grounding tasks. The benchmark provides actionable insights for improving LVLM reasoning for drone-based agents.

Conclusion: DVGBench addresses the gap in implicit visual grounding for drone applications, and the I2E-CoT approach helps models leverage scene-specific expertise to reduce grounding difficulty for implicit references.

Abstract: Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions-such as relative position, relative size, and color cues-thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at https://github.com/zytx121/DVGBench

[174] Lightweight Channel Attention for Efficient CNNs

Prem Babu Kanaparthi, Tulasi Venkata Sri Varshini Padamata

Main category: cs.CV

TL;DR: Empirical study compares SE, ECA, and proposed LCA attention modules on ResNet18/MobileNetV2 for CIFAR-10, showing LCA achieves competitive accuracy with parameter efficiency.

Details

Motivation: The efficiency-accuracy trade-off of different channel attention designs in CNNs remains underexplored, despite attention mechanisms delivering performance improvements with minimal computational overhead.

Method: Proposes Lite Channel Attention (LCA) module using adaptive 1D convolutions with grouped operations to reduce parameters while preserving effective attention behavior. Compares SE, ECA, and LCA across ResNet18 and MobileNetV2 architectures on CIFAR-10.

Result: LCA achieves competitive accuracy: 94.68% on ResNet18 and 93.10% on MobileNetV2, matching ECA in parameter efficiency while maintaining favorable inference latency. Comprehensive benchmarks provided for FLOPs, parameters, and GPU latency.

Conclusion: LCA offers practical insights for deploying attention-enhanced CNNs in resource-constrained environments by balancing accuracy and efficiency through adaptive grouped convolutions.

Abstract: Attention mechanisms have become integral to modern convolutional neural networks (CNNs), delivering notable performance improvements with minimal computational overhead. However, the efficiency accuracy trade off of different channel attention designs remains underexplored. This work presents an empirical study comparing Squeeze and Excitation (SE), Efficient Channel Attention (ECA), and a proposed Lite Channel Attention (LCA) module across ResNet 18 and MobileNetV2 architectures on CIFAR 10. LCA employs adaptive one dimensional convolutions with grouped operations to reduce parameter usage while preserving effective attention behavior. Experimental results show that LCA achieves competitive accuracy, reaching 94.68 percent on ResNet 18 and 93.10 percent on MobileNetV2, while matching ECA in parameter efficiency and maintaining favorable inference latency. Comprehensive benchmarks including FLOPs, parameter counts, and GPU latency measurements are provided, offering practical insights for deploying attention enhanced CNNs in resource constrained environments.

[175] Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking

Shiao Wang, Xiao Wang, Haonan Zhao, Jiarui Xu, Bo Jiang, Lin Zhu, Xin Zhao, Yonghong Tian, Jin Tang

Main category: cs.CV

TL;DR: A novel RGB-Event tracking framework using frequency-domain early fusion and motion-guided spatial sparsification to efficiently exploit event camera advantages while reducing computational overhead.

Details

Motivation: Existing RGB-Event tracking methods fail to fully utilize event camera advantages like high dynamic range and motion sensitivity, while uniformly processing low-information regions leads to unnecessary computational overhead for backbone networks.

Method: 1) Frequency-domain early fusion: Transform RGB and event modalities to frequency domain via FFT, decouple amplitude/phase components, selectively fuse high-frequency event information through amplitude and phase attention. 2) Motion-guided spatial sparsification: Leverage event camera motion sensitivity to capture relationship between target motion cues and spatial probability distribution, filter low-information regions. 3) Feed sparse target-relevant features to backbone network for learning, with tracking head predicting final position.

Result: Extensive experiments on three RGB-Event tracking benchmarks (FE108, FELT, COESOT) demonstrate high performance and efficiency of the proposed method.

Conclusion: The proposed framework effectively addresses limitations of existing RGB-Event tracking approaches by exploiting event camera advantages through frequency-domain fusion and motion-guided sparsification, achieving both high performance and computational efficiency.

Abstract: Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvTracking

[176] ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval

Tien-Huy Nguyen, Huu-Loc Tran, Thanh Duc Ngo

Main category: cs.CV

TL;DR: ITSELF is an attention-guided framework for text-based person search that uses the model’s own attention to create an Attentive Bank of high-saliency tokens for implicit local alignment, avoiding shortcut learning and spurious correlations without extra supervision.

Details

Motivation: Previous methods for text-based person search suffer from shortcut learning, spurious correlations, and misalignment issues with local alignment approaches. Injecting prior knowledge can distort intra-modality structure. The authors found that encoder attention surfaces spatially precise evidence early in training, motivating an attention-guided approach.

Method: ITSELF consists of three main components: 1) Guided Representation with Attentive Bank (GRAB) converts model attention into an Attentive Bank of high-saliency tokens for local objectives, 2) Multi-Layer Attention for Robust Selection (MARS) aggregates attention across layers with diversity-aware top-k selection, and 3) Adaptive Token Scheduler (ATS) schedules retention budget from coarse to fine over training.

Result: Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming effectiveness and robustness without additional prior supervision.

Conclusion: The attention-guided framework ITSELF effectively addresses limitations of previous local alignment methods by leveraging the model’s own attention for implicit fine-grained correspondence learning, achieving superior performance in text-based person search tasks.

Abstract: Vision Language Models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduceITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks showstate-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision. Our project is publicly available at https://trhuuloc.github.io/itself

[177] Enhanced Leukemic Cell Classification Using Attention-Based CNN and Data Augmentation

Douglas Costa Braga, Daniel Oliveira Dantas

Main category: cs.CV

TL;DR: A reproducible deep learning pipeline for leukemic cell classification achieves 97.89% accuracy using attention-based CNN with EfficientNetV2-B3 and SE mechanisms, outperforming baselines with 89% fewer parameters than VGG16.

Details

Motivation: Acute lymphoblastic leukemia (ALL) is the most common childhood cancer requiring microscopic diagnosis, which suffers from inter-observer variability and time constraints, necessitating automated classification systems.

Method: Integrates attention-based CNN combining EfficientNetV2-B3 with Squeeze-and-Excitation mechanisms, uses comprehensive data augmentation, focal loss for class imbalance, and patient-wise data splitting for robust evaluation.

Result: Achieves 97.89% F1-score and accuracy on C-NMC 2019 dataset (12,528 images), with statistical validation through 100-iteration Monte Carlo experiments showing significant improvements (p < 0.001) over baselines, using 89% fewer parameters than VGG16.

Conclusion: Modern attention-based architectures improve leukemic cell classification while maintaining computational efficiency suitable for clinical deployment, with interpretable visualizations of diagnostically relevant cellular features.

Abstract: We present a reproducible deep learning pipeline for leukemic cell classification, focusing on system architecture, experimental robustness, and software design choices for medical image analysis. Acute lymphoblastic leukemia (ALL) is the most common childhood cancer, requiring expert microscopic diagnosis that suffers from inter-observer variability and time constraints. The proposed system integrates an attention-based convolutional neural network combining EfficientNetV2-B3 with Squeeze-and-Excitation mechanisms for automated ALL cell classification. Our approach employs comprehensive data augmentation, focal loss for class imbalance, and patient-wise data splitting to ensure robust and reproducible evaluation. On the C-NMC 2019 dataset (12,528 original images from 62 patients), the system achieves a 97.89% F1-score and 97.89% accuracy on the test set, with statistical validation through 100-iteration Monte Carlo experiments confirming significant improvements (p < 0.001) over baseline methods. The proposed pipeline outperforms existing approaches by up to 4.67% while using 89% fewer parameters than VGG16 (15.2M vs. 138M). The attention mechanism provides interpretable visualizations of diagnostically relevant cellular features, demonstrating that modern attention-based architectures can improve leukemic cell classification while maintaining computational efficiency suitable for clinical deployment.

[178] Mono3DV: Monocular 3D Object Detection with 3D-Aware Bipartite Matching and Variational Query DeNoising

Kiet Dang Vu, Trung Thai Tran, Kien Nguyen Do Trung, Duc Dung Nguyen

Main category: cs.CV

TL;DR: Mono3DV is a Transformer-based framework for monocular 3D object detection that addresses limitations of DETR-like architectures by incorporating 3D geometric information into bipartite matching and stabilizing training through novel denoising techniques.

Details

Motivation: DETR-like architectures for monocular 3D object detection suffer from excluding 3D attributes from bipartite matching due to the ill-posed nature of 3D estimation from monocular images. This causes instability during training and suppresses high-quality 3D predictions using only 2D matching criteria.

Method: Three key innovations: 1) 3D-Aware Bipartite Matching that incorporates 3D geometric information into matching cost, 2) 3D-DeNoising scheme to stabilize training when integrating 3D attributes, and 3) Variational Query DeNoising mechanism to overcome gradient vanishing issues of conventional denoising techniques.

Result: Achieves state-of-the-art results on the KITTI 3D object detection benchmark without using any external data.

Conclusion: Mono3DV successfully addresses the limitations of existing DETR-like architectures for monocular 3D detection by properly integrating 3D information into the matching process and stabilizing training, leading to superior performance.

Abstract: While DETR-like architectures have demonstrated significant potential for monocular 3D object detection, they are often hindered by a critical limitation: the exclusion of 3D attributes from the bipartite matching process. This exclusion arises from the inherent ill-posed nature of 3D estimation from monocular image, which introduces instability during training. Consequently, high-quality 3D predictions can be erroneously suppressed by 2D-only matching criteria, leading to suboptimal results. To address this, we propose Mono3DV, a novel Transformer-based framework. Our approach introduces three key innovations. First, we develop a 3D-Aware Bipartite Matching strategy that directly incorporates 3D geometric information into the matching cost, resolving the misalignment caused by purely 2D criteria. Second, it is important to stabilize the Bipartite Matching to resolve the instability occurring when integrating 3D attributes. Therefore, we propose 3D-DeNoising scheme in the training phase. Finally, recognizing the gradient vanishing issue associated with conventional denoising techniques, we propose a novel Variational Query DeNoising mechanism to overcome this limitation, which significantly enhances model performance. Without leveraging any external data, our method achieves state-of-the-art results on the KITTI 3D object detection benchmark.

[179] Evaluating transfer learning strategies for improving dairy cattle body weight prediction in small farms using depth-image and point-cloud data

Jin Wang, Angelo De Castro, Yuxi Zhang, Lucas Basolli Borsatto, Yuechen Guo, Victoria Bastos Primo, Ana Beatriz Montevecchio Bernardino, Gota Morota, Ricardo C Chebel, Haipeng Yu

Main category: cs.CV

TL;DR: Transfer learning from large farms significantly improves body weight prediction on small farms with limited data, outperforming single-source learning and matching joint learning performance. No consistent difference found between depth-image and point-cloud models.

Details

Motivation: Computer vision offers automated monitoring tools for dairy cattle, but transfer learning effectiveness and optimal fine-tuning strategies remain poorly understood in livestock applications. There's also limited direct comparison between depth-image and point-cloud modalities for body weight prediction.

Method: Collected top-view depth images and point-cloud data from 1,201, 215, and 58 cows at large, medium, and small farms. Evaluated four deep learning models: ConvNeXt and MobileViT for depth images, and PointNet and DGCNN for point clouds. Tested transfer learning from large farm to small farm under three experimental designs.

Result: Transfer learning markedly improved body weight prediction on the small farm across all four models, outperforming single-source learning and achieving gains comparable to or greater than joint learning. No consistent performance difference observed between depth-image- and point-cloud-based models.

Conclusion: Transfer learning is well-suited for small farm prediction scenarios where cross-farm data sharing is limited by privacy, logistical, or policy constraints, as it requires access only to pretrained model weights rather than raw data. Pretrained representations generalize well across farms with differing imaging conditions and cattle populations.

Abstract: Computer vision provides automated, non-invasive, and scalable tools for monitoring dairy cattle, thereby supporting management, health assessment, and phenotypic data collection. Although transfer learning is commonly used for predicting body weight from images, its effectiveness and optimal fine-tuning strategies remain poorly understood in livestock applications, particularly beyond the use of pretrained ImageNet or COCO weights. In addition, while both depth images and three-dimensional point-cloud data have been explored for body weight prediction, direct comparisons of these two modalities in dairy cattle are limited. Therefore, the objectives of this study were to 1) evaluate whether transfer learning from a large farm enhances body weight prediction on a small farm with limited data, and 2) compare the predictive performance of depth-image- and point-cloud-based approaches under three experimental designs. Top-view depth images and point-cloud data were collected from 1,201, 215, and 58 cows at large, medium, and small dairy farms, respectively. Four deep learning models were evaluated: ConvNeXt and MobileViT for depth images, and PointNet and DGCNN for point clouds. Transfer learning markedly improved body weight prediction on the small farm across all four models, outperforming single-source learning and achieving gains comparable to or greater than joint learning. These results indicate that pretrained representations generalize well across farms with differing imaging conditions and dairy cattle populations. No consistent performance difference was observed between depth-image- and point-cloud-based models. Overall, these findings suggest that transfer learning is well suited for small farm prediction scenarios where cross-farm data sharing is limited by privacy, logistical, or policy constraints, as it requires access only to pretrained model weights rather than raw data.

[180] EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Shuo Yang, Zheng Liu, Bo Zhao

Main category: cs.CV

TL;DR: EgoGrasp: First method to reconstruct world-space hand-object interactions from egocentric monocular videos with dynamic cameras in the wild.

Details

Motivation: Accurate world-space hand-object interaction reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and VR. Existing methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints, and suffer under severe camera motion and occlusions in egocentric videos.

Method: Multi-stage framework with: 1) Robust pre-process pipeline built on newly developed spatial intelligence models, 2) Whole-body HOI prior model based on decoupled diffusion models (template-free and scalable to multiple objects), 3) Multi-objective test-time optimization paradigm.

Result: Achieves state-of-the-art performance in world-space hand-object interaction reconstruction.

Conclusion: EgoGrasp successfully addresses the challenges of reconstructing world-space hand-object interactions from egocentric monocular videos with dynamic cameras, overcoming limitations of previous methods through its multi-stage framework and novel components.

Abstract: We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interactions (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust pre-process pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. In experiments, we prove our method achieving state-of-the-art performance in W-HOI reconstruction.

[181] Enhancing Histopathological Image Classification via Integrated HOG and Deep Features with Robust Noise Performance

Ifeanyi Ezuma, Ugochukwu Ugwu

Main category: cs.CV

TL;DR: Fine-tuned InceptionResNet-v2 achieves 96.01% accuracy on LC25000 histopathology dataset, but neural networks using extracted deep features reach near-perfect 99.84% accuracy and 99.99% AUC, showing superior performance and noise resilience.

Details

Motivation: Digital pathology advancements require automated image analysis for clinical practice, necessitating evaluation of machine learning and deep learning models for histopathological image classification.

Method: Used fine-tuned InceptionResNet-v2 network as both classifier and feature extractor on LC25000 dataset (5 classes). Evaluated models trained on deep features, compared performance, and tested robustness under varying SNR conditions.

Result: Fine-tuned InceptionResNet-v2 achieved 96.01% accuracy and 96.8% average AUC. Neural Network using deep features reached 99.84% accuracy and 99.99% AUC. Deep feature models showed greater noise resilience (GBM and KNN performed best), while HOG+deep feature combination improved performance but degraded in noisy environments.

Conclusion: Deep feature extraction from pre-trained networks significantly outperforms direct classification, achieving near-perfect accuracy on histopathological images. Models using deep features demonstrate superior robustness to noise, making them promising for clinical digital pathology applications.

Abstract: The era of digital pathology has advanced histopathological examinations, making automated image analysis essential in clinical practice. This study evaluates the classification performance of machine learning and deep learning models on the LC25000 dataset, which includes five classes of histopathological images. We used the fine-tuned InceptionResNet-v2 network both as a classifier and for feature extraction. Our results show that the fine-tuned InceptionResNet-v2 achieved a classification accuracy of 96.01% and an average AUC of 96.8%. Models trained on deep features from InceptionResNet-v2 outperformed those using only the pre-trained network, with the Neural Network model achieving an AUC of 99.99% and accuracy of 99.84%. Evaluating model robustness under varying SNR conditions revealed that models using deep features exhibited greater resilience, particularly GBM and KNN. The combination of HOG and deep features showed enhanced performance, however, less so in noisy environments.

[182] Luminark: Training-free, Probabilistically-Certified Watermarking for General Vision Generative Models

Jiayi Xu, Zhang Zhang, Yuanrui Zhang, Ruitao Chen, Yixian Xu, Tianyu He, Di He

Main category: cs.CV

TL;DR: Luminark is a training-free, probabilistically-certified watermarking method for vision generative models that uses patch-level luminance statistics for watermark detection and injection via guidance techniques.

Details

Motivation: There is a need for watermarking methods for vision generative models that are training-free, provide certified detection guarantees, maintain image quality, and work across different generative model paradigms without requiring model retraining.

Method: The method uses patch-level luminance statistics where a binary pattern with corresponding patch-level thresholds is predefined. Watermark detection checks if patch luminance surpasses thresholds and matches the target pattern. Watermark injection uses guidance techniques as a plug-and-play mechanism called “watermark guidance” that works across different generative models.

Result: The method achieves high detection accuracy, strong robustness against common image transformations, and good visual quality across nine different generative models spanning diffusion, autoregressive, and hybrid frameworks. Statistical analysis shows effective control of false positive rates.

Conclusion: Luminark provides a general, training-free watermarking solution for vision generative models with certified detection guarantees, maintaining image quality while working across diverse model architectures through its guidance-based injection approach.

Abstract: In this paper, we introduce \emph{Luminark}, a training-free and probabilistically-certified watermarking method for general vision generative models. Our approach is built upon a novel watermark definition that leverages patch-level luminance statistics. Specifically, the service provider predefines a binary pattern together with corresponding patch-level thresholds. To detect a watermark in a given image, we evaluate whether the luminance of each patch surpasses its threshold and then verify whether the resulting binary pattern aligns with the target one. A simple statistical analysis demonstrates that the false positive rate of the proposed method can be effectively controlled, thereby ensuring certified detection. To enable seamless watermark injection across different paradigms, we leverage the widely adopted guidance technique as a plug-and-play mechanism and develop the \emph{watermark guidance}. This design enables Luminark to achieve generality across state-of-the-art generative models without compromising image quality. Empirically, we evaluate our approach on nine models spanning diffusion, autoregressive, and hybrid frameworks. Across all evaluations, Luminark consistently demonstrates high detection accuracy, strong robustness against common image transformations, and good performance on visual quality.

[183] 600k-ks-ocr: a large-scale synthetic dataset for optical character recognition in kashmiri script

Haq Nawaz Malik

Main category: cs.CV

TL;DR: A 600K synthetic OCR dataset for Kashmiri script with word-level segmented images, addressing resource scarcity for this endangered language.

Details

Motivation: Address the critical resource gap for Kashmiri, an endangered Dardic language with ~7 million speakers, which lacks adequate OCR training data despite using a modified Perso-Arabic writing system.

Method: Synthetic generation of ~602,000 word-level images (256x64 pixels) using three traditional Kashmiri typefaces, with comprehensive data augmentation simulating document degradation and diverse background textures for robustness.

Result: Created the 600K-KS-OCR Dataset distributed across ten partitioned archives (~10.6 GB total) with ground-truth transcriptions in multiple formats compatible with CRNN, TrOCR, and general ML pipelines.

Conclusion: The dataset released under CC-BY-4.0 license facilitates research in low-resource language OCR, providing a valuable resource for Kashmiri script recognition and preservation efforts.

Abstract: This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language utilizing a modified Perso-Arabic writing system spoken by approximately seven million people. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and generalpurpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.

[184] NarrativeTrack: Evaluating Video Language Models Beyond the Frame

Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty

Main category: cs.CV

TL;DR: NarrativeTrack is the first benchmark for evaluating multimodal LLMs’ narrative understanding in videos through fine-grained entity-centric reasoning, revealing a trade-off between perceptual grounding and temporal coherence.

Details

Motivation: Current MLLMs show impressive vision-language reasoning but lack ability to understand temporally unfolding narratives in videos, which requires grounding entities (who, what, when, where) across dynamic visual and temporal contexts.

Method: Introduced NarrativeTrack benchmark with Compositional Reasoning Progression (CRP) framework that progressively evaluates narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. Uses automated entity-centric pipeline for scalable extraction of temporally grounded entity representations.

Result: MLLMs fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs show strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context but hallucinate entity contexts.

Conclusion: Reveals fundamental trade-off between perceptual grounding and temporal reasoning; narrative understanding emerges only from their integration. NarrativeTrack provides first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.

Abstract: Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entity’s contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.

[185] Evolving CNN Architectures: From Custom Designs to Deep Residual Models for Diverse Image Classification and Detection Tasks

Mahmudul Hasan, Mabsur Fatin Bin Hossain

Main category: cs.CV

TL;DR: Custom CNN vs pretrained/transfer learning models compared across 5 real-world image datasets, showing deeper networks excel at fine-grained multiclass tasks while lightweight models work well for binary classification.

Details

Motivation: To provide practical guidance for selecting appropriate CNN architectures based on task complexity and resource constraints by systematically comparing custom CNN designs with pretrained and transfer learning models across diverse real-world image classification and detection scenarios.

Method: Comparative study using a custom CNN architecture against widely used pretrained and transfer learning CNN models across five real-world image datasets spanning binary classification, fine-grained multiclass recognition, and object detection. Analysis of architectural factors like network depth, residual connections, and feature extraction strategies.

Result: Deeper CNN architectures provide substantial performance gains on fine-grained multiclass datasets, while lightweight pretrained and transfer learning models remain highly effective for simpler binary classification tasks. The custom architecture was successfully extended to object detection, demonstrating adaptability in identifying unauthorized auto-rickshaws in traffic scenes.

Conclusion: The study provides practical guidance for selecting suitable network designs based on task complexity and resource constraints, showing that architectural choices should be tailored to specific problem requirements rather than using one-size-fits-all approaches.

Abstract: This paper presents a comparative study of a custom convolutional neural network (CNN) architecture against widely used pretrained and transfer learning CNN models across five real-world image datasets. The datasets span binary classification, fine-grained multiclass recognition, and object detection scenarios. We analyze how architectural factors, such as network depth, residual connections, and feature extraction strategies, influence classification and localization performance. The results show that deeper CNN architectures provide substantial performance gains on fine-grained multiclass datasets, while lightweight pretrained and transfer learning models remain highly effective for simpler binary classification tasks. Additionally, we extend the proposed architecture to an object detection setting, demonstrating its adaptability in identifying unauthorized auto-rickshaws in real-world traffic scenes. Building upon a systematic analysis of custom CNN architectures alongside pretrained and transfer learning models, this study provides practical guidance for selecting suitable network designs based on task complexity and resource constraints.

[186] Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation

Tianheng Cheng, Xinggang Wang, Junchao Liao, Wenyu Liu

Main category: cs.CV

TL;DR: GAIN introduces Guided Attentive Interpolation (GAI) for efficient semantic segmentation by adaptively interpolating high-resolution features with semantic guidance, achieving state-of-the-art speed-accuracy trade-off.

Details

Motivation: Current interpolation methods (e.g., bilinear) for generating high-resolution features suffer from feature misalignment and insufficient context, while enriching semantics requires heavy computation that hinders low-latency inference.

Method: Proposes Guided Attentive Interpolation (GAI) that determines both spatial and semantic relations between pixels from multi-resolution features, then uses these relations to interpolate high-resolution features with rich semantics. Can be integrated with any deep convolutional network.

Result: GAIN achieves 78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 FPS on CamVid using NVIDIA 1080Ti GPU, setting new state-of-the-art for low-latency semantic segmentation.

Conclusion: GAI effectively addresses feature misalignment and semantic insufficiency in high-resolution feature generation while maintaining computational efficiency, enabling practical low-latency semantic segmentation with state-of-the-art performance.

Abstract: Semantic segmentation is a fundamental problem in computer vision and it requires high-resolution feature maps for dense prediction. Current coordinate-guided low-resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high-resolution features which suffer from feature misalignment and insufficient context information. Moreover, enriching semantics to high-resolution features requires a high computation burden, so that it is challenging to meet the requirement of lowlatency inference. We propose a novel Guided Attentive Interpolation (GAI) method to adaptively interpolate fine-grained high-resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics. GAI can be integrated with any deep convolutional network for efficient semantic segmentation. In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, can achieve78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 on CamVid using an NVIDIA 1080Ti GPU, which are the new state-of-the-art results of low-latency semantic segmentation. Code and models are available at: https://github.com/hustvl/simpleseg.

Andrés Bell-Navas, Jesús Garicano-Mena, Antonella Ausiello, Soledad Le Clainche, María Villalba-Orero, Enrique Lara-Pezzi

Main category: cs.CV

TL;DR: CardioMOD-Net: AI framework using echocardiography videos for multiclass HFpEF diagnosis and continuous prediction of disease onset in preclinical mouse models.

Details

Motivation: HFpEF is difficult to diagnose early due to diverse comorbidities and prolonged subclinical stages. Current AI models only do binary detection without comorbidity phenotyping or temporal progression estimates.

Method: Used mouse echocardiography videos from four groups (control, hyperglycemic, obesity, hypertension). Applied Higher Order Dynamic Mode Decomposition to extract temporal features, then used Vision Transformers for classification (diagnosis) and regression (predicting age at HFpEF onset).

Result: 65% overall diagnostic accuracy across four groups, with all classes >50% accuracy. Prognostic module achieved RMSE of 21.72 weeks for time-to-HFpEF prediction, with obesity and hypertension models showing most accurate estimates.

Conclusion: Unified framework demonstrates multiclass phenotyping and continuous HFpEF onset prediction from single cine loop, even with small data. Provides foundation for integrated diagnostic/prognostic modeling in preclinical HFpEF research.

Abstract: Introduction: Heart failure with preserved ejection fraction (HFpEF) arises from diverse comorbidities and progresses through prolonged subclinical stages, making early diagnosis and prognosis difficult. Current echocardiography-based Artificial Intelligence (AI) models focus primarily on binary HFpEF detection in humans and do not provide comorbidity-specific phenotyping or temporal estimates of disease progression towards decompensation. We aimed to develop a unified AI framework, CardioMOD-Net, to perform multiclass diagnosis and continuous prediction of HFpEF onset directly from standard echocardiography cine loops in preclinical models. Methods: Mouse echocardiography videos from four groups were used: control (CTL), hyperglycaemic (HG), obesity (OB), and systemic arterial hypertension (SAH). Two-dimensional parasternal long-axis cine loops were decomposed using Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features for downstream analysis. A shared latent representation supported Vision Transformers, one for a classifier for diagnosis and another for a regression module for predicting the age at HFpEF onset. Results: Overall diagnostic accuracy across the four groups was 65%, with all classes exceeding 50% accuracy. Misclassifications primarily reflected early-stage overlap between OB or SAH and CTL. The prognostic module achieved a root-mean-square error of 21.72 weeks for time-to-HFpEF prediction, with OB and SAH showing the most accurate estimates. Predicted HFpEF onset closely matched true distributions in all groups. Discussion: This unified framework demonstrates that multiclass phenotyping and continuous HFpEF onset prediction can be obtained from a single cine loop, even under small-data conditions. The approach offers a foundation for integrating diagnostic and prognostic modelling in preclinical HFpEF research.

[188] GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation

Chenglizhao Chen, Shaojiang Yuan, Xiaoxue Lu, Mengke Song, Jia Song, Zhenyu Wu, Wenfeng Song, Shuai Li

Main category: cs.CV

TL;DR: GenCAMO introduces a generative framework to create synthetic camouflage datasets with dense annotations, addressing data scarcity in camouflage object detection and segmentation tasks.

Details

Motivation: High-quality camouflage datasets with dense annotations are scarce due to expensive collection and labeling costs, limiting progress in camouflage dense prediction tasks like RGB-D camouflage object detection and open-vocabulary camouflage object segmentation.

Method: Proposes GenCAMO, an environment-aware and mask-free generative framework that produces high-fidelity camouflage images with dense annotations. Also introduces GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations including depth maps, scene graphs, attribute descriptions, and text prompts.

Result: Extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data.

Conclusion: The generative approach effectively addresses data scarcity in camouflage analysis, enabling better training of CDP models with fine-grained representations, prior knowledge, and auxiliary reasoning. The code and datasets will be released publicly.

Abstract: Conceal dense prediction (CDP), especially RGB-D camouflage object detection and open-vocabulary camouflage object segmentation, plays a crucial role in advancing the understanding and reasoning of complex camouflage scenes. However, high-quality and large-scale camouflage datasets with dense annotation remain scarce due to expensive data collection and labeling costs. To address this challenge, we explore leveraging generative models to synthesize realistic camouflage image-dense data for training CDP models with fine-grained representations, prior knowledge, and auxiliary reasoning. Concretely, our contributions are threefold: (i) we introduce GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations, including depth maps, scene graphs, attribute descriptions, and text prompts; (ii) we present GenCAMO, an environment-aware and mask-free generative framework that produces high-fidelity camouflage image-dense annotations; (iii) extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data. The code and datasets will be released after paper acceptance.

Hao Lu, Xuhui Zhu, Wenjing Zhang, Yanan Li, Xiang Bai

Main category: cs.CV

TL;DR: OMAN++ is a novel Video Individual Counting (VIC) method that improves performance in crowded scenes by introducing one-to-many matching and displacement priors, validated on a new challenging dataset WuhanMetroCrowd.

Details

Motivation: Existing VIC approaches underperform in congested scenes like metro commuting. The paper addresses this limitation by recognizing that VIC is fundamentally a correspondence problem that requires better handling of crowded scenarios.

Method: Proposes OMAN++ with two key innovations: 1) Relaxes standard one-to-one matching to one-to-many matching using social grouping prior, implemented via implicit context generator and O2M matcher; 2) Uses spatial-temporal displacement prior via displacement prior injector to strengthen matching, feature extraction, and training.

Result: OMAN++ outperforms state-of-the-art VIC baselines on SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, and achieves 38.12% error reduction on the new WuhanMetroCrowd dataset, showing clear advantage in crowded scenes.

Conclusion: The paper introduces a novel VIC baseline OMAN++ that effectively handles crowded scenes by leveraging social grouping and displacement priors, with validation on a new challenging dataset demonstrating significant performance improvements.

Abstract: Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior that indicates pedestrians tend to gather in groups and ii) the spatial-temporal displacement prior that informs an individual cannot teleport physically. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and a O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also indicates a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.

[190] RefSR-Adv: Adversarial Attack on Reference-based Image Super-Resolution Models

Jiazhu Dai, Huihui Jiang

Main category: cs.CV

TL;DR: RefSR-Adv: An adversarial attack that degrades super-resolution outputs by perturbing only the reference image, revealing security vulnerabilities in Reference-based Super-Resolution systems.

Details

Motivation: Existing research focuses on backdoor attacks for RefSR, but the vulnerability to adversarial attacks hasn't been explored. The paper aims to fill this research gap by investigating adversarial attacks targeting RefSR systems.

Method: Proposes RefSR-Adv, an adversarial attack that perturbs only the reference image to maximize the difference between adversarial and clean outputs. The attack is tested across CNN, Transformer, and Mamba architectures on CUFED5, WR-SR, and DRefSR datasets.

Result: RefSR-Adv induces significant performance degradation and severe artifacts across all tested architectures. Experiments show a positive correlation between attack effectiveness and similarity between low-resolution input and reference image, revealing that models’ over-reliance on reference features is a key security flaw.

Conclusion: This study reveals a security vulnerability in RefSR systems where adversarial perturbations to reference images can severely degrade outputs. The research aims to urge attention to RefSR robustness and highlights the need for more secure reference-based super-resolution approaches.

Abstract: Single Image Super-Resolution (SISR) aims to recover high-resolution images from low-resolution inputs. Unlike SISR, Reference-based Super-Resolution (RefSR) leverages an additional high-resolution reference image to facilitate the recovery of high-frequency textures. However, existing research mainly focuses on backdoor attacks targeting RefSR, while the vulnerability of the adversarial attacks targeting RefSR has not been fully explored. To fill this research gap, we propose RefSR-Adv, an adversarial attack that degrades SR outputs by perturbing only the reference image. By maximizing the difference between adversarial and clean outputs, RefSR-Adv induces significant performance degradation and generates severe artifacts across CNN, Transformer, and Mamba architectures on the CUFED5, WR-SR, and DRefSR datasets. Importantly, experiments confirm a positive correlation between the similarity of the low-resolution input and the reference image and attack effectiveness, revealing that the model’s over-reliance on reference features is a key security flaw. This study reveals a security vulnerability in RefSR systems, aiming to urge researchers to pay attention to the robustness of RefSR.

[191] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong

Main category: cs.CV

TL;DR: XStreamVGGT compresses KV cache through pruning and quantization for memory-efficient streaming 3D reconstruction without performance loss.

Details

Motivation: StreamVGGT suffers from unbounded KV cache growth causing escalating memory consumption and inference latency as frames accumulate, limiting practical streaming applications.

Method: Joint pruning and quantization of KV cache: pruning redundant KVs from multi-view inputs via token importance identification, and quantizing KV tensors based on their unique distributions.

Result: Achieves mostly negligible performance degradation while reducing memory usage by 4.42× and accelerating inference by 5.48× compared to StreamVGGT.

Conclusion: XStreamVGGT enables scalable and practical streaming 3D applications through tuning-free KV cache compression, making transformer-based 3D reconstruction more memory-efficient.

Abstract: Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematically compresses the KV cache through joint pruning and quantization, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs originating from multi-view inputs are pruned through efficient token importance identification, enabling a fixed memory budget. Leveraging the unique distribution of KV tensors, we incorporate KV quantization to further reduce memory consumption. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling scalable and practical streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.

[192] Real-Time LiDAR Point Cloud Densification for Low-Latency Spatial Data Transmission

Kazuhiko Murasaki, Shunsuke Konagai, Masakatsu Aoki, Taiga Yoshida, Ryuichi Tanida

Main category: cs.CV

TL;DR: High-speed LiDAR point cloud densification method for real-time dense 3D scene generation using joint bilateral filtering with CNN, achieving 30 fps HD depth maps.

Details

Motivation: Need for low-latency spatial transmission in immersive telepresence requires dense 3D scene capture and real-time processing, but LiDAR produces sparse point clouds that need densification.

Method: Combines multiple LiDAR inputs with high-resolution color images using joint bilateral filtering strategy implemented through convolutional neural network architecture.

Result: Method produces dense depth maps at full HD resolution in real time (30 fps), 15x faster than recent training-based approaches, with accurate geometry and no artifacts.

Conclusion: Proposed approach enables high-speed LiDAR point cloud densification for real-time immersive telepresence applications with minimal latency and high-quality results.

Abstract: To realize low-latency spatial transmission system for immersive telepresence, there are two major problems: capturing dynamic 3D scene densely and processing them in real time. LiDAR sensors capture 3D in real time, but produce sparce point clouds. Therefore, this paper presents a high-speed LiDAR point cloud densification method to generate dense 3D scene with minimal latency, addressing the need for on-the-fly depth completion while maintaining real-time performance. Our approach combines multiple LiDAR inputs with high-resolution color images and applies a joint bilateral filtering strategy implemented through a convolutional neural network architecture. Experiments demonstrate that the proposed method produces dense depth maps at full HD resolution in real time (30 fps), which is over 15x faster than a recent training-based depth completion approach. The resulting dense point clouds exhibit accurate geometry without multiview inconsistencies or ghosting artifacts.

[193] Promptable Foundation Models for SAR Remote Sensing: Adapting the Segment Anything Model for Snow Avalanche Segmentation

Riccardo Gelato, Carlo Sgaravatti, Jakob Grahn, Giacomo Boracchi, Filippo Maria Bianchi

Main category: cs.CV

TL;DR: Adapting Segment Anything Model (SAM) to Sentinel-1 SAR data for faster avalanche segmentation and mapping annotation.

Details

Motivation: Current avalanche mapping using SAR imagery requires time-consuming expert annotations. Need to accelerate annotation process for risk forecasting and mitigation in mountain regions.

Method: Adapt SAM foundation model to SAR domain using: (1) adapters to mitigate domain gap, (2) multiple encoders for multi-channel SAR inputs, (3) prompt-engineering strategies for better localization, (4) efficient training algorithm limiting encoder training time.

Result: Developed model integrated into annotation tool that speeds up SAR image annotation for avalanche mapping.

Conclusion: Successfully adapted SAM to SAR domain, overcoming domain mismatch, input constraints, prompt sensitivity, and training efficiency challenges for practical avalanche annotation.

Abstract: Remote sensing solutions for avalanche segmentation and mapping are key to supporting risk forecasting and mitigation in mountain regions. Synthetic Aperture Radar (SAR) imagery from Sentinel-1 can be effectively used for this task, but training an effective detection model requires gathering a large dataset with high-quality annotations from domain experts, which is prohibitively time-consuming. In this work, we aim to facilitate and accelerate the annotation of SAR images for avalanche mapping. We build on the Segment Anything Model (SAM), a segmentation foundation model trained on natural images, and tailor it to Sentinel-1 SAR data. Adapting SAM to our use-case requires addressing several domain-specific challenges: (i) domain mismatch, since SAM was not trained on satellite/SAR imagery; (ii) input adaptation, because SAR products typically provide more than three channels, while SAM is constrained to RGB images; (iii) robustness to imprecise prompts that can affect target identification and degrade the segmentation quality, an issue exacerbated in small, low-contrast avalanches; and (iv) training efficiency, since standard fine-tuning is computationally demanding for SAM. We tackle these challenges through a combination of adapters to mitigate the domain gap, multiple encoders to handle multi-channel SAR inputs, prompt-engineering strategies to improve avalanche localization accuracy, and a training algorithm that limits the training time of the encoder, which is recognized as the major bottleneck. We integrate the resulting model into an annotation tool and show experimentally that it speeds up the annotation of SAR images.

[194] UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

Mengfei Li, Peng Li, Zheng Zhang, Jiahao Lu, Chengfeng Zhao, Wei Xue, Qifeng Liu, Sida Peng, Wenxiao Zhang, Wenhan Luo, Yuan Liu, Yike Guo

Main category: cs.CV

TL;DR: UniSH is a unified feed-forward framework for joint metric-scale 3D scene and human reconstruction that addresses sim-to-real domain gaps through innovative training with unlabeled in-the-wild data.

Details

Motivation: The key challenge is the scarcity of large-scale annotated real-world data, forcing reliance on synthetic datasets which introduces significant sim-to-real domain gaps, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos.

Method: Proposes a training paradigm leveraging unlabeled in-the-wild data with two core components: (1) robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) two-stage supervision scheme that first learns coarse localization on synthetic data, then fine-tunes on real data by optimizing geometric correspondence between SMPL mesh and human point cloud.

Result: Achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods.

Conclusion: UniSH enables joint recovery of high-fidelity scene geometry, human point clouds, camera parameters, and coherent metric-scale SMPL bodies in a single forward pass, effectively bridging the sim-to-real domain gap.

Abstract: We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods. Project page: https://murphylmf.github.io/UniSH/

[195] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

Main category: cs.CV

TL;DR: CODA improves Slot Attention with diffusion models by adding register slots to reduce interference and using contrastive alignment to strengthen slot-image correspondence, achieving better object discovery and generation.

Details

Motivation: Slot Attention with pretrained diffusion models suffers from slot entanglement and weak alignment between object slots and image content, limiting its effectiveness for object-centric learning in complex scenes.

Method: CODA introduces two key components: (1) register slots that absorb residual attention to reduce interference between object slots, and (2) a contrastive alignment loss to explicitly encourage slot-image correspondence, serving as a tractable surrogate for maximizing mutual information between slots and inputs.

Result: CODA improves object discovery (+6.1% FG-ARI on COCO), property prediction, and compositional image generation on both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO) over strong baselines, with register slots adding negligible overhead.

Conclusion: CODA represents an effective framework for robust object-centric learning in complex, real-world scenes, demonstrating potential applications through improved slot representation quality and efficient scalability.

Abstract: Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate potential applications of CODA as an effective framework for robust OCL in complex, real-world scenes.

[196] HyDRA: Hybrid Denoising Regularization for Measurement-Only DEQ Training

Markus Haltmeier, Lukas Neumann, Nadja Gruber, Johannes Schwab, Gyeongha Hwang

Main category: cs.CV

TL;DR: HyDRA enables Deep Equilibrium model training using only measurement data (no ground truth pairs), combining measurement consistency with adaptive denoising regularization and data-driven early stopping.

Details

Motivation: Traditional DEQ models require supervised pairs (x,y) which are often unavailable in practice. Many real-world settings only have measurements y available, creating a need for measurement-only training frameworks.

Method: HyDRA combines measurement consistency with an adaptive denoising regularization term and uses a data-driven early stopping criterion. It’s designed for training DEQ models without requiring ground truth image pairs.

Result: Experiments on sparse-view CT show competitive reconstruction quality and fast inference compared to supervised methods.

Conclusion: HyDRA provides an effective measurement-only training framework for DEQ models that addresses the practical challenge of lacking supervised data while maintaining reconstruction quality.

Abstract: Solving image reconstruction problems of the form (\mathbf{A} \mathbf{x} = \mathbf{y}) remains challenging due to ill-posedness and the lack of large-scale supervised datasets. Deep Equilibrium (DEQ) models have been used successfully but typically require supervised pairs ((\mathbf{x},\mathbf{y})). In many practical settings, only measurements (\mathbf{y}) are available. We introduce HyDRA (Hybrid Denoising Regularization Adaptation), a measurement-only framework for DEQ training that combines measurement consistency with an adaptive denoising regularization term, together with a data-driven early stopping criterion. Experiments on sparse-view CT demonstrate competitive reconstruction quality and fast inference.

[197] RFAssigner: A Generic Label Assignment Strategy for Dense Object Detection

Ziqian Guan, Xieyi Fu, Yuting Wang, Haowen Xiao, Jiarui Zhu, Yingying Zhu, Yongtao Liu, Lin Gu

Main category: cs.CV

TL;DR: RFAssigner is a novel label assignment strategy for dense object detectors that addresses scale imbalance by adaptively selecting supplementary positive samples for small objects using Gaussian Receptive Field similarity.

Details

Motivation: Current label assignment methods in dense object detectors often assign insufficient positive samples to small objects, creating scale imbalance during training that hinders multi-scale learning capabilities.

Method: RFAssigner first establishes initial positive samples using point-based prior, then measures similarity between unassigned candidate locations and ground-truth objects using Gaussian Receptive Field (GRF) distance, and adaptively selects supplementary positive samples from unassigned pool.

Result: Comprehensive experiments on three datasets with distinct object scale distributions show RFAssigner achieves state-of-the-art performance across all object scales. A single FCOS-ResNet-50 detector with RFAssigner consistently outperforms existing strategies without needing auxiliary modules or heuristics.

Conclusion: RFAssigner effectively addresses scale imbalance in dense object detectors by enhancing multi-scale learning capabilities through adaptive positive sample selection based on Gaussian Receptive Field similarity, demonstrating strong generalizability across different datasets.

Abstract: Label assignment is a critical component in training dense object detectors. State-of-the-art methods typically assign each training sample a positive and a negative weight, optimizing the assignment scheme during training. However, these strategies often assign an insufficient number of positive samples to small objects, leading to a scale imbalance during training. To address this limitation, we introduce RFAssigner, a novel assignment strategy designed to enhance the multi-scale learning capabilities of dense detectors. RFAssigner first establishes an initial set of positive samples using a point-based prior. It then leverages a Gaussian Receptive Field (GRF) distance to measure the similarity between the GRFs of unassigned candidate locations and the ground-truth objects. Based on this metric, RFAssigner adaptively selects supplementary positive samples from the unassigned pool, promoting a more balanced learning process across object scales. Comprehensive experiments on three datasets with distinct object scale distributions validate the effectiveness and generalizability of our method. Notably, a single FCOS-ResNet-50 detector equipped with RFAssigner achieves state-of-the-art performance across all object scales, consistently outperforming existing strategies without requiring auxiliary modules or heuristics.

[198] MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance

Hamad Khan, Saddam Hussain Khan

Main category: cs.CV

TL;DR: Proposes MambaFormer, an LLM-based hybrid MoE framework combining Transformer and State Space Model experts for efficient medical QA, achieving 24.4× speedup over T5-Large with high accuracy.

Details

Motivation: Address the computational cost vs. efficiency trade-off in deploying LLMs for clinical applications, enabling resource-constrained clinical deployment.

Method: Hybrid MoE framework with lightweight gating mechanism that dynamically routes tokens to customized Transformer expert (ET5) for short complex queries or State Space Model expert (EMamba) for long sequences. Uses utility-guided multi-objective loss for joint optimization.

Result: Outperforms SOTA with BERTScore 0.9180 and ultra-low latency (0.077s), achieving 24.4× speedup over T5-Large on DentalQA and PubMedQA datasets.

Conclusion: MambaFormer establishes a scalable solution for efficient medical QA in resource-constrained clinical settings through intelligent expert routing and Pareto-optimal trade-off between latency and accuracy.

Abstract: The deployment of large language models (LLMs) in real-world clinical applications is constrained by the fundamental trade-off between computational cost and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. The MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing to a customized Transformer expert (ET5) for short, complex queries or to a State Space Model expert (EMamba) for long, high-throughput sequences. The customized EMamba and ET5 models are tailored to accommodate input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, intelligent routing decisions are driven by the contextual complexity of token embeddings, normalized sequence length, and domain-aware features, thereby enforcing a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes decisions, router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is cross-validated (holdout) for medical QA on the new, custom-designed DentalQA and PubMedQA datasets and compared with state-of-the-art techniques. The proposed MambaFormer outperforms (BERTScore = 0.9180) with ultra-low latency (0.077 s), delivering a 24.4 speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.

[199] AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures

Sifatullah Sheikh Urmi, Kirtonia Nuzath Tabassum Arthi, Md Al-Imran

Main category: cs.CV

TL;DR: AI models (3 CNNs + 1 Vision Transformer) evaluated for deepfake detection; VFDNET with MobileNetV3 showed best accuracy and efficiency.

Details

Motivation: The increasing prevalence of AI-generated deepfakes poses significant challenges to maintaining digital authenticity and trust in visual media.

Method: Evaluated four AI-based models (three CNNs and one Vision Transformer) using large face image datasets with data preprocessing and augmentation techniques.

Result: VFDNET with MobileNetV3 demonstrated superior accuracy and efficient performance across different scenarios.

Conclusion: AI-based approaches, particularly VFDNET with MobileNetV3, show strong capabilities for dependable deepfake detection, addressing the growing challenge of digital authenticity.

Abstract: The increasing use of artificial intelligence generated deepfakes creates major challenges in maintaining digital authenticity. Four AI-based models, consisting of three CNNs and one Vision Transformer, were evaluated using large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET demonstrated superior accuracy with MobileNetV3, showing efficient performance, thereby demonstrating AI’s capabilities for dependable deepfake detection.

[200] S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss

Md. Sanaullah Chowdhury Lameya Sabrin

Main category: cs.CV

TL;DR: S2M-Net is a lightweight medical image segmentation model that achieves global context with O(HW log HW) complexity using spectral token mixing and adaptive loss functions, outperforming transformers with 3.5-6x fewer parameters.

Details

Motivation: Medical image segmentation faces a trilemma: need for local precision (boundary accuracy), global context (anatomical coherence), and computational efficiency (limited data/hardware). Convolutional networks have limited receptive fields, while transformers have quadratic computational cost causing overfitting on small clinical datasets.

Method: Two key innovations: (1) Spectral-Selective Token Mixer (SSTM) uses truncated 2D FFT with learnable frequency filtering and content-gated spatial projection for O(HW log HW) global context; (2) Morphology-Aware Adaptive Segmentation Loss (MASL) automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights.

Result: State-of-the-art performance across 16 medical imaging datasets spanning 8 modalities: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over prior art), 80.90% on brain tumors, with consistent 3-18% improvements over specialized baselines using only 4.7M parameters (3.5-6x fewer than transformer methods).

Conclusion: S2M-Net resolves the medical segmentation trilemma by providing global context at near-linear cost, eliminating manual loss tuning, and achieving superior performance with dramatically fewer parameters than transformer-based approaches.

Abstract: Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware a trilemma that existing architectures fail to resolve. Although convolutional networks provide local precision at $\mathcal{O}(n)$ cost but limited receptive fields, vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation in 16 medical imaging datasets that span 8 modalities demonstrates state-of-the-art performance: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over the prior art) and 80.90% on brain tumors, with consistent 3-18% improvements over specialized baselines while using 3.5–6$\times$ fewer parameters than transformer-based methods.

[201] VReID-XFD: Video-based Person Re-identification at Extreme Far Distance Challenge Results

Kailash A. Hambarde, Hugo Proença, Md Rashidunnabi, Pranita Samale, Qiwei Yang, Pingping Zhang, Zijing Gong, Yuhao Wang, Xi Zhang, Ruoshui Qu, Qiaoyun He, Yuhang Zhang, Thi Ngoc Ha Nguyen, Tien-Dung Mai, Cheng-Jun Kang, Yu-Fan Lin, Jin-Hui Jiang, Chih-Chung Hsu, Tamás Endrei, György Cserey, Ashwat Rajbhandari

Main category: cs.CV

TL;DR: VReID-XFD is a new benchmark for extreme far-distance aerial-to-ground person re-identification, featuring 371 identities across 11.75M frames captured from 5.8-120m altitudes, showing severe performance degradation with distance and nadir views.

Details

Motivation: Existing person re-identification systems fail in extreme far-distance scenarios due to severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation that undermine appearance-based assumptions.

Method: Created VReID-XFD benchmark derived from DetReIDX dataset with 371 identities, 11,288 tracklets, and 11.75M frames captured across altitudes (5.8-120m), viewing angles (30-90 degrees), and horizontal distances up to 120m. Includes strict identity-disjoint splits and rich physical metadata.

Result: Challenge attracted 10 teams with hundreds of submissions. Analysis shows monotonic performance degradation with altitude/distance, universal disadvantage of nadir views, and trade-off between peak performance and robustness. Best method (SAS-PReID) achieves only 43.93% mAP in aerial-to-ground setting.

Conclusion: VReID-XFD establishes a challenging benchmark for extreme far-distance aerial-to-ground person re-identification, revealing fundamental limitations of current methods and providing a public dataset for future research.

Abstract: Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-PReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/ .

Weihang You, Hanqi Jiang, Yi Pan, Junhao Chen, Tianming Liu, Fei Dou

Main category: cs.CV

TL;DR: NeuroAlign: A hierarchical fMRI-video alignment framework inspired by human visual system, using Neural-Temporal Contrastive Learning and enhanced vector quantization for fine-grained cross-modal matching.

Details

Motivation: Existing methods fail to capture the hierarchical and temporal nature of visual processing in the brain, reducing neural decoding to simple generation tasks or correlations without reflecting biological visual pathways.

Method: Two-stage framework: 1) Global semantic understanding via Neural-Temporal Contrastive Learning (NTCL) with bidirectional prediction between fMRI and video modalities, 2) Fine-grained pattern matching through enhanced vector quantization, plus DynaSyncMM-EMA for dynamic multi-modal fusion with adaptive weighting.

Result: NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, demonstrating superior alignment between fMRI data and video stimuli.

Conclusion: The framework establishes a new paradigm for understanding visual cognitive mechanisms by better mirroring the hierarchical organization of the human visual system.

Abstract: Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal processes of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while our DynaSyncMM-EMA approach enables dynamic multi-modal fusion with adaptive weighting. Experiments demonstrate that NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.

[203] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

Yixuan Lai, He Wang, Kun Zhou, Tianjia Shao

Main category: cs.CV

TL;DR: A video generation method that uses short reference videos instead of single images to better preserve subject identity and natural motion in generated videos.

Details

Motivation: Current methods using single reference images fail to capture temporal dynamics, leading to pose-locked motions, unnatural warping, and generic "average" faces when viewpoints and expressions change.

Method: Introduces identity-conditioned diffusion-transformer video generator using short reference videos. Uses Sinkhorn-routed encoder to learn compact identity tokens that capture subject-specific dynamics while remaining compatible with pretrained backbones.

Result: Consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.

Conclusion: Using short reference videos with learned identity tokens that capture temporal dynamics significantly improves identity preservation and motion naturalness in video generation compared to single-image conditioning.

Abstract: Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and “average” faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.

[204] Advanced Machine Learning Approaches for Enhancing Person Re-Identification Performance

Dang H. Pham, Tu N. Nguyen, Hoa N. Nguyen

Main category: cs.CV

TL;DR: This dissertation proposes three advanced person re-identification methods for different settings: supervised (SCM-ReID), unsupervised domain adaptation (IQAGA/DAPRH), and fully unsupervised (ViTC-UReID), achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: Person re-identification faces challenges including appearance variations, domain shifts, and limited labeled data, which hinder robust deployment in real-world surveillance systems. The research aims to address these limitations through improved feature learning, domain adaptation, and handling of label noise.

Method: Three approaches: 1) SCM-ReID combines supervised contrastive learning with hybrid loss optimization (classification, center, triplet, centroid-triplet losses); 2) IQAGA/DAPRH use GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement for UDA; 3) ViTC-UReID employs Vision Transformer-based feature encoding with camera-aware proxy learning and global/local attention mechanisms.

Result: Achieved state-of-the-art accuracy on Market-1501 and CUHK03 with SCM-ReID; up to 12% mAP and Rank-1 improvements in UDA scenarios; and significantly outperformed existing unsupervised approaches on large-scale benchmarks including CUHK03, Market-1501, DukeMTMC-reID, and MSMT17.

Conclusion: The dissertation advances ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, providing robust methods for real-world surveillance deployment across supervised, unsupervised domain adaptation, and fully unsupervised settings.

Abstract: Person re-identification (ReID) plays a critical role in intelligent surveillance systems by linking identities across multiple cameras in complex environments. However, ReID faces significant challenges such as appearance variations, domain shifts, and limited labeled data. This dissertation proposes three advanced approaches to enhance ReID performance under supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings. First, SCM-ReID integrates supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses), improving discriminative feature representation and achieving state-of-the-art accuracy on Market-1501 and CUHK03 datasets. Second, for UDA, IQAGA and DAPRH combine GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to mitigate domain discrepancies and enhance cross-domain generalization. Experiments demonstrate substantial gains over baseline methods, with mAP and Rank-1 improvements up to 12% in challenging transfer scenarios. Finally, ViTC-UReID leverages Vision Transformer-based feature encoding and camera-aware proxy learning to boost unsupervised ReID. By integrating global and local attention with camera identity constraints, this method significantly outperforms existing unsupervised approaches on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods. The contributions advance ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, paving the way for robust deployment in real-world surveillance systems.

[205] Garment Inertial Denoiser (GID): Endowing Accurate Motion Capture via Loose IMU Denoiser

Jiawei Fang, Ruonan Zheng, Xiaoxia Gao, Shifan Jiang, Anjun Chen, Qi Ye, Shihui Guo

Main category: cs.CV

TL;DR: GID is a lightweight Transformer that enables accurate inertial motion capture with loose-fitting garments by denoising sensor-body displacement, using location-aware experts and cross-wear fusion.

Details

Motivation: Wearable inertial MoCap requires tightly attached sensors which are intrusive for daily use. Loose-fitting garments cause sensor-body displacement that corrupts standard inertial pipelines, creating a need for denoising methods.

Method: GID uses a three-stage approach: location-specific denoising, adaptive cross-wear fusion, and general pose prediction. It employs a location-aware expert architecture with shared spatio-temporal backbone and per-IMU expert heads, plus a lightweight fusion module for cross-part consistency.

Result: GID enables accurate, real-time denoising from single-user training and generalizes across unseen users, motions, and garment types. It consistently improves state-of-the-art inertial MoCap methods as a drop-in module.

Conclusion: GID provides an effective solution for loose-wear inertial MoCap by factorizing the problem and using inductive biases for stable training, making wearable MoCap more practical for daily use with comfortable loose-fitting garments.

Abstract: Wearable inertial motion capture (MoCap) provides a portable, occlusion-free, and privacy-preserving alternative to camera-based systems, but its accuracy depends on tightly attached sensors - an intrusive and uncomfortable requirement for daily use. Embedding IMUs into loose-fitting garments is a desirable alternative, yet sensor-body displacement introduces severe, structured, and location-dependent corruption that breaks standard inertial pipelines. We propose GID (Garment Inertial Denoiser), a lightweight, plug-and-play Transformer that factorizes loose-wear MoCap into three stages: (i) location-specific denoising, (ii) adaptive cross-wear fusion, and (iii) general pose prediction. GID uses a location-aware expert architecture, where a shared spatio-temporal backbone models global motion while per-IMU expert heads specialize in local garment dynamics, and a lightweight fusion module ensures cross-part consistency. This inductive bias enables stable training and effective learning from limited paired loose-tight IMU data. We also introduce GarMoCap, a combined public and newly collected dataset covering diverse users, motions, and garments. Experiments show that GID enables accurate, real-time denoising from single-user training and generalizes across unseen users, motions, and garment types, consistently improving state-of-the-art inertial MoCap methods when used as a drop-in module.

[206] Unsupervised SE(3) Disentanglement for in situ Macromolecular Morphology Identification from Cryo-Electron Tomography

Mostofa Rafid Uddin, Mahek Vora, Qifeng Wu, Muyuan Chen, Min Xu

Main category: cs.CV

TL;DR: A deep learning framework for disentangling SE(3) transformations from morphological content in cryo-ET data to discover rare macromolecular structures.

Details

Motivation: Existing expectation-maximization methods for cryo-ET analysis often miss rare but important macromolecular morphologies and require extensive manual hyperparameter tuning, limiting their effectiveness in discovering novel structures.

Method: A disentangled deep representation learning framework that separates SE(3) transformations from morphological content using a novel multi-choice learning module specifically designed for highly noisy cryo-ET data.

Result: Experiments on simulated and real cryo-ET datasets demonstrate clear improvements over prior methods, including the discovery of previously unidentified macromolecular morphologies.

Conclusion: The proposed framework successfully addresses limitations of existing methods by enabling effective disentanglement of transformations from morphological content, leading to better discovery of rare macromolecular structures in cryo-ET data.

Abstract: Cryo-electron tomography (cryo-ET) provides direct 3D visualization of macromolecules inside the cell, enabling analysis of their in situ morphology. This morphology can be regarded as an SE(3)-invariant, denoised volumetric representation of subvolumes extracted from tomograms. Inferring morphology is therefore an inverse problem of estimating both a template morphology and its SE(3) transformation. Existing expectation-maximization based solution to this problem often misses rare but important morphologies and requires extensive manual hyperparameter tuning. Addressing this issue, we present a disentangled deep representation learning framework that separates SE(3) transformations from morphological content in the representation space. The framework includes a novel multi-choice learning module that enables this disentanglement for highly noisy cryo-ET data, and the learned morphological content is used to generate template morphologies. Experiments on simulated and real cryo-ET datasets demonstrate clear improvements over prior methods, including the discovery of previously unidentified macromolecular morphologies.

[207] ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking

Xiaobao Wei, Zhangjie Ye, Yuxiang Gu, Zunjie Zhu, Yunfei Guo, Yingying Shen, Shan Zhao, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Rongfeng Lu, Hangjun Ye

Main category: cs.CV

TL;DR: ParkRecon3D is the first benchmark for parking scene reconstruction, and ParkGaussian is a novel framework using 3D Gaussian Splatting with slot-aware reconstruction to improve both reconstruction quality and downstream parking slot detection.

Details

Motivation: Existing autonomous parking systems focus on 2D perception and localization, but lack 3D reconstruction capabilities needed for complex parking scenarios. Current reconstruction methods don't directly benefit parking tasks since they don't align with the critical parking slot perception module.

Method: 1) Created ParkRecon3D benchmark with surround-view fisheye camera data and dense parking slot annotations. 2) Proposed ParkGaussian framework using 3D Gaussian Splatting for reconstruction. 3) Introduced slot-aware reconstruction strategy that leverages existing parking perception methods to enhance synthesis quality in slot regions.

Result: ParkGaussian achieves state-of-the-art reconstruction quality on the ParkRecon3D benchmark and better preserves perception consistency for downstream parking tasks compared to existing methods.

Conclusion: The proposed ParkRecon3D benchmark and ParkGaussian framework successfully address the gap in 3D parking scene reconstruction, demonstrating that slot-aware reconstruction can improve both visual quality and downstream task performance for autonomous parking systems.

Abstract: Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: https://github.com/wm-research/ParkGaussian

[208] Evaluation of Convolutional Neural Network For Image Classification with Agricultural and Urban Datasets

Shamik Shafkat Avro, Nazira Jesmin Lina, Shahanaz Sharmin

Main category: cs.CV

TL;DR: CustomCNN architecture with residual connections, Squeeze-and-Excitation attention, progressive channel scaling, and Kaiming initialization achieves competitive performance on multi-domain image classification tasks for Smart City and agricultural applications.

Details

Motivation: To study how architectural design choices affect multi-domain image classification tasks, particularly for real-world applications in Smart City monitoring and agricultural imaging.

Method: Developed a custom CNN architecture incorporating residual connections, Squeeze-and-Excitation attention mechanisms, progressive channel scaling, and Kaiming initialization. Trained and tested on five diverse datasets covering vehicle detection, footpath encroachment, road damage/manhole detection, mango classification, and paddy variety identification.

Result: The CustomCNN delivers competitive performance compared to popular CNN architectures while maintaining computational efficiency across all five datasets.

Conclusion: Thoughtful architectural design is crucial for effective multi-domain image classification in real-world Smart City and agricultural applications, with the proposed CustomCNN demonstrating both performance and efficiency advantages.

Abstract: This paper presents the development and evaluation of a custom Convolutional Neural Network (CustomCNN) created to study how architectural design choices affect multi-domain image classification tasks. The network uses residual connections, Squeeze-and-Excitation attention mechanisms, progressive channel scaling, and Kaiming initialization to improve its ability to represent data and speed up training. The model is trained and tested on five publicly available datasets: unauthorized vehicle detection, footpath encroachment detection, polygon-annotated road damage and manhole detection, MangoImageBD and PaddyVarietyBD. A comparison with popular CNN architectures shows that the CustomCNN delivers competitive performance while remaining efficient in computation. The results underscore the importance of thoughtful architectural design for real-world Smart City and agricultural imaging applications.

[209] SwinIFS: Landmark Guided Swin Transformer For Identity Preserving Face Super Resolution

Habiba Kausar, Saeed Anwar, Omar Jamal Hammad, Abdul Bais

Main category: cs.CV

TL;DR: SwinIFS is a landmark-guided face super-resolution framework that uses facial landmark heatmaps and Swin Transformer architecture to preserve identity and restore fine details at extreme upscaling factors (up to 8x).

Details

Motivation: Face super-resolution is challenging due to loss of fine structural details and identity-specific features, especially at extreme upscaling factors where most methods fail to recover meaningful structure.

Method: Integrates dense Gaussian heatmaps of facial landmarks into input representation, uses a compact Swin Transformer backbone to capture long-range contextual information while preserving local geometry, and employs hierarchical attention mechanisms to focus on semantically important facial regions.

Result: Achieves superior perceptual quality, sharper reconstructions, and improved identity retention on CelebA benchmark; performs well even under 8x magnification where most methods fail; provides good balance between accuracy and computational efficiency.

Conclusion: SwinIFS enables identity-preserving face super-resolution at extreme upscaling factors, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration.

Abstract: Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at https://github.com/Habiba123-stack/SwinIFS.

[210] Mask-Guided Multi-Task Network for Face Attribute Recognition

Gong Gao, Zekai Wang, Jian Zhao, Ziqi Xie, Xianhui Liu, Weidong Zhao

Main category: cs.CV

TL;DR: MGMTN improves face attribute recognition by using adaptive mask learning to focus on specific facial regions and group-global feature fusion, reducing redundancy from global features.

Details

Motivation: Conventional multi-task attribute recognition methods process entire feature maps, producing redundant features due to reliance on global regions, which limits efficiency and accuracy.

Method: Proposes Mask-Guided Multi-Task Network (MGMTN) with two components: Adaptive Mask Learning (AML) uses pre-trained keypoint models to localize critical facial parts and generate group masks; Group-Global Feature Fusion (G2FF) combines group and global features for enhanced learning.

Result: Extensive experiments on two challenging facial attribute recognition datasets demonstrate MGMTN’s effectiveness in improving FAR performance.

Conclusion: MGMTN addresses limitations of conventional methods by focusing on specific feature regions and fusing group-global features, leading to more precise attribute identification and reduced negative transfer from global region usage.

Abstract: Face Attribute Recognition (FAR) plays a crucial role in applications such as person re-identification, face retrieval, and face editing. Conventional multi-task attribute recognition methods often process the entire feature map for feature extraction and attribute classification, which can produce redundant features due to reliance on global regions. To address these challenges, we propose a novel approach emphasizing the selection of specific feature regions for efficient feature learning. We introduce the Mask-Guided Multi-Task Network (MGMTN), which integrates Adaptive Mask Learning (AML) and Group-Global Feature Fusion (G2FF) to address the aforementioned limitations. Leveraging a pre-trained keypoint annotation model and a fully convolutional network, AML accurately localizes critical facial parts (e.g., eye and mouth groups) and generates group masks that delineate meaningful feature regions, thereby mitigating negative transfer from global region usage. Furthermore, G2FF combines group and global features to enhance FAR learning, enabling more precise attribute identification. Extensive experiments on two challenging facial attribute recognition datasets demonstrate the effectiveness of MGMTN in improving FAR performance.

[211] AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval

Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu

Main category: cs.CV

TL;DR: AirSpatial introduces a spatially-aware dataset and VLM for drone vehicle imagery, addressing spatial understanding limitations in remote sensing VLMs through novel spatial tasks and 3D bounding boxes.

Details

Motivation: Existing remote sensing vision-language models struggle with spatial understanding, limiting their effectiveness in real-world applications, particularly for drone-captured vehicle imagery.

Method: Created AirSpatial dataset with 206K+ instructions and novel Spatial Grounding/Spatial QA tasks with 3DBB; used two-stage training (Image Understanding Pre-training + Spatial Understanding Fine-tuning); developed AirSpatialBot agent integrating task planning, image understanding, spatial understanding, and execution.

Result: Experimental results validate the approach’s effectiveness, reveal spatial limitations of existing VLMs, and provide valuable insights. The model enables fine-grained vehicle attribute recognition and retrieval.

Conclusion: AirSpatial advances remote sensing VLMs by addressing spatial understanding gaps, introduces first remote sensing grounding dataset with 3DBB, and demonstrates practical application through AirSpatialBot for aerial vehicle analysis.

Abstract: Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3DBB. To effectively leverage existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Utilizing this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot

[212] DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He

Main category: cs.CV

TL;DR: DreamID-V is a diffusion transformer framework for video face swapping that achieves superior identity similarity and temporal consistency by leveraging image face swapping models and novel training strategies.

Details

Motivation: Existing video face swapping methods struggle to maintain identity similarity, attribute preservation, and temporal consistency simultaneously. There's a need to transfer the superiority of image face swapping to video domain while addressing limited benchmarks.

Method: Proposes SyncID-Pipe data pipeline for bidirectional ID quadruplets, DreamID-V diffusion transformer framework with Modality-Aware Conditioning module, Synthetic-to-Real Curriculum mechanism, and Identity-Coherence Reinforcement Learning strategy.

Result: DreamID-V outperforms state-of-the-art methods, demonstrates exceptional versatility across various swap-related tasks, and introduces comprehensive IDBench-V benchmark for evaluation.

Conclusion: The proposed framework successfully addresses video face swapping challenges by combining image face swapping superiority with novel video-specific techniques, achieving high-quality identity transfer with temporal consistency.

Abstract: Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.

[213] EdgeNeRF: Edge-Guided Regularization for Neural Radiance Fields from Sparse Views

Weiqi Yu, Yiyang Yao, Lin He, Jianming Lv

Main category: cs.CV

TL;DR: EdgeNeRF improves sparse-view 3D reconstruction by using edge-guided depth regularization to preserve geometric boundaries while reducing artifacts.

Details

Motivation: NeRF performance degrades significantly with sparse inputs, and existing global depth regularization methods lose geometric boundary details.

Method: Extract edges from input images, then apply depth and normal regularization only to non-edge regions to preserve boundary details while enhancing geometric consistency.

Result: Superior performance on LLFF and DTU datasets, especially in retaining sharp geometric boundaries and suppressing artifacts. The edge-guided module can be integrated into other methods as a plug-and-play component.

Conclusion: EdgeNeRF effectively addresses sparse-view reconstruction limitations by preserving boundary details through edge-guided regularization, offering a generalizable solution that improves existing methods.

Abstract: Neural Radiance Fields (NeRF) achieve remarkable performance in dense multi-view scenarios, but their reconstruction quality degrades significantly under sparse inputs due to geometric artifacts. Existing methods utilize global depth regularization to mitigate artifacts, leading to the loss of geometric boundary details. To address this problem, we propose EdgeNeRF, an edge-guided sparse-view 3D reconstruction algorithm. Our method leverages the prior that abrupt changes in depth and normals generate edges. Specifically, we first extract edges from input images, then apply depth and normal regularization constraints to non-edge regions, enhancing geometric consistency while preserving high-frequency details at boundaries. Experiments on LLFF and DTU datasets demonstrate EdgeNeRF’s superior performance, particularly in retaining sharp geometric boundaries and suppressing artifacts. Additionally, the proposed edge-guided depth regularization module can be seamlessly integrated into other methods in a plug-and-play manner, significantly improving their performance without substantially increasing training time. Code is available at https://github.com/skyhigh404/edgenerf.

[214] In defense of the two-stage framework for open-set domain adaptive semantic segmentation

Wenqi Ren, Weijie Wang, Meng Zheng, Ziyan Wu, Yang Tang, Zhun Zhong, Nicu Sebe

Main category: cs.CV

TL;DR: SATS proposes a two-stage Separating-then-Adapting Training Strategy for Open-Set Domain Adaptation in Semantic Segmentation, addressing known/unknown class imbalance through sequential separation and adaptation with hard unknown exploration.

Details

Motivation: Existing single-stage methods for Open-Set Domain Adaptation in Semantic Segmentation suffer from annotation imbalance between known and unknown classes, leading to negative transfer for known classes and underfitting for unknowns.

Method: SATS uses a two-step approach: 1) known/unknown separation, and 2) unknown-aware domain adaptation. It also introduces hard unknown exploration data augmentation to expose the model to more challenging unknown examples.

Result: Achieves substantial improvements: +3.85% H-Score for GTA5-to-Cityscapes and +18.64% for SYNTHIA-to-Cityscapes, outperforming previous state-of-the-art methods on public OSDA-SS benchmarks.

Conclusion: Separating known/unknown class learning and adapting with unknown awareness leads to more balanced feature learning and better discovery of truly unknown objects in semantic segmentation domain adaptation.

Abstract: Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) presents a significant challenge, as it requires both domain adaptation for known classes and the distinction of unknowns. Existing methods attempt to address both tasks within a single unified stage. We question this design, as the annotation imbalance between known and unknown classes often leads to negative transfer of known classes and underfitting for unknowns. To overcome these issues, we propose SATS, a Separating-then-Adapting Training Strategy, which addresses OSDA-SS through two sequential steps: known/unknown separation and unknown-aware domain adaptation. By providing the model with more accurate and well-aligned unknown classes, our method ensures a balanced learning of discriminative features for both known and unknown classes, steering the model toward discovering truly unknown objects. Additionally, we present hard unknown exploration, an innovative data augmentation method that exposes the model to more challenging unknowns, strengthening its ability to capture more comprehensive understanding of target unknowns. We evaluate our method on public OSDA-SS benchmarks. Experimental results demonstrate that our method achieves a substantial advancement, with a +3.85% H-Score improvement for GTA5-to-Cityscapes and +18.64% for SYNTHIA-to-Cityscapes, outperforming previous state-of-the-art methods.

[215] PartImageNet++ Dataset: Enhancing Visual Models with High-Quality Part Annotations

Xiao Li, Zilong Liu, Yining Liu, Zhuhong Li, Na Dong, Sitian Qin, Xiaolin Hu

Main category: cs.CV

TL;DR: PIN++ is a comprehensive part-annotated dataset for ImageNet-1K with 100K images, used to train MPM for robust classification and establish baselines for multiple downstream tasks.

Details

Motivation: Addressing the scarcity of high-quality part annotations in existing datasets for diverse object categories.

Method: Created PIN++ dataset with 100 annotated images per ImageNet-1K category, trained part segmentation network to generate pseudo labels, and developed MPM with auxiliary bypass layers supervised by both real and pseudo part annotations.

Result: Enhanced part-based models for robust object recognition and established strong baselines for part segmentation, object segmentation, and few-shot learning tasks.

Conclusion: Part annotations significantly improve model performance across multiple tasks, with PIN++ serving as a valuable resource for part-based computer vision research.

Abstract: To address the scarcity of high-quality part annotations in existing datasets, we introduce PartImageNet++ (PIN++), a dataset that provides detailed part annotations for all categories in ImageNet-1K. With 100 annotated images per category, totaling 100K images, PIN++ represents the most comprehensive dataset covering a diverse range of object categories. Leveraging PIN++, we propose a Multi-scale Part-supervised recognition Model (MPM) for robust classification on ImageNet-1K. We first trained a part segmentation network using PIN++ and used it to generate pseudo part labels for the remaining unannotated images. MPM then integrated a conventional recognition architecture with auxiliary bypass layers, jointly supervised by both pseudo part labels and the original part annotations. Furthermore, we conducted extensive experiments on PIN++, including part segmentation, object segmentation, and few-shot learning, exploring various ways to leverage part annotations in downstream tasks. Experimental results demonstrated that our approach not only enhanced part-based models for robust object recognition but also established strong baselines for multiple downstream tasks, highlighting the potential of part annotations in improving model performance. The dataset and the code are available at https://github.com/LixiaoTHU/PartImageNetPP.

Wentao Bian, Fenglei Xu

Main category: cs.CV

TL;DR: DA-FSS addresses multimodal few-shot 3D point cloud segmentation by decoupling semantic and geometric paths to resolve the Plasticity-Stability Dilemma and semantic blindness issues in existing approaches.

Details

Motivation: The paper identifies two key problems in existing multimodal few-shot 3D point cloud segmentation: (1) the Plasticity-Stability Dilemma in "Fuse-then-Refine" paradigms where early fusion causes conflicts, and (2) CLIP's inter-class confusion leading to semantic blindness in segmentation tasks.

Method: DA-FSS introduces a decoupled architecture with two parallel experts: Geometric Expert (maintains plasticity) and Semantic Expert (ensures stability). It uses a Parallel Expert Refinement module for modal correlations, a Stacked Arbitration Module for fusion and arbitration, and a Decoupled Alignment Module for knowledge transfer without confusion propagation.

Result: Experiments on S3DIS and ScanNet datasets show DA-FSS outperforms MM-FSS baseline. The model achieves superior geometric boundaries, completeness, and texture differentiation while better utilizing multimodal information.

Conclusion: Decoupling semantic and geometric pathways with mutual regularization effectively addresses the Plasticity-Stability Dilemma and semantic blindness in multimodal few-shot 3D segmentation, leading to improved generalization and performance.

Abstract: In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in “Fuse-then-Refine” paradigms: the “Plasticity-Stability Dilemma.” In addition, CLIP’s inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities’ utilization rate and better leverage each modality’s information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS.

[217] Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

Mingxing Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi

Main category: cs.CV

TL;DR: A method that uses language cues to predict uncertainty-aware calibration envelopes for recovering metric depth from relative-depth models, training only lightweight calibration heads while keeping the backbone frozen.

Details

Motivation: Monocular metric depth estimation remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity, even though relative-depth foundation models transfer well. Existing approaches struggle with noisy language cues that vary with phrasing and missing objects.

Method: Under frozen-backbone calibration, recover metric depth via image-specific affine transform in inverse depth. Use language to predict uncertainty-aware envelope bounding feasible calibration parameters rather than text-only point estimate. Use pooled multi-scale frozen visual features to select image-specific calibration within envelope. Train with closed-form least-squares oracle providing per-image supervision.

Result: Experiments on NYUv2 and KITTI show improved in-domain accuracy. Zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.

Conclusion: The proposed uncertainty-aware language-guided calibration approach effectively recovers metric depth while maintaining robustness to noisy language cues and domain shifts, outperforming language-only methods.

Abstract: Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.

[218] Domain Adaptation of Carotid Ultrasound Images using Generative Adversarial Network

Mohd Usama, Belal Ahmad, Christer Gronlund, Faleh Menawer R Althiyabi

Main category: cs.CV

TL;DR: Proposes a GAN-based domain adaptation method for ultrasound images to handle texture variations and reverberation noise across different imaging devices/settings, improving cross-domain performance without retraining.

Details

Motivation: Medical imaging models often fail when test and training data come from different devices/settings due to texture variations and noise. Retraining for each specific device is costly and impractical.

Method: Formulates domain adaptation as image-to-image translation using a novel GAN-based model that modifies texture patterns and removes reverberation noise while preserving image content.

Result: Successfully translated texture patterns and removed noise in carotid ultrasound images from three domains. Outperformed no adaptation with histogram correlation (0.960 vs 0.916) and Bhattacharya distance (0.040 vs 0.090).

Conclusion: The proposed GAN-based domain adaptation effectively addresses cross-device/setting challenges in ultrasound imaging, enabling models to work across different domains without costly retraining.

Abstract: Deep learning has been extensively used in medical imaging applications, assuming that the test and training datasets belong to the same probability distribution. However, a common challenge arises when working with medical images generated by different systems or even the same system with different parameter settings. Such images contain diverse textures and reverberation noise that violate the aforementioned assumption. Consequently, models trained on data from one device or setting often struggle to perform effectively with data from other devices or settings. In addition, retraining models for each specific device or setting is labor-intensive and costly. To address these issues in ultrasound images, we propose a novel Generative Adversarial Network (GAN)-based model. We formulated the domain adaptation tasks as an image-to-image translation task, in which we modified the texture patterns and removed reverberation noise in the test data images from the source domain to align with those in the target domain images while keeping the image content unchanged. We applied the proposed method to two datasets containing carotid ultrasound images from three different domains. The experimental results demonstrate that the model successfully translated the texture pattern of images and removed reverberation noise from the ultrasound images. Furthermore, we evaluated the CycleGAN approaches for a comparative study with the proposed model. The experimental findings conclusively demonstrated that the proposed model achieved domain adaptation (histogram correlation (0.960 (0.019), & 0.920 (0.043) and bhattacharya distance (0.040 (0.020), & 0.085 (0.048)), compared to no adaptation (0.916 (0.062) & 0.890 (0.077), 0.090 (0.070) & 0.121 (0.095)) for both datasets.

[219] Robust Ship Detection and Tracking Using Modified ViBe and Backwash Cancellation Algorithm

Mohammad Hassan Saghafi, Seyed Majid Noorhosseini, Seyed Abolfazl Seyed Javadein, Hadi Khalili

Main category: cs.CV

TL;DR: Proposed a robust real-time ship detection and tracking method for coastal video using modified ViBe algorithm with backwash cancellation.

Details

Motivation: Coastal scenarios are unpredictable with dynamic properties, requiring robust detection methods that can handle natural sea waves, light variations, and backwash interference.

Method: Modified ViBe algorithm for moving object detection that reduces probability of losing ships, quickly updates background, and uses geometrical properties of ships with brightness distortion concepts for backwash cancellation.

Result: Experimental results demonstrate outstanding performance in ship detection and tracking with real-time and precise operation.

Conclusion: The proposed strategy and methods effectively handle challenging coastal conditions for robust ship detection and tracking in real-time.

Abstract: In this paper, we propose a robust real time detection and tracking method for detecting ships in a coastal video sequences. Since coastal scenarios are unpredictable and scenes have dynamic properties it is essential to apply detection methods that are robust to these conditions. This paper presents modified ViBe for moving object detection which detects ships and backwash. In the modified ViBe the probability of losing ships is decreased in comparison with the original ViBe. It is robust to natural sea waves and variation of lights and is capable of quickly updating the background. Based on geometrical properties of ship and some concepts such as brightness distortion, a new method for backwash cancellation is proposed. Experimental results demonstrate that the proposed strategy and methods have outstanding performance in ship detection and tracking. These results also illustrate real time and precise performance of the proposed strategy.

[220] Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu

Main category: cs.CV

TL;DR: ADPO is a unified RL framework that jointly learns answer generation and self-verification within a single policy, reducing training/inference costs while improving performance across multiple benchmarks.

Details

Motivation: Parallel test-time scaling typically requires separate generation and verification models, which incurs high training and inference costs. There's a need for a more efficient unified approach that can handle both tasks simultaneously.

Method: ADPO introduces two key innovations: 1) Preference verification reward that computes mean verification scores from positive/negative samples as decision thresholds, providing feedback when prediction correctness aligns with answer correctness; 2) Advantage decoupled optimization that computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives to preserve generation quality while calibrating verification scores.

Result: ADPO achieves up to +34.1% higher verification AUC, -53.5% lower inference time, with significant gains: +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.

Conclusion: ADPO successfully demonstrates that joint learning of generation and verification within a single policy can significantly reduce computational costs while improving performance across diverse reasoning and vision-language tasks, offering an efficient alternative to parallel test-time scaling approaches.

Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.

[221] Higher-Order Domain Generalization in Magnetic Resonance-Based Assessment of Alzheimer’s Disease

Zobia Batool, Diala Lteif, Vijaya B. Kolachalama, Huseyin Ozkan, Erchan Aptoula

Main category: cs.CV

TL;DR: Extended MixStyle (EM) framework improves Alzheimer’s disease classification across different MRI datasets by blending higher-order feature moments (skewness and kurtosis) to handle domain shifts from varying scanners and protocols.

Details

Motivation: Deep learning models for Alzheimer's disease diagnosis using structural MRI often fail to generalize to new cohorts due to domain shifts from different scanners, protocols, and patient demographics. Single-domain generalization remains underexplored but critical given fragmented AD datasets.

Method: Extended MixStyle (EM) framework that blends higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations, enabling better domain generalization. Trained on NACC dataset (n=4,647) to differentiate normal cognition from mild cognitive impairment or AD.

Result: EM improves cross-domain performance when tested on three unseen cohorts (total n=3,126), enhancing macro-F1 by 2.4 percentage points on average over state-of-the-art single-domain generalization benchmarks.

Conclusion: Extended MixStyle shows promise for invariant, reliable Alzheimer’s disease detection in heterogeneous real-world settings by effectively handling domain shifts through higher-order feature moment blending.

Abstract: Despite progress in deep learning for Alzheimer’s disease (AD) diagnostics, models trained on structural magnetic resonance imaging (sMRI) often do not perform well when applied to new cohorts due to domain shifts from varying scanners, protocols and patient demographics. AD, the primary driver of dementia, manifests through progressive cognitive and neuroanatomical changes like atrophy and ventricular expansion, making robust, generalizable classification essential for real-world use. While convolutional neural networks and transformers have advanced feature extraction via attention and fusion techniques, single-domain generalization (SDG) remains underexplored yet critical, given the fragmented nature of AD datasets. To bridge this gap, we introduce Extended MixStyle (EM), a framework for blending higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations. Trained on sMRI data from the National Alzheimer’s Coordinating Center (NACC; n=4,647) to differentiate persons with normal cognition (NC) from those with mild cognitive impairment (MCI) or AD and tested on three unseen cohorts (total n=3,126), EM yields enhanced cross-domain performance, improving macro-F1 on average by 2.4 percentage points over state-of-the-art SDG benchmarks, underscoring its promise for invariant, reliable AD detection in heterogeneous real-world settings. The source code will be made available upon acceptance at https://github.com/zobia111/Extended-Mixstyle.

[222] DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion

Ziyue Zhang, Luxi Lin, Xiaolin Hu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji

Main category: cs.CV

TL;DR: DeepInv is a self-supervised diffusion inversion method that trains a parameterized solver to predict inversion noise step-by-step without ground-truth noise annotations, achieving superior performance and speed compared to existing methods.

Details

Motivation: Diffusion inversion is crucial for controllable image editing but remains challenging due to lack of viable supervision signals. Existing approximation-based methods sacrifice either performance or efficiency, creating a need for a better solution.

Method: Proposes DeepInv with: 1) self-supervised objective and data augmentation to generate pseudo noises from real images, 2) iterative multi-scale training regime to train a parameterized inversion solver for fast image-to-noise mapping, 3) first trainable solver that predicts inversion noise step-by-step.

Result: Achieves significantly better performance and inference speed: +40.435% SSIM than EasyInv and +9887.5% speed than ReNoise on COCO dataset. Demonstrates superior quantitative and qualitative results across multiple benchmarks.

Conclusion: DeepInv presents an effective self-supervised approach to diffusion inversion that outperforms existing methods in both accuracy and efficiency, offering a novel trainable solver paradigm that provides valuable insights for the research community.

Abstract: Diffusion inversion is a task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. At present, diffusion inversion still remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation-based solutions, which however are often at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective as well as a data augmentation strategy to generate high-quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver, thereby achieving the fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt of presenting a trainable solver to predict inversion noise step by step. The extensive experiments show that our DeepInv can achieve much better performance and inference speed than the compared methods, e.g., +40.435% SSIM than EasyInv and +9887.5% speed than ReNoise on COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Codes and model parameters will be released in https://github.com/potato-kitty/DeepInv.

[223] DiffKD-DCIS: Predicting Upgrade of Ductal Carcinoma In Situ with Diffusion Augmentation and Knowledge Distillation

Tao Li, Qing Li, Na Li, Hui Xie

Main category: cs.CV

TL;DR: DiffKD-DCIS framework uses conditional diffusion modeling and knowledge distillation to predict DCIS-to-IDC upgrade from ultrasound images, achieving radiologist-level accuracy with computational efficiency.

Details

Motivation: Accurate prediction of DCIS upgrade to IDC is crucial for surgical planning, but traditional deep learning methods struggle with limited ultrasound data and poor generalization.

Method: Three-stage framework: 1) Conditional diffusion model generates synthetic ultrasound images using multimodal conditions for data augmentation, 2) Deep teacher network extracts robust features from both real and synthetic data, 3) Compact student network learns from teacher via knowledge distillation for efficiency.

Result: Synthetic images showed good quality. Student network had fewer parameters and faster inference. On external test sets, it outperformed partial combinations and achieved accuracy comparable to senior radiologists, superior to junior radiologists.

Conclusion: The DiffKD-DCIS framework demonstrates significant clinical potential for predicting DCIS upgrade to IDC, balancing generalization ability with computational efficiency while achieving radiologist-level performance.

Abstract: Accurately predicting the upgrade of ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC) is crucial for surgical planning. However, traditional deep learning methods face challenges due to limited ultrasound data and poor generalization ability. This study proposes the DiffKD-DCIS framework, integrating conditional diffusion modeling with teacher-student knowledge distillation. The framework operates in three stages: First, a conditional diffusion model generates high-fidelity ultrasound images using multimodal conditions for data augmentation. Then, a deep teacher network extracts robust features from both original and synthetic data. Finally, a compact student network learns from the teacher via knowledge distillation, balancing generalization and computational efficiency. Evaluated on a multi-center dataset of 1,435 cases, the synthetic images were of good quality. The student network had fewer parameters and faster inference. On external test sets, it outperformed partial combinations, and its accuracy was comparable to senior radiologists and superior to junior ones, showing significant clinical potential.

[224] A Novel Deep Learning Method for Segmenting the Left Ventricle in Cardiac Cine MRI

Wenhui Chu, Aobo Jin, Hardik A. Gohel

Main category: cs.CV

TL;DR: GBU-Net is a novel deep learning network using group-batch-normalized U-Net architecture for precise left ventricle segmentation in short-axis cine MRI scans, achieving 97% dice score on SunnyBrook dataset.

Details

Motivation: To improve the accuracy of left ventricle segmentation in cardiac MRI scans, which is crucial for surgical robotics and medical analysis, by addressing limitations of traditional CNN-based segmentation that often miss contextual information.

Method: Developed GBU-Net based on group-batch-normalized U-Net framework with down-sampling pathway for feature extraction and up-sampling pathway for detail restoration, specifically enhanced for medical imaging with techniques for better contextual understanding.

Result: GBU-Net significantly outperforms existing methods, achieving 97% dice score on SunnyBrook testing dataset and surpassing standard metrics like dice coefficient and mean perpendicular distance.

Conclusion: GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation, making it valuable for surgical robotics and medical analysis applications.

Abstract: This research aims to develop a novel deep learning network, GBU-Net, utilizing a group-batch-normalized U-Net framework, specifically designed for the precise semantic segmentation of the left ventricle in short-axis cine MRI scans. The methodology includes a down-sampling pathway for feature extraction and an up-sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for better contextual understanding crucial in cardiac MRI segmentation. The dataset consists of 805 left ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the dice coefficient and mean perpendicular distance. GBU-Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans. Its innovative design outperforms existing methods in tests, surpassing standard metrics like the dice coefficient and mean perpendicular distance. The approach is unique in its ability to capture contextual information, often missed in traditional CNN-based segmentation. An ensemble of the GBU-Net attains a 97% dice score on the SunnyBrook testing dataset. GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.

[225] FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation

Gen Li, Peiyu Liu

Main category: cs.CV

TL;DR: VideoSpeculateRAG: An efficient VLM-based RAG framework using speculative decoding and similarity filtering to improve speed and accuracy in knowledge-intensive multimodal tasks.

Details

Motivation: Vision-Language Models struggle with integrating external knowledge efficiently. Current Retrieval-Augmented Generation methods are inefficient and often fail to maintain high answer quality, creating a need for better solutions.

Method: Two key innovations: 1) Speculative decoding pipeline with lightweight draft model generating answer candidates verified by heavyweight model, 2) Similarity-based filtering strategy to correct entity recognition errors in retrieved knowledge.

Result: Achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x.

Conclusion: Combining speculative decoding with retrieval-augmented reasoning enhances efficiency and reliability in complex, knowledge-intensive multimodal tasks.

Abstract: Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.

[226] BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding

Hongbing Li, Linhui Xiao, Zihan Zhao, Qi Shen, Yixiang Huang, Bo Xiao, Zhanyu Ma

Main category: cs.CV

TL;DR: BARE is a bias-aware and reasoning-enhanced framework for one-tower visual grounding that addresses over-entangled multimodal representations and insufficient semantic reasoning through three novel modules.

Details

Motivation: Current one-tower visual grounding architectures suffer from two main limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders comprehension of referential cues.

Method: BARE introduces a mechanism preserving modality-specific features and constructs referential semantics through three modules: language salience modulator, visual bias correction, and referential relationship enhancement to mitigate multimodal distractions and enhance referential comprehension.

Result: Extensive experiments on five benchmarks demonstrate that BARE achieves state-of-the-art performance while delivering superior computational efficiency compared to existing approaches.

Conclusion: BARE effectively addresses the limitations of current one-tower visual grounding methods by reducing modality biases and enhancing semantic reasoning, resulting in both improved performance and computational efficiency.

Abstract: Visual Grounding (VG), which aims to locate a specific region referred to by expressions, is a fundamental yet challenging task in the multimodal understanding fields. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulator, (ii) visual bias correction and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at https://github.com/Marloweeee/BARE.

[227] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander

Main category: cs.CV

TL;DR: DrivingGen is the first comprehensive benchmark for generative driving world models that addresses limitations in current evaluations by introducing diverse datasets and new metrics for visual realism, trajectory plausibility, temporal coherence, and controllability.

Details

Motivation: The field of driving world models lacks rigorous benchmarks despite growing research. Existing evaluations have critical gaps: generic video metrics overlook safety factors, trajectory plausibility is rarely quantified, temporal/agent consistency is neglected, controllability is ignored, and current datasets lack diversity for real-world deployment.

Method: Created DrivingGen benchmark combining: 1) Diverse evaluation dataset curated from driving datasets and internet-scale video sources covering varied weather, time of day, geographic regions, and complex maneuvers; 2) Suite of new metrics assessing visual realism, trajectory plausibility, temporal coherence, and controllability.

Result: Benchmarked 14 state-of-the-art models revealing clear trade-offs: general models look visually better but break physics, while driving-specific models capture motion realistically but lag in visual quality.

Conclusion: DrivingGen provides a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making in autonomous driving.

Abstract: Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

[228] Improving Flexible Image Tokenizers for Autoregressive Image Generation

Zixuan Fu, Lanqing Guo, Chong Wang, Binbin Song, Ding Liu, Bihan Wen

Main category: cs.CV

TL;DR: ReToK is a flexible image tokenizer that uses redundant token padding and hierarchical semantic regularization to better distribute image information across all tokens, improving autoregressive image generation performance.

Details

Motivation: Current flexible image tokenizers using nested dropout concentrate image information in early tokens, limiting effectiveness for autoregressive image generation as token length increases.

Method: Proposes ReToK with two key components: 1) Redundant Token Padding to activate tail tokens more frequently, and 2) Hierarchical Semantic Regularization that aligns earlier tokens with pre-trained vision foundation features while reducing regularization strength toward tail tokens.

Result: Achieves superior generation performance on ImageNet 256×256 compared to both flexible and fixed-length tokenizers.

Conclusion: ReToK effectively overcomes limitations of existing flexible tokenizers by better distributing image information across all tokens, enabling enhanced latent modeling for autoregressive image generation.

Abstract: Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose \textbf{ReToK}, a flexible tokenizer with \underline{Re}dundant \underline{Tok}en Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce \textbf{Redundant Token Padding} to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply \textbf{Hierarchical Semantic Regularization} to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256$\times$256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: \href{https://github.com/zfu006/ReTok}{https://github.com/zfu006/ReTok}

[229] FAR-AMTN: Attention Multi-Task Network for Face Attribute Recognition

Gong Gao, Zekai Wang, Xianhui Liu, Weidong Zhao

Main category: cs.CV

TL;DR: FAR-AMTN: An attention-based multi-task network for face attribute recognition that reduces parameters while improving accuracy through weight-sharing attention, cross-group feature fusion, and dynamic task weighting.

Details

Motivation: Traditional multi-task networks for face attribute recognition suffer from exponential parameter growth with added tasks and limited high-level feature interaction, which hinders exploration of semantic relations among attributes and negatively affects generalization performance.

Method: Proposes FAR-AMTN with three key components: 1) Weight-Shared Group-Specific Attention (WSGSA) module with shared parameters to reduce complexity while improving group feature representation, 2) Cross-Group Feature Fusion (CGFF) module to enable interactions between attribute groups, and 3) Dynamic Weighting Strategy (DWS) for synchronized task convergence.

Result: Experiments on CelebA and LFWA datasets show that FAR-AMTN achieves superior accuracy with significantly fewer parameters compared to existing models.

Conclusion: The proposed FAR-AMTN effectively addresses parameter explosion and limited feature interaction in traditional multi-task networks, demonstrating improved generalization for face attribute recognition through attention mechanisms, cross-group fusion, and dynamic task weighting.

Abstract: To enhance the generalization performance of Multi-Task Networks (MTN) in Face Attribute Recognition (FAR), it is crucial to share relevant information across multiple related prediction tasks effectively. Traditional MTN methods create shared low-level modules and distinct high-level modules, causing an exponential increase in model parameters with the addition of tasks. This approach also limits feature interaction at the high level, hindering the exploration of semantic relations among attributes, thereby affecting generalization negatively. In response, this study introduces FAR-AMTN, a novel Attention Multi-Task Network for FAR. It incorporates a Weight-Shared Group-Specific Attention (WSGSA) module with shared parameters to minimize complexity while improving group feature representation. Furthermore, a Cross-Group Feature Fusion (CGFF) module is utilized to foster interactions between attribute groups, enhancing feature learning. A Dynamic Weighting Strategy (DWS) is also introduced for synchronized task convergence. Experiments on the CelebA and LFWA datasets demonstrate that the proposed FAR-AMTN demonstrates superior accuracy with significantly fewer parameters compared to existing models.

[230] EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding

Tianjun Gu, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan

Main category: cs.CV

TL;DR: The paper introduces Teleo-Spatial Intelligence (TSI), a new paradigm combining physical-dynamic reasoning and intent-driven reasoning, along with EscherVerse benchmark/dataset/models to advance spatial intelligence research.

Details

Motivation: Current spatial reasoning research overlooks human intent behind spatial changes, focusing only on physical dynamics without understanding the goals and purposes driving those changes.

Method: Proposed Teleo-Spatial Intelligence (TSI) paradigm with two pillars: Physical-Dynamic Reasoning and Intent-Driven Reasoning. Created EscherVerse including Escher-Bench benchmark, Escher-35k dataset from real-world videos, and Escher series models with novel data curation pipeline.

Result: EscherVerse provides first benchmark to systematically assess Intent-Driven Reasoning, evaluating object permanence, state transitions, trajectory prediction in dynamic human-centric scenarios, moving beyond constrained settings.

Conclusion: The work advances spatial intelligence from passive scene description toward holistic, purpose-driven understanding of the world, providing foundational resource for TSI research.

Abstract: The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning–understanding the physical principles of object interactions–and Intent-Driven Reasoning–inferring the human goals behind these actions. To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent’s ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.

[231] Guiding Token-Sparse Diffusion Models

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

Main category: cs.CV

TL;DR: Sparse Guidance (SG) improves inference quality for sparsely trained diffusion models by using token-level sparsity instead of conditional dropout, achieving better fidelity with fewer FLOPs.

Details

Motivation: Sparsely trained diffusion models are cheaper to train but struggle during inference due to poor response to Classifier-free Guidance (CFG), leading to underwhelming performance.

Method: Proposes Sparse Guidance (SG) which uses token-level sparsity instead of conditional dropout to guide diffusion models, preserving high-variance of conditional predictions better.

Result: Achieves 1.58 FID on ImageNet-256 with 25% fewer FLOPs, up to 58% FLOP savings at matched baseline quality, and trains a 2.5B text-to-image model with improved composition and human preference scores.

Conclusion: Sparse Guidance enables efficient inference for sparsely trained diffusion models while maintaining or improving output quality, making diffusion models more practical for real-world applications.

Abstract: Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.

[232] CAP-IQA: Context-Aware Prompt-Guided CT Image Quality Assessment

Kazi Ramisa Rifa, Jie Zhang, Abdullah Imran

Main category: cs.CV

TL;DR: CAP-IQA framework integrates text-level priors with instance-level context prompts and causal debiasing for CT image quality assessment, outperforming existing methods on benchmark datasets.

Details

Motivation: Prompt-based methods for CT IQA are under-explored and often introduce bias by reflecting idealized definitions that don't hold under real-world degradations like noise, motion artifacts, or scanner variability.

Method: Proposes Context-Aware Prompt-guided IQA (CAP-IQA) framework that integrates text-level priors with instance-level context prompts, applies causal debiasing, combines CNN-based visual encoder with domain-specific text encoder, and uses radiology-style prompts with context-aware fusion.

Result: Achieves overall correlation score of 2.8590 on 2023 LDCTIQA challenge benchmark, surpassing top-ranked team by 4.24%. Also demonstrates generalizability on in-house dataset of 91,514 pediatric CT images.

Conclusion: CAP-IQA effectively addresses bias in prompt-based CT IQA through context-aware fusion and causal debiasing, showing superior performance and generalizability across different patient populations.

Abstract: Prompt-based methods, which encode medical priors through descriptive text, have been only minimally explored for CT Image Quality Assessment (IQA). While such prompts can embed prior knowledge about diagnostic quality, they often introduce bias by reflecting idealized definitions that may not hold under real-world degradations such as noise, motion artifacts, or scanner variability. To address this, we propose the Context-Aware Prompt-guided Image Quality Assessment (CAP-IQA) framework, which integrates text-level priors with instance-level context prompts and applies causal debiasing to separate idealized knowledge from factual, image-specific degradations. Our framework combines a CNN-based visual encoder with a domain-specific text encoder to assess diagnostic visibility, anatomical clarity, and noise perception in abdominal CT images. The model leverages radiology-style prompts and context-aware fusion to align semantic and perceptual representations. On the 2023 LDCTIQA challenge benchmark, CAP-IQA achieves an overall correlation score of 2.8590 (sum of PLCC, SROCC, and KROCC), surpassing the top-ranked leaderboard team (2.7427) by 4.24%. Moreover, our comprehensive ablation experiments confirm that prompt-guided fusion and the simplified encoder-only design jointly enhance feature alignment and interpretability. Furthermore, evaluation on an in-house dataset of 91,514 pediatric CT images demonstrates the true generalizability of CAP-IQA in assessing perceptual fidelity in a different patient population.

[233] An Empirical Study of Monocular Human Body Measurement Under Weak Calibration

Gaurav Sekar

Main category: cs.CV

TL;DR: Empirical study comparing three weakly calibrated monocular methods for human body measurement from RGB images, focusing on calibration assumptions rather than SOTA accuracy.

Details

Motivation: Human body measurement from monocular RGB images is challenging due to scale ambiguity, viewpoint sensitivity, and lack of depth information. The paper aims to provide empirical insights for lightweight systems on consumer devices.

Method: Systematic empirical study of three weakly calibrated monocular strategies: 1) landmark-based geometry, 2) pose-driven regression, and 3) object-calibrated silhouettes. Evaluated under semi-constrained conditions using consumer-grade cameras.

Result: Reveals clear trade-off between user effort during calibration and stability of resulting circumferential quantities. Analyzes how different calibration assumptions influence measurement behavior, robustness, and failure modes across varied body types.

Conclusion: The paper serves as an empirical design reference for lightweight monocular human measurement systems intended for deployment on consumer devices, providing insights into calibration trade-offs rather than pursuing state-of-the-art accuracy.

Abstract: Estimating human body measurements from monocular RGB imagery remains challenging due to scale ambiguity, viewpoint sensitivity, and the absence of explicit depth information. This work presents a systematic empirical study of three weakly calibrated monocular strategies: landmark-based geometry, pose-driven regression, and object-calibrated silhouettes, evaluated under semi-constrained conditions using consumer-grade cameras. Rather than pursuing state-of-the-art accuracy, the study analyzes how differing calibration assumptions influence measurement behavior, robustness, and failure modes across varied body types. The results reveal a clear trade-off between user effort during calibration and the stability of resulting circumferential quantities. This paper serves as an empirical design reference for lightweight monocular human measurement systems intended for deployment on consumer devices.

[234] Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and Shadows

Aymen Mir, Riza Alp Guler, Jian Wang, Gerard Pons-Moll, Bing Zhou

Main category: cs.CV

TL;DR: Deep Gaussian Shadow Maps (DGSM) enable consistent lighting and shadows for animated 3D Gaussian Splatting avatars interacting with 3DGS scenes, using volumetric shadow computation without meshing, combined with spherical harmonic relighting.

Details

Motivation: There's a need for consistent lighting and shadows when animated 3D Gaussian Splatting avatars interact with 3DGS scenes or dynamic objects in static scenes, avoiding the limitations of traditional meshing approaches.

Method: 1) Deep Gaussian Shadow Maps (DGSM): A modern shadow mapping algorithm tailored to volumetric 3DGS representation that computes transmittance over concentric radial shells stored in octahedral atlases for real-time GPU sampling. 2) Spherical Harmonic Relighting: Approximates local environment illumination with HDRI probes in SH basis and applies fast per-Gaussian radiance transfer without explicit BRDF estimation.

Result: Demonstrated environment-consistent lighting for avatars from AvatarX and ActorsHQ composited into ScanNet++, DL3DV, and SuperSplat scenes, showing interactions with inserted objects. The system works fully in volumetric 3DGS representation, yielding coherent shadows and relighting without meshing.

Conclusion: DGSM and SH relighting provide a complete solution for consistent lighting and shadows in 3D Gaussian Splatting scenes, enabling realistic avatar interactions while maintaining the volumetric representation advantages and avoiding meshing requirements.

Abstract: We present a method for consistent lighting and shadows when animated 3D Gaussian Splatting (3DGS) avatars interact with 3DGS scenes or with dynamic objects inserted into otherwise static scenes. Our key contribution is Deep Gaussian Shadow Maps (DGSM), a modern analogue of the classical shadow mapping algorithm tailored to the volumetric 3DGS representation. Building on the classic deep shadow mapping idea, we show that 3DGS admits closed form light accumulation along light rays, enabling volumetric shadow computation without meshing. For each estimated light, we tabulate transmittance over concentric radial shells and store them in octahedral atlases, which modern GPUs can sample in real time per query to attenuate affected scene Gaussians and thus cast and receive shadows consistently. To relight moving avatars, we approximate the local environment illumination with HDRI probes represented in a spherical harmonic (SH) basis and apply a fast per Gaussian radiance transfer, avoiding explicit BRDF estimation or offline optimization. We demonstrate environment consistent lighting for avatars from AvatarX and ActorsHQ, composited into ScanNet++, DL3DV, and SuperSplat scenes, and show interactions with inserted objects. Across single and multi avatar settings, DGSM and SH relighting operate fully in the volumetric 3DGS representation, yielding coherent shadows and relighting while avoiding meshing.

[235] LabelAny3D: Label Any Object 3D in the Wild

Jin Yao, Radowan Mahmud Redoy, Sebastian Elbaum, Matthew B. Dwyer, Zezhou Cheng

Main category: cs.CV

TL;DR: LabelAny3D is an analysis-by-synthesis framework that generates high-quality 3D bounding box annotations from 2D images, enabling creation of COCO3D benchmark for open-vocabulary monocular 3D detection.

Details

Motivation: Existing monocular 3D detection models struggle with in-the-wild images due to lack of 3D datasets and challenges of 3D annotation. There's a need for scalable 3D recognition in realistic, open-world settings.

Method: LabelAny3D uses an analysis-by-synthesis framework to reconstruct holistic 3D scenes from 2D images, efficiently producing high-quality 3D bounding box annotations. Built on this pipeline, they create COCO3D benchmark from MS-COCO dataset.

Result: Annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in quality. COCO3D covers wide range of object categories absent from existing 3D datasets.

Conclusion: The results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings, addressing the data scarcity problem in monocular 3D detection.

Abstract: Detecting objects in 3D space from monocular input is crucial for applications ranging from robotics to scene understanding. Despite advanced performance in the indoor and autonomous driving domains, existing monocular 3D detection models struggle with in-the-wild images due to the lack of 3D in-the-wild datasets and the challenges of 3D annotation. We introduce LabelAny3D, an \emph{analysis-by-synthesis} framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations. Built on this pipeline, we present COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings.

[236] Trustworthy Data-Driven Wildfire Risk Prediction and Understanding in Western Canada

Zhengsen Xu, Lanying Wang, Sibo Cheng, Xue Rui, Kyle Gao, Yimin Zhu, Mabel Heffring, Zack Dewis, Saeid Taleghanidoozdoozan, Megan Greenwood, Motasem Alkayid, Quinn Ledingham, Hongjie He, Jonathan Li, Lincoln Linlin Xu

Main category: cs.CV

TL;DR: A trustworthy data-driven framework for wildfire risk prediction in western Canada using multi-scale temporal modeling with uncertainty quantification and interpretability, achieving high accuracy (F1=0.90) during record-breaking 2023-2024 fire seasons.

Details

Motivation: Wildfire intensification in western Canada causes substantial losses, but accurate prediction is challenging due to stochastic ignition/spread and complex nonlinear interactions among multiple drivers, limiting reliability and interpretability of purely data-driven models.

Method: A trustworthy data-driven wildfire risk prediction framework based on long-sequence, multi-scale temporal modeling that integrates heterogeneous drivers while explicitly quantifying predictive uncertainty and enabling process-level interpretation using SHAP analysis.

Result: The model outperforms existing time-series approaches, achieving F1 score of 0.90 and PR-AUC of 0.98 with low computational cost. Uncertainty analysis reveals spatial/seasonal patterns in predictive confidence, and SHAP interpretation shows temperature drivers dominate risk in both years, with moisture constraints playing stronger role in 2024 spatial contrasts.

Conclusion: The proposed framework provides accurate, computationally efficient wildfire risk prediction with uncertainty quantification and mechanistic interpretation, offering valuable insights for wildfire management and demonstrating the importance of trustworthy AI for environmental risk assessment.

Abstract: In recent decades, the intensification of wildfire activity in western Canada has resulted in substantial socio-economic and environmental losses. Accurate wildfire risk prediction is hindered by the intrinsic stochasticity of ignition and spread and by nonlinear interactions among fuel conditions, meteorology, climate variability, topography, and human activities, challenging the reliability and interpretability of purely data-driven models. We propose a trustworthy data-driven wildfire risk prediction framework based on long-sequence, multi-scale temporal modeling, which integrates heterogeneous drivers while explicitly quantifying predictive uncertainty and enabling process-level interpretation. Evaluated over western Canada during the record-breaking 2023 and 2024 fire seasons, the proposed model outperforms existing time-series approaches, achieving an F1 score of 0.90 and a PR-AUC of 0.98 with low computational cost. Uncertainty-aware analysis reveals structured spatial and seasonal patterns in predictive confidence, highlighting increased uncertainty associated with ambiguous predictions and spatiotemporal decision boundaries. SHAP-based interpretation provides mechanistic understanding of wildfire controls, showing that temperature-related drivers dominate wildfire risk in both years, while moisture-related constraints play a stronger role in shaping spatial and land-cover-specific contrasts in 2024 compared to the widespread hot and dry conditions of 2023. Data and code are available at https://github.com/SynUW/mmFire.

[237] Evaluating Deep Learning-Based Face Recognition for Infants and Toddlers: Impact of Age Across Developmental Stages

Afzal Hossain, Mst Rumana Sumi, Stephanie Schuckers

Main category: cs.CV

TL;DR: This paper evaluates deep learning face recognition models on infants/toddlers (0-3 years), finding poor performance in early months (30.7% TAR) that improves with age (64.7% TAR at 2.5-3 years). A DANN approach reduces temporal drift, improving TAR by 12%.

Details

Motivation: Face recognition for infants/toddlers faces unique challenges: rapid facial changes, high similarity between children, and limited datasets. This is critical for smart city applications like healthcare, child safety, and digital identity services where reliable biometric systems over time are needed.

Method: Evaluated four deep learning models (FaceNet, ArcFace, MagFace, CosFace) on a longitudinal dataset collected over 24 months with 7 sessions. Analyzed recognition accuracy across developmental stages and time intervals. Applied Domain Adversarial Neural Network (DANN) to mitigate temporal embedding drift.

Result: Performance was poor for infants 0-6 months (30.7% TAR at 0.1% FAR) due to unstable facial features. Improved significantly with age (64.7% TAR at 0.1% FAR for 2.5-3 years). Shorter time gaps yielded higher accuracy. DANN improved TAR by over 12%, creating more temporally stable features.

Conclusion: Infant/toddler face recognition is challenging but improves with age. Temporal drift is a major issue that can be mitigated with domain adaptation techniques like DANN. Future research should focus on privacy-preserving biometric systems that handle temporal variability for secure urban applications.

Abstract: Face recognition for infants and toddlers presents unique challenges due to rapid facial morphology changes, high inter-class similarity, and limited dataset availability. This study evaluates the performance of four deep learning-based face recognition models FaceNet, ArcFace, MagFace, and CosFace on a newly developed longitudinal dataset collected over a 24 month period in seven sessions involving children aged 0 to 3 years. Our analysis examines recognition accuracy across developmental stages, showing that the True Accept Rate (TAR) is only 30.7% at 0.1% False Accept Rate (FAR) for infants aged 0 to 6 months, due to unstable facial features. Performance improves significantly in older children, reaching 64.7% TAR at 0.1% FAR in the 2.5 to 3 year age group. We also evaluate verification performance over different time intervals, revealing that shorter time gaps result in higher accuracy due to reduced embedding drift. To mitigate this drift, we apply a Domain Adversarial Neural Network (DANN) approach that improves TAR by over 12%, yielding features that are more temporally stable and generalizable. These findings are critical for building biometric systems that function reliably over time in smart city applications such as public healthcare, child safety, and digital identity services. The challenges observed in early age groups highlight the importance of future research on privacy preserving biometric authentication systems that can address temporal variability, particularly in secure and regulated urban environments where child verification is essential.

[238] FALCON: Few-Shot Adversarial Learning for Cross-Domain Medical Image Segmentation

Abdur R. Fayjie, Pankhi Kashyap, Jutika Borah, Patrick Vandewalle

Main category: cs.CV

TL;DR: FALCON is a cross-domain few-shot 3D medical segmentation framework that processes data as 2D slices, achieving superior boundary accuracy with minimal labeled data and computational overhead.

Details

Motivation: Accurate 3D medical segmentation is crucial for diagnosis and treatment but faces challenges: scarcity of 3D annotations, patient variability, data privacy concerns, and high computational costs.

Method: Meta-training on natural images to learn generalizable segmentation priors, then adversarial fine-tuning and boundary-aware learning for medical domain adaptation. Task-aware inference adapts dynamically to patient-specific variations across slices.

Result: Consistently achieves lowest Hausdorff Distance scores (superior boundary accuracy) while maintaining comparable Dice Similarity Coefficient to state-of-the-art models on four benchmarks, with significantly less labeled data, no augmentation, and lower computational overhead.

Conclusion: FALCON enables clinically viable 3D medical segmentation by addressing key practical challenges through cross-domain few-shot learning, achieving precise boundary delineation with minimal data and computational requirements.

Abstract: Precise delineation of anatomical and pathological structures within 3D medical volumes is crucial for accurate diagnosis, effective surgical planning, and longitudinal disease monitoring. Despite advancements in AI, clinically viable segmentation is often hindered by the scarcity of 3D annotations, patient-specific variability, data privacy concerns, and substantial computational overhead. In this work, we propose FALCON, a cross-domain few-shot segmentation framework that achieves high-precision 3D volume segmentation by processing data as 2D slices. The framework is first meta-trained on natural images to learn-to-learn generalizable segmentation priors, then transferred to the medical domain via adversarial fine-tuning and boundary-aware learning. Task-aware inference, conditioned on support cues, allows FALCON to adapt dynamically to patient-specific anatomical variations across slices. Experiments on four benchmarks demonstrate that FALCON consistently achieves the lowest Hausdorff Distance scores, indicating superior boundary accuracy while maintaining a Dice Similarity Coefficient comparable to the state-of-the-art models. Notably, these results are achieved with significantly less labeled data, no data augmentation, and substantially lower computational overhead.

[239] Mitigating Longitudinal Performance Degradation in Child Face Recognition Using Synthetic Data

Afzal Hossain, Stephanie Schuckers

Main category: cs.CV

TL;DR: Synthetic face data improves child face recognition by reducing verification errors over time through synthetic-augmented fine-tuning.

Details

Motivation: Child face recognition is challenging due to rapid facial growth causing template drift and increasing verification errors over time. The paper investigates whether synthetic face data can act as a longitudinal stabilizer to improve temporal robustness.

Method: Three settings evaluated on YFA dataset: (1) pretrained MagFace embeddings without fine-tuning, (2) MagFace fine-tuned with authentic faces only, (3) MagFace fine-tuned with combination of authentic and synthetic faces. Synthetic data generated using StyleGAN2 ADA with post-generation filtering to prevent identity leakage and remove artifacts.

Result: Synthetic-augmented fine-tuning substantially reduces error rates across enrollment verification gaps from 6 to 36 months compared to both pretrained baseline and real-only fine-tuning.

Conclusion: Synthetic augmentation provides a risk-aware approach to improving identity persistence in pediatric face recognition by acting as a longitudinal stabilizer against facial growth-related template drift.

Abstract: Longitudinal face recognition in children remains challenging due to rapid and nonlinear facial growth, which causes template drift and increasing verification errors over time. This work investigates whether synthetic face data can act as a longitudinal stabilizer by improving temporal robustness of child face recognition models. Using an identity disjoint protocol on the Young Face Aging (YFA) dataset, we evaluate three settings: (i) pretrained MagFace embeddings without dataset specific fine-tuning, (ii) MagFace fine-tuned using authentic training faces only, and (iii) MagFace fine-tuned using a combination of authentic and synthetically generated training faces. Synthetic data is generated using StyleGAN2 ADA and incorporated exclusively within the training identities; a post generation filtering step is applied to mitigate identity leakage and remove artifact affected samples. Experimental results across enrollment verification gaps from 6 to 36 months show that synthetic-augmented fine tuning substantially reduces error rates relative to both the pretrained baseline and real only fine tuning. These findings provide a risk aware assessment of synthetic augmentation for improving identity persistence in pediatric face recognition.

[240] Learnability-Driven Submodular Optimization for Active Roadside 3D Detection

Ruiyu Mao, Baoming Zhang, Nicholas Ruozzi, Yunhui Guo

Main category: cs.CV

TL;DR: Active learning framework for roadside monocular 3D object detection that selects scenes both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage.

Details

Motivation: Real-world roadside perception deployment often requires annotation of roadside-only data due to hardware and privacy constraints, but many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view. These inherently ambiguous samples increase annotation difficulty and cost, revealing a fundamental learnability problem.

Method: Proposes LH3D, a learnability-driven active learning framework for roadside monocular 3D object detection. The method selects scenes that are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage of the data distribution.

Result: LH3D achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I dataset, significantly outperforming uncertainty-based baselines.

Conclusion: Learnability, not uncertainty, matters for roadside 3D perception. The proposed framework effectively reduces wasted annotation effort on inherently ambiguous samples while obtaining high-performing models through strategic sample selection.

Abstract: Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LIDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle–roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.

[241] Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems

Yian Liu, Xiong Wang, Ping Xu, Lei Zhu, Ming Yan, Linyun Xue

Main category: cs.CV

TL;DR: The paper proposes a Covariance Distribution Optimization (CDO) module that improves lane detection accuracy for embedded systems by aligning feature distributions with ground truth, without increasing computational complexity.

Details

Motivation: Real-time lane detection in embedded systems faces challenges due to subtle visual signals in RGB images and constraints of limited computational resources and power consumption. Existing deep learning models lack universally applicable optimization techniques for low-power embedded environments.

Method: Proposes a Covariance Distribution Optimization (CDO) module that aligns lane feature distributions closely with ground-truth labels to enhance detection accuracy without increasing computational complexity. The module can be easily integrated into existing systems without structural modifications.

Result: Evaluated on six diverse models across segmentation-based, anchor-based, and curve-based methods, including real-time optimized and SOTA models. Tested on CULane, TuSimple, and LLAMAS datasets, showing accuracy improvements ranging from 0.01% to 1.5%.

Conclusion: The CDO module offers substantial benefits in performance, power efficiency, and operational flexibility for embedded systems, using existing model parameters to facilitate ongoing training while maintaining computational efficiency.

Abstract: Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often constrained by limited computational resources and power consumption. Although deep learning models for lane detection categorized into segmentation-based, anchor-based, and curve-based methods there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.

[242] FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

Xijie Huang, Chengming Xu, Donghao Luo, Xiaobin Hu, Peng Tang, Xu Peng, Jiangning Zhang, Chengjie Wang, Yanwei Fu

Main category: cs.CV

TL;DR: The paper introduces FFP-300K, a large-scale dataset for First-Frame Propagation video editing, and proposes a guidance-free framework with Adaptive Spatio-Temporal RoPE and self-distillation to resolve appearance-motion tension.

Details

Motivation: Existing First-Frame Propagation methods rely on cumbersome run-time guidance due to inadequate training datasets that are too short, low-resolution, and lack task diversity, preventing robust temporal priors learning.

Method: 1) Created FFP-300K dataset with 300K high-fidelity video pairs (720p, 81 frames) via two-track pipeline for diverse edits. 2) Proposed guidance-free framework with Adaptive Spatio-Temporal RoPE to disentangle appearance/motion references. 3) Used self-distillation with identity propagation as regularizer for temporal stability.

Result: Significantly outperforms existing academic and commercial models on EditVerseBench benchmark, achieving ~0.2 PickScore and ~0.3 VLM score improvements against competitors.

Conclusion: The proposed dataset and framework enable true guidance-free First-Frame Propagation by resolving the appearance-motion tension through architectural innovations and self-distillation, achieving state-of-the-art performance in controllable video editing.

Abstract: First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforming existing academic and commercial models by receiving about 0.2 PickScore and 0.3 VLM score improvement against these competitors.

[243] Point-SRA: Self-Representation Alignment for 3D Representation Learning

Lintong Wei, Jian Lu, Haozhe Cheng, Jihua Zhu, Kaibing Zhang

Main category: cs.CV

TL;DR: Point-SRA improves 3D representation learning by using multi-level masking ratios and probabilistic modeling with MeanFlow Transformer, achieving state-of-the-art performance on various 3D tasks.

Details

Motivation: Existing masked autoencoder methods for 3D point clouds have limitations: fixed mask ratios ignore multi-level correlations, and point-wise reconstruction assumptions conflict with point cloud diversity. There's a need to better capture geometric structures and semantic information.

Method: Proposes Point-SRA with: 1) Different masking ratios in MAE to capture complementary geometric/semantic information, 2) MeanFlow Transformer using cross-modal conditional embeddings for diverse probabilistic reconstruction, 3) Dual Self-Representation Alignment at both MAE and MFT levels, 4) Flow-Conditioned Fine-Tuning Architecture to leverage learned point cloud distributions.

Result: Outperforms Point-MAE by 5.37% on ScanObjectNN; achieves 96.07% mean IoU for arteries and 86.87% for aneurysms in intracranial aneurysm segmentation; reaches 47.3% AP@50 for 3D object detection, surpassing MaskPoint by 5.12%.

Conclusion: Point-SRA effectively addresses limitations of existing MAE methods by capturing multi-level representations through self-distillation and probabilistic modeling, demonstrating superior performance across diverse 3D tasks including classification, segmentation, and detection.

Abstract: Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point cloud. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in MFT also exhibit complementarity. Therefore, a Dual Self-Representation Alignment mechanism is proposed at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.

[244] MANGO:Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement

Lei Zhu, Lijian Lin, Ye Zhu, Jiahao Wu, Xuehan Hou, Yu Li, Yunfei Liu, Jie Chen

Main category: cs.CV

TL;DR: MANGO: A two-stage framework for generating realistic 3D conversational avatars using pure image-level supervision, addressing limitations of existing pseudo-3D label methods.

Details

Motivation: Current audio-driven 3D head generation lacks natural bidirectional listen-and-speak interaction and relies on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics in conversational settings.

Method: Two-stage framework: 1) Diffusion-based transformer with dual-audio interaction module models 3D motion from multi-speaker audio, 2) Fast 3D Gaussian Renderer generates high-fidelity images for 2D photometric supervision via alternate training to mitigate pseudo-3D label noise.

Result: Method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing fidelity and controllability of audio-driven talking heads, validated through extensive experiments.

Conclusion: MANGO framework successfully addresses conversational avatar generation challenges by leveraging pure image-level supervision and alternate training, enabling seamless bidirectional listen-and-speak interactions with high-fidelity 3D motion.

Abstract: Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce a novel two-stage framework MANGO, which leveraging pure image-level supervision by alternately training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.

[245] CTIS-QA: Clinical Template-Informed Slide-level Question Answering for Pathology

Hao Lu, Ziniu Qian, Yifu Li, Yang Zhou, Bingzheng Wei, Yan Xu

Main category: cs.CV

TL;DR: A clinical pathology template-based pipeline for extracting structured diagnostic information from pathology reports, used to create vision-language datasets and a slide-level QA model that outperforms SOTA methods.

Details

Motivation: To systematically collect and structure pathological information from pathology reports in a standardized way, enabling better vision-language alignment and clinically grounded slide understanding for diagnostic workflows.

Method: 1) Design Clinical Pathology Report Template (CPRT) based on CAP Cancer Protocols; 2) Extract pathological features from reports; 3) Build CTIS-Align dataset (80k slide-description pairs) and CTIS-Bench VQA benchmark (977 WSIs, 14,879 QA pairs); 4) Propose CTIS-QA model with dual-stream architecture (global context via clustering-based aggregation + local regions via attention-guided patch perception).

Result: CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks. Validation performed on TCGA-BRCA dataset.

Conclusion: The proposed pipeline enables comprehensive and standardized extraction of diagnostic elements, and the CTIS-QA model effectively mimics pathologists’ diagnostic approach, demonstrating superior performance in clinically grounded slide understanding tasks.

Abstract: In this paper, we introduce a clinical diagnosis template-based pipeline to systematically collect and structure pathological information. In collaboration with pathologists and guided by the the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a Slide-level Question Answering model, featuring a dual-stream architecture that mimics pathologists’ diagnostic approach. One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics. Code and data are available at https://github.com/HLSvois/CTIS-QA.

[246] Subimage Overlap Prediction: Task-Aligned Self-Supervised Pretraining For Semantic Segmentation In Remote Sensing Imagery

Lakshay Sharma, Alex Marin

Main category: cs.CV

TL;DR: Subimage Overlap Prediction is a novel SSL method for remote sensing segmentation that uses less pretraining data by predicting sub-image locations within original images.

Details

Motivation: Most SSL methods require vast pretraining data, which is problematic for remote sensing where labeled data is scarce. Need efficient SSL that works with limited pretraining imagery.

Method: Extract sub-images from original images and train model to predict semantic mask of where sub-image fits in original image. Creates self-supervised task for segmentation.

Result: Faster convergence and equal/better mIoU on downstream segmentation. Performance gap widens with less labeled data. Works across architectures and datasets with less pretraining data than other SSL methods.

Conclusion: Subimage Overlap Prediction is an effective SSL approach for remote sensing segmentation that reduces pretraining data requirements while maintaining or improving performance.

Abstract: Self-supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self-supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly lesser pretraining imagery. Given an image, a sub-image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub-image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly lesser pretraining data relative to other SSL methods. Code and model weights are provided at \href{https://github.com/sharmalakshay93/subimage-overlap-prediction}{github.com/sharmalakshay93/subimage-overlap-prediction}.

[247] VerLM: Explaining Face Verification Using Natural Language

Syed Abdul Hannan, Hazim Bukhari, Thomas Cantalapiedra, Eman Ansar, Massa Baali, Rita Singh, Bhiksha Raj

Main category: cs.CV

TL;DR: A vision-language model for face verification that provides both concise and comprehensive explanations for its decisions, achieving superior accuracy and interpretability through cross-modal adaptation.

Details

Motivation: Face verification systems lack transparency in decision-making processes, creating a need for models that can both accurately verify faces and explain their reasoning to build trust and reliability.

Method: Adapts and enhances a state-of-the-art modeling approach originally designed for audio-based differentiation to visual inputs. The VLM is trained using two complementary explanation styles: (1) concise explanations summarizing key decision factors, and (2) comprehensive explanations detailing specific differences between images.

Result: Demonstrates superior performance, surpassing baseline methods and existing models in both accuracy and interpretability. The cross-modal transfer significantly improves model accuracy and explainability.

Conclusion: Vision-language models have immense potential in face verification, contributing to more transparent, reliable, and explainable systems by integrating sophisticated feature extraction with advanced reasoning capabilities.

Abstract: Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model’s accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision language models in face verification set up, contributing to more transparent, reliable, and explainable face verification systems.

[248] Adaptive Hybrid Optimizer based Framework for Lumpy Skin Disease Identification

Ubaidullah, Muhammad Abid Hussain, Mohsin Raza Jafri, Rozi Khan, Moid Sandhu, Abd Ullah Khan, Hyundong Shin

Main category: cs.CV

TL;DR: LUMPNet is a hybrid deep learning system combining YOLOv11 for lesion detection and EfficientNet for classification, achieving 99% training and 98% validation accuracy for Lumpy Skin Disease detection.

Details

Motivation: Lumpy Skin Disease is a contagious viral infection threatening livestock health, global economy, and food security. Early and precise identification is crucial due to its rapid spread characteristics to prevent outbreaks and ensure timely intervention.

Method: LUMPNet uses a hybrid approach: YOLOv11 for detecting and localizing LSD skin nodules/lesions on cattle images, EfficientNet-based CNN classifier with compound scaling for classifying localized images into LSD-affected or healthy categories, and a novel adaptive hybrid optimizer to stabilize and accelerate training of the hybrid model.

Result: Achieves 99% LSD detection training accuracy and 98% validation accuracy, outperforming existing schemes. Also shows superior performance compared to an optimized EfficientNet-B0 model trained with AdamW optimizer in a case study.

Conclusion: LUMPNet provides an effective hybrid deep learning solution for early LSD detection with high accuracy, addressing the need for timely intervention against this economically significant livestock disease.

Abstract: Lumpy Skin Disease (LSD) is a contagious viral infection that significantly deteriorates livestock health, thereby posing a serious threat to the global economy and food security. Owing to its rapid spread characteristics, early and precise identification is crucial to prevent outbreaks and ensure timely intervention. In this paper, we propose a hybrid deep learning-based approach called LUMPNet for the early detection of LSD. LUMPNet utilizes image data to detect and classify skin nodules – the primary indicator of LSD. To this end, LUMPNet uses YOLOv11, EfficientNet-based CNN classifier with compound scaling, and a novel adaptive hybrid optimizer. More precisely, LUMPNet detects and localizes LSD skin nodules and lesions on cattle images. It exploits EfficientNet to classify the localized cattle images into LSD-affected or healthy categories. To stabilize and accelerate the training of YOLOv11 and EfficientNet hybrid model, a novel adaptive hybrid optimizer is proposed and utilized. We evaluate LUMPNet at various stages of LSD using a publicly available dataset. Results indicate that the proposed scheme achieves 99% LSD detection training accuracy, and outperforms existing schemes. The model also achieves validation accuracy of 98%. Moreover, for further evaluation, we conduct a case study using an optimized EfficientNet-B0 model trained with the AdamW optimizer, and compare its performance with LUMPNet. The results show that LUMPNet achieves superior performance.

[249] Causality-Aware Temporal Projection for Video Understanding in Video-LLMs

Zhengjian Kang, Qi Chen, Rui Liu, Kangtong Mo, Xingyu Zhang, Xiaoyu Deng, Ye Zhang

Main category: cs.CV

TL;DR: V-CORE introduces explicit temporal ordering constraints for Video-LLMs using learnable spatial aggregation and causality-aware temporal projection to improve video understanding requiring temporal and causal reasoning.

Details

Motivation: Current Video-LLMs struggle with tasks requiring consistent temporal ordering and causal coherence because they use unconstrained bidirectional projectors that blur temporal relationships by allowing later frames to influence earlier representations.

Method: V-CORE framework with two components: (1) Learnable Spatial Aggregation (LSA) adaptively selects salient spatial tokens to reduce redundancy, and (2) Causality-Aware Temporal Projector (CATP) enforces unidirectional information flow via block-causal attention and a terminal dynamic summary token as a causal sink.

Result: Achieves 61.2% accuracy on challenging NExT-QA benchmark, remains competitive on MSVD-QA, MSRVTT-QA, and TGIF-QA, with significant gains in temporal (+3.5%) and causal reasoning (+5.2%) subcategories.

Conclusion: Explicit temporal ordering constraints are crucial for video understanding, and V-CORE demonstrates that parameter-efficient frameworks with causality-aware design can significantly improve temporal and causal reasoning in Video-LLMs.

Abstract: Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.

[250] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

Sungjune Park, Hongda Mao, Qingshuang Chen, Yong Man Ro, Yelin Kim

Main category: cs.CV

TL;DR: A language-guided scene context-aware learning framework for egocentric visual attention prediction that uses language descriptions to generate context-aware video representations and focuses on relevant regions while suppressing distractions.

Details

Motivation: Egocentric visual attention prediction is challenging due to complexity and ambiguity of dynamic egocentric scenes. Scene contextual information plays crucial role in modulating human attention, motivating a context-aware approach.

Method: Language-guided scene context-aware learning framework with: 1) Context perceiver guided by language-based scene descriptions to generate context-aware video representations, 2) Two training objectives: focus on target point-of-interest regions and suppress distractions from irrelevant regions.

Result: Achieves state-of-the-art performance on Ego4D and Aria Everyday Activities (AEA) datasets, demonstrating effectiveness and enhanced robustness across diverse, dynamic egocentric scenarios.

Conclusion: The proposed language-guided scene context-aware learning framework effectively addresses challenges in egocentric visual attention prediction by leveraging scene context information and language guidance for robust performance.

Abstract: As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.

[251] RSwinV2-MD: An Enhanced Residual SwinV2 Transformer for Monkeypox Detection from Skin Images

Rashid Iqbal, Saddam Hussain Khan

Main category: cs.CV

TL;DR: Customized Residual SwinTransformerV2 (RSwinV2) achieves 96.21% accuracy for Mpox diagnosis by combining hierarchical transformer architecture with Inverse Residual Blocks to capture both global and local patterns in skin lesion images.

Details

Motivation: To develop an improved deep learning approach for Mpox diagnosis that addresses limitations of standard CNN models and SwinTransformers in handling lesion classification, particularly in distinguishing Mpox from similar conditions like chickenpox, measles, and cowpox.

Method: RSwinV2 customizes hierarchical transformer architecture with patch splitting, shifted window attention, and patch/position embeddings. It incorporates Inverse Residual Blocks (IRB) with convolutional skip connections to address vanishing gradient issues and enable both global and local pattern linking.

Result: Achieved 96.21% accuracy and 95.62 F1-score on Kaggle public dataset, outperforming standard CNN models and SwinTransformers. The method effectively minimized Mpox variability while increasing differences between Mpox and similar diseases.

Conclusion: RSwinV2 proves to be a valuable computer-assisted tool for Mpox lesion interpretation, successfully combining transformer global-linking capabilities with local pattern recognition through IRB integration for enhanced diagnostic performance.

Abstract: In this paper, a deep learning approach for Mpox diagnosis named Customized Residual SwinTransformerV2 (RSwinV2) has been proposed, trying to enhance the capability of lesion classification by employing the RSwinV2 tool-assisted vision approach. In the RSwinV2 method, a hierarchical structure of the transformer has been customized based on the input dimensionality, embedding structure, and output targeted by the method. In this RSwinV2 approach, the input image has been split into non-overlapping patches and processed using shifted windows and attention in these patches. This process has helped the method link all the windows efficiently by avoiding the locality issues of non-overlapping regions in attention, while being computationally efficient. RSwinV2 has further developed based on SwinTransformer and has included patch and position embeddings to take advantage of the transformer global-linking capability by employing multi-head attention in these embeddings. Furthermore, RSwinV2 has developed and incorporated the Inverse Residual Block (IRB) into this method, which utilizes convolutional skip connections with these inclusive designs to address the vanishing gradient issues during processing. RSwinV2 inclusion of IRB has therefore facilitated this method to link global patterns as well as local patterns; hence, its integrity has helped improve lesion classification capability by minimizing variability of Mpox and increasing differences of Mpox, chickenpox, measles, and cowpox. In testing SwinV2, its accuracy of 96.21 and an F1score of 95.62 have been achieved on the Kaggle public dataset, which has outperformed standard CNN models and SwinTransformers; RSwinV2 vector has thus proved its valiance as a computer-assisted tool for Mpox lesion observation interpretation.

[252] ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting

Chuhang Ma, Shuai Tan, Ye Pan, Jiaolong Yang, Xin Tong

Main category: cs.CV

TL;DR: ESGaussianFace: An efficient 3D Gaussian Splatting framework for emotional and stylized audio-driven facial animation that generates high-quality, 3D-consistent talking head videos with accurate lip movements, emotional expressions, and style features.

Details

Motivation: Most current audio-driven facial animation focuses on neutral emotions, and while some address emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge.

Method: Uses 3D Gaussian Splatting for 3D scene reconstruction and video rendering; proposes emotion-audio-guided spatial attention to integrate emotion features with audio content; introduces two 3D Gaussian deformation predictors for emotional and stylized deformations; employs multi-stage training strategy for step-by-step learning of lip movements, emotional variations, and style features.

Result: The method generates results with high efficiency, high quality, and 3D consistency; extensive experiments show it outperforms state-of-the-art techniques in lip movement accuracy, expression variation, and style feature expressiveness.

Conclusion: ESGaussianFace successfully addresses the challenge of generating emotional and stylized audio-driven facial animation with 3D consistency, offering superior performance compared to existing methods across multiple evaluation metrics.

Abstract: Most current audio-driven facial animation research primarily focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points through emotion and style features, we introduce two 3D Gaussian deformation predictors. Futhermore, we propose a multi-stage training strategy, enabling the step-by-step learning of the character’s lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip movement accuracy, expression variation, and style feature expressiveness.

[253] GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection

Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae

Main category: cs.CV

TL;DR: GCR is a geometry-consistent routing framework for task-agnostic continual anomaly detection that stabilizes expert selection by routing in shared embedding space, avoiding cross-head score comparability issues.

Details

Motivation: Practical anomaly detection deployments increasingly require task-agnostic operation under continual category expansion, where existing routing methods fail due to unreliable cross-head score comparisons as score distributions differ substantially across categories.

Method: GCR uses geometry-consistent routing in a shared frozen patch-embedding space, minimizing nearest-prototype distances to category-specific prototype banks for expert selection, then computes anomaly maps only within the routed expert using standard prototype-based scoring.

Result: Experiments on MVTec AD and VisA show substantial improvement in routing stability, mitigation of continual performance collapse, near-zero forgetting while maintaining competitive detection and localization performance.

Conclusion: Many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing, and GCR’s geometry-consistent routing provides an effective solution for task-agnostic continual anomaly detection.

Abstract: Feature-based anomaly detection is widely adopted in industrial inspection due to the strong representational power of large pre-trained vision encoders. While most existing methods focus on improving within-category anomaly scoring, practical deployments increasingly require task-agnostic operation under continual category expansion, where the category identity is unknown at test time. In this setting, overall performance is often dominated by expert selection, namely routing an input to an appropriate normality model before any head-specific scoring is applied. However, routing rules that compare head-specific anomaly scores across independently constructed heads are unreliable in practice, as score distributions can differ substantially across categories in scale and tail behavior. We propose GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection through geometry-consistent routing. GCR routes each test image directly in a shared frozen patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks, and then computes anomaly maps only within the routed expert using a standard prototype-based scoring rule. By separating cross-head decision making from within-head anomaly scoring, GCR avoids cross-head score comparability issues without requiring end-to-end representation learning. Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive detection and localization performance. These results indicate that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing. Code is available at https://github.com/jw-chae/GCR

[254] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan

Main category: cs.CV

TL;DR: CogFlow is a cognitive-inspired three-stage framework for visual mathematical problem solving that addresses the gap between visual perception and reasoning through knowledge internalization, synergistic visual rewards, and visual-gated policy optimization.

Details

Motivation: Current multimodal LLMs struggle with visual mathematical problem solving because they focus only on improving visual extraction but ignore whether extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning.

Method: A three-stage framework (perception→internalization→reasoning) with: 1) Synergistic Visual Rewards to boost perception in parametric and semantic spaces, 2) Knowledge Internalization Reward model to bridge perception and reasoning, and 3) Visual-Gated Policy Optimization to enforce visual grounding in reasoning chains.

Result: Comprehensive experiments on visual mathematical reasoning benchmarks validate the superiority of CogFlow, supported by a new dataset MathCog with 120K+ high-quality perception-reasoning aligned annotations.

Conclusion: CogFlow effectively addresses the visual mathematical reasoning bottleneck by simulating human cognitive flow and ensuring faithful integration of visual cues into reasoning, outperforming existing approaches.

Abstract: Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception$\Rightarrow$internalization$\Rightarrow$reasoning. Inline with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model in the internalization stage, bridging perception and reasoning. Moreover, we design a Visual-Gated Policy Optimization algorithm to further enforce the reasoning is grounded with the visual knowledge, preventing models seeking shortcuts that appear coherent but are visually ungrounded reasoning chains. Moreover, we contribute a new dataset MathCog for model training, which contains samples with over 120K high-quality perception-reasoning aligned annotations. Comprehensive experiments and analysis on commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.

[255] RRNet: Configurable Real-Time Video Enhancement with Arbitrary Local Lighting Variations

Wenlong Yang, Canran Jin, Weihang Yuan, Chao Wang, Lifeng Sun

Main category: cs.CV

TL;DR: RRNet is a lightweight real-time video enhancement framework that uses virtual light sources and depth-aware rendering for localized relighting without pixel-aligned training data, achieving state-of-the-art balance between quality and efficiency.

Details

Motivation: Existing methods struggle to balance speed and effective exposure control in real-time video enhancement, particularly under uneven lighting conditions. There's growing demand for practical solutions in live applications like video conferencing and mobile photography.

Method: RRNet uses a lightweight framework with a streamlined encoder and prediction head to estimate parameters for virtual light sources. It employs a depth-aware rendering module for localized relighting without requiring pixel-aligned training data. The method includes a generative AI-based dataset creation pipeline to synthesize diverse lighting conditions for training.

Result: RRNet consistently outperforms prior methods in low-light enhancement, localized illumination adjustment, and glare removal. It achieves state-of-the-art tradeoff between visual quality and efficiency while preserving facial identity and supporting real-time, high-resolution performance.

Conclusion: RRNet’s interpretable lighting control and efficient architecture make it well-suited for practical applications like video conferencing, AR-based portrait enhancement, and mobile photography, offering a superior solution for real-time video enhancement under challenging lighting conditions.

Abstract: With the growing demand for real-time video enhancement in live applications, existing methods often struggle to balance speed and effective exposure control, particularly under uneven lighting. We introduce RRNet (Rendering Relighting Network), a lightweight and configurable framework that achieves a state-of-the-art tradeoff between visual quality and efficiency. By estimating parameters for a minimal set of virtual light sources, RRNet enables localized relighting through a depth-aware rendering module without requiring pixel-aligned training data. This object-aware formulation preserves facial identity and supports real-time, high-resolution performance using a streamlined encoder and lightweight prediction head. To facilitate training, we propose a generative AI-based dataset creation pipeline that synthesizes diverse lighting conditions at low cost. With its interpretable lighting control and efficient architecture, RRNet is well suited for practical applications such as video conferencing, AR-based portrait enhancement, and mobile photography. Experiments show that RRNet consistently outperforms prior methods in low-light enhancement, localized illumination adjustment, and glare removal.

[256] Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion

Wenyu Shao, Hongbo Liu, Yunchuan Ma, Ruili Wang

Main category: cs.CV

TL;DR: EGMT: Entity-guided multi-task learning for infrared-visible image fusion using entity-level text from captions to reduce semantic noise and improve fusion quality.

Details

Motivation: Existing text-driven fusion methods use sentence-level text, causing semantic noise from redundant information and failing to fully exploit deeper semantic value of textual information.

Method: Three key innovations: 1) Entity extraction from image captions using large vision-language models, 2) Parallel multi-task learning architecture combining image fusion with entity-based multi-label classification, 3) Entity-guided cross-modal interactive module for fine-grained visual-text feature interaction.

Result: EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency compared to state-of-the-art methods. Four entity-annotated datasets (TNO, RoadScene, M3FD, MSRS) released.

Conclusion: The entity-guided approach effectively reduces semantic noise, enables deeper semantic understanding, and improves fusion quality through cross-modal interaction between visual features and entity-level textual information.

Abstract: Existing text-driven infrared and visible image fusion approaches often rely on textual information at the sentence level, which can lead to semantic noise from redundant text and fail to fully exploit the deeper semantic value of textual information. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach includes three key innovative components: (i) A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models, eliminating semantic noise from raw text while preserving critical semantic information; (ii) A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. By using entities as pseudo-labels, the multi-label classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; (iii) An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features, which enhances feature representation by capturing cross-modal dependencies at both inter-visual and visual-entity levels. To promote the wide application of the entity-guided image fusion framework, we release the entity-annotated version of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency, compared to the state-of-the-art methods. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.

[257] Nodule-DETR: A Novel DETR Architecture with Frequency-Channel Attention for Ultrasound Thyroid Nodule Detection

Jingjing Wang, Qianglin Liu, Zhuo Xiao, Xinning Yao, Bo Liu, Lu Li, Lijuan Niu, Fugen Zhou

Main category: cs.CV

TL;DR: Nodule-DETR: A novel detection transformer architecture for thyroid nodule detection in ultrasound images, achieving state-of-the-art performance with three key innovations for handling low-contrast and irregular nodules.

Details

Motivation: Thyroid cancer incidence is rising globally, and while ultrasound is the preferred imaging modality, its diagnostic accuracy is limited by low image contrast and blurred nodule boundaries, necessitating better detection methods.

Method: Proposes Nodule-DETR with three innovations: 1) Multi-Spectral Frequency-domain Channel Attention (MSFCA) for enhancing low-contrast nodule features using frequency analysis; 2) Hierarchical Feature Fusion (HFF) for efficient multi-scale integration; 3) Multi-Scale Deformable Attention (MSDA) to capture small and irregularly shaped nodules.

Result: Achieves state-of-the-art performance on clinical thyroid ultrasound dataset, outperforming baseline by 0.149 in mAP@0.5:0.95, demonstrating superior accuracy for clinical application.

Conclusion: Nodule-DETR shows significant potential as an effective tool for computer-aided thyroid diagnosis, with code publicly available for further research and clinical implementation.

Abstract: Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule-DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule-DETR introduces three key innovations: a Multi-Spectral Frequency-domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low-contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi-scale integration; and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real-world thyroid ultrasound images. The results demonstrate that Nodule-DETR achieves state-of-the-art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule-DETR highlights its significant potential for clinical application as an effective tool in computer-aided thyroid diagnosis. The code of work is available at https://github.com/wjj1wjj/Nodule-DETR.

[258] Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems

Niloufar Alipour Talemi, Julia Boone, Fatemeh Afghah

Main category: cs.CV

TL;DR: This survey paper provides the first comprehensive review of agentic AI in remote sensing, analyzing the shift from static deep learning models to autonomous AI agents capable of sequential planning and tool orchestration for complex geospatial workflows.

Details

Motivation: The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Current vision foundation models and multimodal large language models advance representation learning but lack the sequential planning and active tool orchestration required for complex geospatial workflows.

Method: The survey introduces a unified taxonomy distinguishing between single-agent copilots and multi-agent systems, while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. It also reviews emerging benchmarks that move evaluation from pixel-level accuracy to trajectory-aware reasoning correctness.

Result: The paper presents the first comprehensive review of agentic AI in remote sensing, establishing a taxonomy and analyzing key architectural components. It critically examines limitations in grounding, safety, and orchestration.

Conclusion: This work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence by addressing current limitations and providing a framework for advancing agentic AI in remote sensing applications.

Abstract: The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.

[259] Forget Less by Learning from Parents Through Hierarchical Relationships

Arjun Ramesh Kaushik, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Nalini K. Ratha, Venu Govindaraju

Main category: cs.CV

TL;DR: FLLP introduces a parent-child learning mechanism in hyperbolic space to prevent catastrophic forgetting in custom diffusion models during sequential concept learning.

Details

Motivation: Custom Diffusion Models suffer from catastrophic forgetting when learning new concepts sequentially, and existing approaches focus too much on minimizing interference while ignoring potential positive inter-concept interactions.

Method: FLLP uses a parent-child inter-concept learning mechanism in hyperbolic space (Lorentzian manifold), where previously learned concepts serve as guidance for adapting to new ones, naturally modeling tree-like hierarchies.

Result: The method shows consistent improvements in both robustness and generalization across three public datasets and one synthetic benchmark.

Conclusion: FLLP effectively mitigates forgetting in sequential concept learning for custom diffusion models by leveraging hyperbolic geometry to preserve prior knowledge while integrating new concepts.

Abstract: Custom Diffusion Models (CDMs) offer impressive capabilities for personalization in generative modeling, yet they remain vulnerable to catastrophic forgetting when learning new concepts sequentially. Existing approaches primarily focus on minimizing interference between concepts, often neglecting the potential for positive inter-concept interactions. In this work, we present Forget Less by Learning from Parents (FLLP), a novel framework that introduces a parent-child inter-concept learning mechanism in hyperbolic space to mitigate forgetting. By embedding concept representations within a Lorentzian manifold, naturally suited to modeling tree-like hierarchies, we define parent-child relationships in which previously learned concepts serve as guidance for adapting to new ones. Our method not only preserves prior knowledge but also supports continual integration of new concepts. We validate FLLP on three public datasets and one synthetic benchmark, showing consistent improvements in both robustness and generalization.

[260] Learning Action Hierarchies via Hybrid Geometric Diffusion

Arjun Ramesh Kaushik, Nalini K. Ratha, Venu Govindaraju

Main category: cs.CV

TL;DR: HybridTAS: A diffusion-based temporal action segmentation framework that combines Euclidean and hyperbolic geometries to exploit hierarchical action structure through coarse-to-fine denoising.

Details

Motivation: Existing iterative refinement methods for temporal action segmentation fail to explicitly utilize the hierarchical nature of human actions, which naturally have tree-like relationships between abstract and fine-grained categories.

Method: Proposes HybridTAS framework incorporating both Euclidean and hyperbolic geometries into diffusion model denoising. Hyperbolic geometry provides tree-like embedding relationships, enabling coarse-to-fine guidance: higher diffusion timesteps use abstract action categories (root nodes), while lower timesteps refine with fine-grained classes (leaf nodes).

Result: Achieves state-of-the-art performance on three benchmark datasets: GTEA, 50Salads, and Breakfast, validating the effectiveness of hyperbolic-guided denoising for temporal action segmentation.

Conclusion: Incorporating hyperbolic geometry into diffusion models effectively captures hierarchical action structure, enabling improved temporal action segmentation through coarse-to-fine denoising guidance.

Abstract: Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS - a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.

[261] TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing

Yujie Hu, Zecheng Tang, Xu Jiang, Weiqi Li, Jian Zhang

Main category: cs.CV

TL;DR: TalkPhoto: A training-free conversational image editing framework that uses LLMs to analyze user instructions and hierarchically invoke existing editing methods without additional training.

Details

Motivation: Existing MLLM-based image editing methods require building multi-instruction datasets for training, which is time-consuming, labor-intensive, and often fails to achieve satisfactory results. There's a need for a more efficient, flexible approach that doesn't require additional training.

Method: TalkPhoto uses a training-free framework where an open-source LLM analyzes user instructions via specially designed prompt templates, then hierarchically invokes existing advanced editing methods. It features plug-and-play invocation of editing methods, allowing integration of complex and unseen editing tasks without retraining.

Result: Extensive experiments show the method provides more accurate invocation with fewer token consumption and achieves higher editing quality across various image editing tasks compared to existing approaches.

Conclusion: TalkPhoto demonstrates that training-free conversational image editing is feasible and effective, offering precise manipulation through conversational interaction while maintaining flexibility and high-quality results without the need for dataset construction or model retraining.

Abstract: Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with fewer token consumption but also achieves higher editing quality across various image editing tasks.

[262] AR-MOT: Autoregressive Multi-object Tracking

Lianjie Jia, Yuhan Wu, Binghao Ran, Yifan Wang, Lijun Wang, Huchuan Lu

Main category: cs.CV

TL;DR: AR-MOT is an autoregressive multi-object tracking framework using LLMs to generate tracking sequences, eliminating task-specific heads and enabling flexible adaptation to diverse tracking scenarios.

Details

Motivation: Existing MOT methods have rigid, task-specific architectures that limit applicability across diverse tasks and flexibility in adapting to new tracking formulations, especially in general and multi-modal scenarios.

Method: Formulates MOT as sequence generation within LLM framework, introduces Object Tokenizer for region-level perception, Region-Aware Alignment module to mitigate feature misalignment, and Temporal Memory Fusion module for long-term tracking by caching historical object tokens.

Result: Achieves performance comparable to state-of-the-art methods on MOT17 and DanceTrack benchmarks, validating feasibility while enabling extensibility to new modalities/instructions through sequence format modification.

Conclusion: AR-MOT provides a foundation for more general and flexible MOT systems by eliminating task-specific architectures and enabling easy integration of new modalities/instructions through sequence format modifications.

Abstract: As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.

[263] VIT-Ped: Visionary Intention Transformer for Pedestrian Behavior Analysis

Aly R. Elkammar, Karim M. Gamaleldin, Catherine M. Elias

Main category: cs.CV

TL;DR: Transformer-based pedestrian intention prediction algorithm achieves state-of-the-art performance on JAAD dataset using multiple data modalities.

Details

Motivation: Pedestrian intention prediction is crucial for advancing from level 3 to level 4 autonomous driving, requiring comprehensive analysis of pedestrian crossing behavior to improve road safety.

Method: Developed transformer/video vision transformer based algorithms of different sizes that utilize multiple data modalities for pedestrian intention prediction.

Result: Achieved state-of-the-art performance on JAAD dataset, surpassing previous methods in Accuracy, AUC, and F1-score metrics.

Conclusion: The transformer-based approach effectively predicts pedestrian intentions, with ablation studies validating the benefits of different model design choices for autonomous driving safety applications.

Abstract: Pedestrian Intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features should be taken into consideration to make the roads of tomorrow safer for everybody. We introduce a transformer / video vision transformer based algorithm of different sizes which uses different data modalities .We evaluated our algorithms on popular pedestrian behaviour dataset, JAAD, and have reached SOTA performance and passed the SOTA in metrics like Accuracy, AUC and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.

[264] MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering

Zhifei Li, Yiran Wang, Chenyi Xiong, Yujing Xia, Xiaoju Hou, Yue Zhao, Miao Zhang, Kui Xiao, Bing Yang

Main category: cs.CV

TL;DR: MacVQA is a novel continual learning framework for VQA that uses adaptive memory allocation and global noise filtering to balance knowledge retention, adaptation, and robust feature representation.

Details

Motivation: Current continual learning methods for VQA struggle with balancing knowledge retention, adaptation to new information, and robust feature representation, creating challenges for effective continual VQA learning.

Method: Proposes MacVQA with two key components: 1) fusion of visual and question information with global noise filtering for robust representations, and 2) prototype-based memory allocation to optimize feature quality and memory usage.

Result: Outperforms existing baselines on ten continual VQA tasks, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.

Conclusion: MacVQA effectively balances knowledge acquisition, retention, and compositional generalization in continual VQA learning through its adaptive memory allocation and noise filtering mechanisms.

Abstract: Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle with balancing knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose a novel framework with adaptive memory allocation and global noise filtering called MacVQA for visual question answering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.

[265] Enhancing Object Detection with Privileged Information: A Model-Agnostic Teacher-Student Approach

Matthias Bartolo, Dylan Seychell, Gabriel Hili, Matthew Montebello, Carl James Debono, Saviour Formosa, Konstantinos Makantasis

Main category: cs.CV

TL;DR: LUPI paradigm integrated into object detection using teacher-student architecture to leverage privileged information (masks, saliency, depth) available only during training, improving accuracy without increasing inference complexity.

Details

Motivation: To exploit fine-grained descriptive information available during training but not at inference time, enhancing object detection performance without increasing model complexity or inference costs.

Method: Model-agnostic teacher-student architecture that injects privileged information (bounding box masks, saliency maps, depth cues) into object detectors during training, with intermediate weighting of teacher guidance.

Result: LUPI-trained detectors consistently outperform baselines with significant accuracy boosts, especially for medium/large objects, with no inference complexity increase. Tested on 5 SOTA models across multiple benchmarks including UAV litter detection and Pascal VOC.

Conclusion: LUPI framework provides effective, practical strategy for advancing object detection in resource-constrained and real-world settings by leveraging privileged information during training without inference overhead.

Abstract: This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine-grained, descriptive information available during training but not at inference. We introduce a general, model-agnostic methodology for injecting privileged information-such as bounding box masks, saliency maps, and depth cues-into deep learning-based object detectors through a teacher-student architecture. Experiments are conducted across five state-of-the-art object detection models and multiple public benchmarks, including UAV-based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI-trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource-constrained and real-world settings.

[266] Face Normal Estimation from Rags to Riches

Meng Wang, Wenjing Dai, Jiawan Zhang, Xiaojie Guo

Main category: cs.CV

TL;DR: A coarse-to-fine face normal estimation method that reduces dependency on large-scale paired training data by first generating coarse normals from a small dataset, then refining them using self-attention to capture long-range dependencies.

Details

Motivation: Current face normal estimation methods require large-scale paired data for training, which is resource-intensive. The paper aims to reduce this dependency by developing a more efficient training approach.

Method: Two-stage approach: 1) Train a neat model on small dataset to produce coarse face normals as exemplars, 2) Use refinement network with self-attention mechanism to capture long-range dependencies and refine coarse normals into high-quality facial normals.

Result: The method significantly reduces training data requirements and computational resources while achieving superior performance over state-of-the-art methods in both training expense and estimation quality.

Conclusion: The coarse-to-fine approach with logical function split effectively reduces dependency on massive paired data and computational resources while maintaining high-quality face normal estimation, with code and models open-sourced.

Abstract: Although recent approaches to face normal estimation have achieved promising results, their effectiveness heavily depends on large-scale paired data for training. This paper concentrates on relieving this requirement via developing a coarse-to-fine normal estimator. Concretely, our method first trains a neat model from a small dataset to produce coarse face normals that perform as guidance (called exemplars) for the following refinement. A self-attention mechanism is employed to capture long-range dependencies, thus remedying severe local artifacts left in estimated coarse facial normals. Then, a refinement network is customized for the sake of mapping input face images together with corresponding exemplars to fine-grained high-quality facial normals. Such a logical function split can significantly cut the requirement of massive paired data and computational resource. Extensive experiments and ablation studies are conducted to demonstrate the efficacy of our design and reveal its superiority over state-of-the-art methods in terms of both training expense as well as estimation quality. Our code and models are open-sourced at: https://github.com/AutoHDR/FNR2R.git.

[267] MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization

Zhexin Zhang, Yifeng Zhu, Yangyang Xu, Long Chen, Yong Du, Shengfeng He, Jun Yu

Main category: cs.CV

TL;DR: MotionAdapter is a framework for transferring complex motions between videos using diffusion-based text-to-video models, achieving robust motion transfer through explicit motion-appearance disentanglement and adaptive motion customization.

Details

Motivation: While diffusion-based text-to-video models have made progress in generating high-quality videos, transferring complex motions between videos remains challenging. Existing methods struggle with robust and semantically aligned motion transfer.

Method: MotionAdapter uses a two-stage approach: 1) isolates motion by analyzing cross-frame attention in 3D full-attention modules to extract attention-derived motion fields, and 2) introduces a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences between reference and target videos. The customized motion field then guides the DiT denoising process.

Result: Extensive experiments show MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. The framework naturally supports complex motion transfer and motion editing tasks like zooming.

Conclusion: MotionAdapter provides an effective solution for motion transfer in diffusion-based video generation by explicitly disentangling motion from appearance and adaptively customizing motion to target content, enabling robust and semantically aligned motion transfer.

Abstract: Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires \romannumeral1) explicit disentanglement of motion from appearance and \romannumeral 2) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.

[268] Agentic Retoucher for Text-To-Image Generation

Shaocheng Shen, Jianfeng Liang. Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

Main category: cs.CV

TL;DR: Agentic Retoucher is a hierarchical decision-driven framework that reformulates post-generation correction of text-to-image models as a perception-reasoning-action loop, outperforming existing methods in perceptual quality and distortion localization.

Details

Motivation: Current text-to-image diffusion models like SDXL and FLUX suffer from pervasive small-scale distortions in limbs, face, text, etc. Existing refinement approaches either require costly iterative re-generation or rely on vision-language models with weak spatial grounding, leading to semantic drift and unreliable local edits.

Method: Proposes a hierarchical decision-driven framework with three agents: (1) perception agent that learns contextual saliency for fine-grained distortion localization using text-image consistency cues, (2) reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) action agent that adaptively plans localized inpainting guided by user preference. Also introduces GenBlemish-27K dataset with 6K T2I images and 27K annotated artifact regions across 12 categories for fine-grained supervision.

Result: Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization, and human preference alignment.

Conclusion: Agentic Retoucher establishes a new paradigm for self-corrective and perceptually reliable text-to-image generation by integrating perceptual evidence, linguistic reasoning, and controllable correction into a unified decision process.

Abstract: Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.

[269] AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing

Tianbo Wang, Yuqing Ma, Kewei Liao, Zhange Zhang, Simin Li, Jinyang Guo, Xianglong Liu

Main category: cs.CV

TL;DR: AFTER is a novel method that uses factual textual semantics to guide activation editing in LVLMs, reducing object hallucination by up to 16.3% through adaptive visual-textual alignment.

Details

Motivation: LVLMs suffer from object hallucination (category, attribute, relation) due to language bias, which hinders trustworthy AI applications. Existing editing approaches fail to leverage factual textual semantics to explicitly mitigate this bias.

Method: AFTER consists of two components: 1) Factual-Augmented Activation Steering (FAS) - provides factual guidance for activation editing by modeling precise visual-textual associations; 2) Query-Adaptive Offset Optimization (QAO) - introduces query-aware offset estimator for query-specific editing from general steering vectors, enhancing editing diversity and granularity.

Result: Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs show significant improvements, achieving up to 16.3% reduction in hallucination over baseline on the AMBER benchmark.

Conclusion: AFTER effectively mitigates object hallucination in LVLMs by adaptively guiding biased activations toward factual semantics through factual-guided visual-textual editing, demonstrating superior performance over previous approaches.

Abstract: Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding the trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guides the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over baseline on the AMBER benchmark. Our code and data will be released for reproducibility.

[270] Forget Less by Learning Together through Concept Consolidation

Arjun Ramesh Kaushik, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Nalini Ratha, Venu Govindaraju

Main category: cs.CV

TL;DR: FL2T framework enables concurrent, order-agnostic concept learning in custom diffusion models while mitigating catastrophic forgetting through inter-concept guidance.

Details

Motivation: Existing custom diffusion models suffer from catastrophic forgetting when learning new concepts sequentially, and prior works neglect inter-concept interactions and assume fixed concept order.

Method: Proposes FL2T framework with set-invariant inter-concept learning module where proxies guide feature selection across concepts, enabling concurrent and order-agnostic concept learning while preserving old knowledge.

Result: Extensive experiments across three datasets show significant improvement in concept retention, mitigating catastrophic forgetting with at least 2% average gain in CLIP Image Alignment scores across ten tasks.

Conclusion: The FL2T framework effectively addresses catastrophic forgetting in custom diffusion models by leveraging inter-concept catalytic behavior, enabling more robust incremental concept learning.

Abstract: Custom Diffusion Models (CDMs) have gained significant attention due to their remarkable ability to personalize generative processes. However, existing CDMs suffer from catastrophic forgetting when continuously learning new concepts. Most prior works attempt to mitigate this issue under the sequential learning setting with a fixed order of concept inflow and neglect inter-concept interactions. In this paper, we propose a novel framework - Forget Less by Learning Together (FL2T) - that enables concurrent and order-agnostic concept learning while addressing catastrophic forgetting. Specifically, we introduce a set-invariant inter-concept learning module where proxies guide feature selection across concepts, facilitating improved knowledge retention and transfer. By leveraging inter-concept guidance, our approach preserves old concepts while efficiently incorporating new ones. Extensive experiments, across three datasets, demonstrates that our method significantly improves concept retention and mitigates catastrophic forgetting, highlighting the effectiveness of inter-concept catalytic behavior in incremental concept learning of ten tasks with at least 2% gain on average CLIP Image Alignment scores.

[271] Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

Main category: cs.CV

TL;DR: This paper introduces a novel approach to enhance spatial reasoning in vision-language models by integrating object-centric blueprints, using supervised fine-tuning, reinforcement learning with blueprint-aware rewards, and anti-shortcut data augmentation.

Details

Motivation: Existing approaches to spatial reasoning in VLMs either focus too much on local patches (weakening global awareness) or mark isolated coordinates (overlooking overall organization). The authors aim to bridge this gap by incorporating cognitive concepts of object-centric representations to improve spatial semantic understanding.

Method: The method integrates object-centric blueprints into VLMs through three key techniques: 1) blueprint-embedded reasoning traces for supervised fine-tuning to teach basic reasoning skills, 2) blueprint-aware rewards in reinforcement learning to ensure appropriate object inclusion and causal alignment, and 3) anti-shortcut data augmentation with targeted perturbations to prevent reliance on superficial cues.

Result: Experiments demonstrate that the proposed method consistently outperforms both existing general VLMs and specialized spatial reasoning models, showing improved spatial reasoning capabilities.

Conclusion: Integrating object-centric blueprints into VLMs effectively enhances spatial reasoning by providing structured representations that capture both local details and global organization, advancing VLMs from visual perception toward true spatial semantic understanding.

Abstract: Spatial reasoning – the ability to perceive and reason about relationships in space – advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.

[272] API: Empowering Generalizable Real-World Image Dehazing via Adaptive Patch Importance Learning

Chen Zhu, Huiwen Zhang, Yujie Li, Mu He, Xiaotian Qiao

Main category: cs.CV

TL;DR: API framework for real-world image dehazing with adaptive patch importance, hybrid data augmentation, and multi-negative contrastive loss achieves SOTA performance across diverse real-world benchmarks.

Details

Motivation: Existing learning-based dehazing methods struggle with real-world complex hazy scenes due to limited training data and complex haze density distributions, leading to performance degradation.

Method: Proposes API framework with: 1) Automatic Haze Generation (AHG) module for hybrid data augmentation, 2) Density-aware Haze Removal (DHR) module for adaptive patch importance-aware dehazing, and 3) Multi-Negative Contrastive Dehazing (MNCD) loss using spatial and frequency domain information.

Result: Achieves state-of-the-art performance across multiple real-world benchmarks, with strong quantitative metrics and qualitative visual quality, demonstrating robust generalization across diverse haze distributions.

Conclusion: The proposed API framework effectively addresses real-world dehazing challenges through adaptive patch importance, realistic data augmentation, and multi-domain contrastive learning, enabling superior generalization to complex haze scenarios.

Abstract: Real-world image dehazing is a fundamental yet challenging task in low-level vision. Existing learning-based methods often suffer from significant performance degradation when applied to complex real-world hazy scenes, primarily due to limited training data and the intrinsic complexity of haze density distributions.To address these challenges, we introduce a novel Adaptive Patch Importance-aware (API) framework for generalizable real-world image dehazing. Specifically, our framework consists of an Automatic Haze Generation (AHG) module and a Density-aware Haze Removal (DHR) module. AHG provides a hybrid data augmentation strategy by generating realistic and diverse hazy images as additional high-quality training data. DHR considers hazy regions with varying haze density distributions for generalizable real-world image dehazing in an adaptive patch importance-aware manner. To alleviate the ambiguity of the dehazed image details, we further introduce a new Multi-Negative Contrastive Dehazing (MNCD) loss, which fully utilizes information from multiple negative samples across both spatial and frequency domains. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple real-world benchmarks, delivering strong results in both quantitative metrics and qualitative visual quality, and exhibiting robust generalization across diverse haze distributions.

[273] Nighttime Hazy Image Enhancement via Progressively and Mutually Reinforcing Night-Haze Priors

Chen Zhu, Huiwen Zhang, Mu He, Yujie Li, Xiaotian Qiao

Main category: cs.CV

TL;DR: A novel framework for nighttime hazy image enhancement that mutually reinforces haze and low-light priors using multi-level experts across visual and frequency domains.

Details

Motivation: Nighttime hazy images suffer from complex degradation distributions where existing methods only address single degradation types (haze or low-light) separately, ignoring their interplay and resulting in limited visibility improvement.

Method: Proposes a framework that reinforces intrinsic consistency between haze and low-light priors using image-, patch-, and pixel-level experts operating across visual and frequency domains to recover global structure, regional patterns, and fine details progressively, with a frequency-aware router to adaptively guide expert contributions.

Result: Extensive experiments demonstrate superior performance on nighttime dehazing benchmarks both quantitatively and qualitatively, plus generalizability to daytime dehazing and low-light enhancement tasks.

Conclusion: The proposed method effectively enhances nighttime hazy images by mutually reinforcing domain knowledge between low-light and haze priors through a progressive multi-level expert framework with adaptive routing.

Abstract: Enhancing the visibility of nighttime hazy images is challenging due to the complex degradation distributions. Existing methods mainly address a single type of degradation (e.g., haze or low-light) at a time, ignoring the interplay of different degradation types and resulting in limited visibility improvement. We observe that the domain knowledge shared between low-light and haze priors can be reinforced mutually for better visibility. Based on this key insight, in this paper, we propose a novel framework that enhances visibility in nighttime hazy images by reinforcing the intrinsic consistency between haze and low-light priors mutually and progressively. In particular, our model utilizes image-, patch-, and pixel-level experts that operate across visual and frequency domains to recover global scene structure, regional patterns, and fine-grained details progressively. A frequency-aware router is further introduced to adaptively guide the contribution of each expert, ensuring robust image restoration. Extensive experiments demonstrate the superior performance of our model on nighttime dehazing benchmarks both quantitatively and qualitatively. Moreover, we showcase the generalizability of our model in daytime dehazing and low-light enhancement tasks.

[274] Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement

Guangqian Guo, Aixi Ren, Yong Guo, Xuehui Yu, Jiacheng Tian, Wenli Li, Yaoxing Wang, Shan Gao

Main category: cs.CV

TL;DR: GleSAM++ enhances Segment Anything Models (SAMs) for robust segmentation on low-quality images using generative latent space enhancement with degradation-aware adaptive reconstruction.

Details

Motivation: SAMs perform poorly on severely degraded, low-quality images, limiting their real-world applicability. There's a need to improve segmentation robustness across varying image qualities.

Method: Proposes GleSAM++ with three key components: 1) Generative Latent space Enhancement for robustness, 2) Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE) for compatibility, and 3) Degradation-aware Adaptive Enhancement (DAE) mechanism that decouples reconstruction into degradation-level prediction and degradation-aware reconstruction stages.

Result: Significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Performs well on unseen degradations with minimal additional parameters, applicable to both SAM and SAM2.

Conclusion: GleSAM++ provides an effective solution for enhancing SAMs’ performance on low-quality images through degradation-aware adaptive enhancement, enabling robust segmentation across diverse real-world scenarios with varying image qualities.

Abstract: Segment Anything Models (SAMs), known for their exceptional zero-shot segmentation performance, have garnered significant attention in the research community. Nevertheless, their performance drops significantly on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM++, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Additionally, to improve compatibility between the pre-trained diffusion model and the segmentation framework, we introduce two techniques, i.e., Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE). However, the above components lack explicit guidance regarding the degree of degradation. The model is forced to implicitly fit a complex noise distribution that spans conditions from mild noise to severe artifacts, which substantially increases the learning burden and leads to suboptimal reconstructions. To address this issue, we further introduce a Degradation-aware Adaptive Enhancement (DAE) mechanism. The key principle of DAE is to decouple the reconstruction process for arbitrary-quality features into two stages: degradation-level prediction and degradation-aware reconstruction. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. Extensive experiments demonstrate that GleSAM++ significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM++ also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

[275] Remote Sensing Change Detection via Weak Temporal Supervision

Xavier Bou, Elliot Vincent, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel, Thibaud Ehret

Main category: cs.CV

TL;DR: Weak temporal supervision for semantic change detection using existing single-temporal datasets without new annotations, achieving strong zero-shot and low-data performance.

Details

Motivation: Addresses the scarcity of annotated datasets for semantic change detection in remote sensing, where pixel-level annotation is costly and time-consuming. Existing methods using synthetic data or artificial change pairs have limited out-of-domain generalization.

Method: Extends single-date remote sensing datasets with new temporal observations, trains change detection model by assuming real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. Uses object-aware change map generation and iterative refinement to handle weak label noise.

Result: Validated on extended versions of FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Showcased scalability with results over large areas in France.

Conclusion: Proposes a practical weak supervision approach that leverages existing datasets without new annotations, demonstrating strong generalization and scalability for semantic change detection in remote sensing.

Abstract: Semantic change detection in remote sensing aims to identify land cover changes between bi-temporal image pairs. Progress in this area has been limited by the scarcity of annotated datasets, as pixel-level annotation is costly and time-consuming. To address this, recent methods leverage synthetic data or generate artificial change pairs, but out-of-domain generalization remains limited. In this work, we introduce a weak temporal supervision strategy that leverages additional temporal observations of existing single-temporal datasets, without requiring any new annotations. Specifically, we extend single-date remote sensing datasets with new observations acquired at different times and train a change detection model by assuming that real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. To handle the inherent noise in these weak labels, we employ an object-aware change map generation and an iterative refinement process. We validate our approach on extended versions of the FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Lastly, we showcase results over large areas in France, highlighting the scalability potential of our method.

[276] Adapting Depth Anything to Adverse Imaging Conditions with Events

Shihan Peng, Yuyang Xiong, Hanyu Zhou, Zhiwei Shi, Haoyue Liu, Gang Chen, Luxin Yan, Yi Chang

Main category: cs.CV

TL;DR: ADAE enhances Depth Anything foundation model for robust depth estimation in degraded scenes (extreme illumination, motion blur) by fusing frame and event camera data using entropy-aware spatial fusion and motion-guided temporal correction.

Details

Motivation: Depth foundation models like Depth Anything work well in ideal conditions but fail under adverse imaging conditions (extreme illumination, motion blur) that corrupt visual signals. While event cameras can help with their high dynamic range and temporal resolution, existing fusion models are trained from scratch on domain-specific data and don't inherit foundation models' open-world knowledge and generalization capabilities.

Method: ADAE is an event-guided spatiotemporal fusion framework with two key components: 1) Entropy-Aware Spatial Fusion - adaptively merges frame-based and event-based features using information entropy to indicate illumination-induced degradation; 2) Motion-Guided Temporal Correction - uses event-based motion cues to recalibrate ambiguous features in blurred regions. These complementary components work together within a unified framework.

Result: Extensive experiments verify the superiority of the proposed method. The code will be released upon acceptance.

Conclusion: ADAE successfully enhances the Depth Anything foundation model for robust depth estimation under adverse imaging conditions by leveraging event camera data through intelligent spatiotemporal fusion strategies, while preserving the foundation model’s open-world knowledge and generalization capabilities.

Abstract: Robust depth estimation under dynamic and adverse lighting conditions is essential for robotic systems. Currently, depth foundation models, such as Depth Anything, achieve great success in ideal scenes but remain challenging under adverse imaging conditions such as extreme illumination and motion blur. These degradations corrupt the visual signals of frame cameras, weakening the discriminative features of frame-based depths across the spatial and temporal dimensions. Typically, existing approaches incorporate event cameras to leverage their high dynamic range and temporal resolution, aiming to compensate for corrupted frame features. However, such specialized fusion models are predominantly trained from scratch on domain-specific datasets, thereby failing to inherit the open-world knowledge and robust generalization inherent to foundation models. In this work, we propose ADAE, an event-guided spatiotemporal fusion framework for Depth Anything in degraded scenes. Our design is guided by two key insights: 1) Entropy-Aware Spatial Fusion. We adaptively merge frame-based and event-based features using an information entropy strategy to indicate illumination-induced degradation. 2) Motion-Guided Temporal Correction. We resort to the event-based motion cue to recalibrate ambiguous features in blurred regions. Under our unified framework, the two components are complementary to each other and jointly enhance Depth Anything under adverse imaging conditions. Extensive experiments have been performed to verify the superiority of the proposed method. Our code will be released upon acceptance.

[277] BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models

Sunny Gupta, Shounak Das, Amit Sethi

Main category: cs.CV

TL;DR: BiPrompt: A bilateral prompt optimization framework that simultaneously debiases both visual and textual modalities in CLIP-like vision-language models during test-time adaptation, improving robustness to spurious correlations without retraining.

Details

Motivation: Vision-language foundation models like CLIP show strong zero-shot performance but remain vulnerable to spurious correlations across modalities. Existing debiasing methods typically address only one modality (visual OR textual), leading to partial robustness and unstable adaptation under distribution shifts.

Method: BiPrompt uses a bilateral approach: (1) Visual side: structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions; (2) Textual side: balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. These modules jointly minimize conditional mutual information between spurious cues and predictions.

Result: Extensive evaluations on real-world and synthetic bias benchmarks show consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods.

Conclusion: BiPrompt establishes a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation, enabling models to steer toward causal, domain-invariant reasoning without retraining or domain supervision.

Abstract: Vision language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality either visual or textual leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.

[278] Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding

Toshihiko Nishimura, Hirofumi Abe, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida

Main category: cs.CV

TL;DR: A training-free 3D semantic segmentation method that projects point clouds to 2D, uses foundation models with natural language prompts, and aggregates multi-view predictions through weighted voting.

Details

Motivation: To overcome the limitations of supervised 3D segmentation methods that require expensive annotated 3D training data or paired RGB images, and to enable open-vocabulary recognition for arbitrary object detection.

Method: Projects 3D point clouds onto 2D images using virtual cameras, performs semantic segmentation via foundation 2D models guided by natural language prompts, and aggregates predictions from multiple viewpoints through weighted voting.

Result: Outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods, while supporting open-vocabulary recognition with arbitrary text queries.

Conclusion: The proposed method provides an effective training-free solution for 3D semantic segmentation that eliminates the need for annotated 3D data while enabling flexible open-vocabulary object recognition.

Abstract: This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.

[279] AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off

Yihan Zhu, Mengying Ge

Main category: cs.CV

TL;DR: AlignVTOFF is a novel parallel U-Net framework for Virtual Try-Off that addresses texture attenuation by using Reference U-Net for geometric fidelity and Texture-Spatial Feature Alignment to preserve high-frequency details.

Details

Motivation: Existing VTOFF methods struggle with preserving structured patterns and fine-grained details due to lightweight feature extraction modules, leading to texture attenuation during garment generation.

Method: Proposes AlignVTOFF with two key components: 1) Reference U-Net for multi-scale feature extraction and geometric fidelity, 2) Texture-Spatial Feature Alignment (TSFA) using hybrid attention (trainable cross-attention + frozen self-attention) to inject reference features into a frozen denoising U-Net.

Result: Extensive experiments show AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garments with improved structural realism and high-frequency detail fidelity across multiple settings.

Conclusion: AlignVTOFF effectively addresses the texture attenuation problem in VTOFF by combining geometric modeling with texture-spatial alignment, achieving superior garment generation quality with preserved high-frequency details.

Abstract: Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggles to preserve structured patterns and fine-grained details, leading to texture attenuation during generation.To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design, consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising process.Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.

[280] PhysSFI-Net: Physics-informed Geometric Learning of Skeletal and Facial Interactions for Orthognathic Surgical Outcome Prediction

Jiahao Bao, Huazhen Liu, Yu Zhuang, Leran Tao, Xinyu Xu, Yongtao Shi, Mengjia Cheng, Yiming Wang, Congshuang Ku, Ting Zeng, Yilang Du, Siyi Chen, Shunyao Shen, Suncheng Xiang, Hongbo Yu

Main category: cs.CV

TL;DR: PhysSFI-Net: A physics-informed geometric deep learning framework for predicting soft tissue deformation after orthognathic surgery with superior accuracy compared to state-of-the-art methods.

Details

Motivation: Orthognathic surgery requires accurate postoperative facial morphology prediction for preoperative planning. Traditional biomechanical models are computationally expensive, while existing geometric deep learning approaches lack interpretability.

Method: PhysSFI-Net combines three components: 1) hierarchical graph module with craniofacial and surgical plan encoders using attention mechanisms to extract skeletal-facial interaction features, 2) LSTM-based sequential predictor for incremental soft tissue deformation, and 3) biomechanics-inspired module for high-resolution facial surface reconstruction.

Result: On 135 patient dataset, PhysSFI-Net achieved point cloud shape error of 1.070 ± 0.088 mm, surface deviation error of 1.296 ± 0.349 mm, and landmark localization error of 2.445 ± 1.326 mm, outperforming state-of-the-art ACMT-Net.

Conclusion: PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.

Abstract: Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 +/- 0.088 mm, a surface deviation error of 1.296 +/- 0.349 mm, and a landmark localization error of 2.445 +/- 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.

[281] NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu

Main category: cs.CV

TL;DR: NextFlow is a unified autoregressive transformer that achieves fast, high-quality multimodal generation using next-scale prediction for images instead of traditional raster-scan methods.

Details

Motivation: The paper is motivated by the distinct nature of modalities - text is strictly sequential while images are inherently hierarchical. Traditional raster-scan methods for autoregressive image generation are inefficient, so the authors seek a better approach that respects the hierarchical nature of visual data while maintaining unified multimodal capabilities.

Method: NextFlow uses a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image tokens. Key innovations include: 1) next-scale prediction for visual generation instead of next-token prediction (which is retained for text), 2) a robust training recipe to address multi-scale generation instabilities, and 3) a prefix-tuning strategy for reinforcement learning.

Result: NextFlow generates 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable autoregressive models. It achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.

Conclusion: The paper demonstrates that a unified autoregressive architecture with modality-appropriate generation strategies (next-token for text, next-scale for images) can achieve efficient, high-quality multimodal understanding and generation, unlocking capabilities like image editing, interleaved content creation, and video generation.

Abstract: We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.

[282] MCD-Net: A Lightweight Deep Learning Baseline for Optical-Only Moraine Segmentation

Zhehuan Cao, Fiseha Berhanu Tesema, Ping Fu, Jianfeng Ren, Ahmed Nasr

Main category: cs.CV

TL;DR: First large-scale optical-only moraine segmentation dataset with 3,340 annotated images from Chinese glaciated regions, plus MCD-Net - a lightweight model achieving 62.3% mIoU with 60%+ computational reduction.

Details

Motivation: Glacial segmentation is crucial for studying past glacier dynamics and climate-driven landscape changes, but automated mapping is hindered by weak optical contrast and limited high-resolution DEM availability.

Method: Created first large-scale optical-only moraine segmentation dataset (3,340 manually annotated high-res Google Earth images from Sichuan/Yunnan, China). Developed MCD-Net - lightweight baseline integrating MobileNetV2 encoder, CBAM attention module, and DeepLabV3+ decoder.

Result: MCD-Net achieved 62.3% mean Intersection over Union and 72.8% Dice coefficient while reducing computational cost by over 60% compared to deeper backbones (ResNet152, Xception). Ridge delineation remains challenging due to sub-pixel width and spectral ambiguity.

Conclusion: Optical imagery alone can provide reliable moraine-body segmentation. The publicly available dataset and code establish a reproducible benchmark for moraine-specific segmentation and offer a deployable baseline for high-altitude glacial monitoring.

Abstract: Glacial segmentation is essential for reconstructing past glacier dynamics and evaluating climate-driven landscape change. However, weak optical contrast and the limited availability of high-resolution DEMs hinder automated mapping. This study introduces the first large-scale optical-only moraine segmentation dataset, comprising 3,340 manually annotated high-resolution images from Google Earth covering glaciated regions of Sichuan and Yunnan, China. We develop MCD-Net, a lightweight baseline that integrates a MobileNetV2 encoder, a Convolutional Block Attention Module (CBAM), and a DeepLabV3+ decoder. Benchmarking against deeper backbones (ResNet152, Xception) shows that MCD-Net achieves 62.3% mean Intersection over Union (mIoU) and 72.8% Dice coefficient while reducing computational cost by more than 60%. Although ridge delineation remains constrained by sub-pixel width and spectral ambiguity, the results demonstrate that optical imagery alone can provide reliable moraine-body segmentation. The dataset and code are publicly available at https://github.com/Lyra-alpha/MCD-Net, establishing a reproducible benchmark for moraine-specific segmentation and offering a deployable baseline for high-altitude glacial monitoring.

[283] Seeing the Unseen: Zooming in the Dark with Event Cameras

Dachun Kai, Zeyu Xiao, Huyue Zhu, Jiaxiao Wang, Yueyi Zhang, Xiaoyan Sun

Main category: cs.CV

TL;DR: RetinexEVSR is the first event-driven low-light video super-resolution framework that uses high-contrast event signals and Retinex priors to restore high-resolution videos from low-light, low-resolution inputs.

Details

Motivation: Existing LVSR methods struggle to recover fine details due to limited contrast and insufficient high-frequency information in low-light conditions. There's a need for better approaches that can effectively handle the challenges of low-light video super-resolution.

Method: Proposes RetinexEVSR with: 1) Bidirectional cross-modal fusion strategy to integrate cues from noisy event data and degraded RGB frames, 2) Illumination-guided event enhancement module that refines event features using Retinex-derived illumination maps, 3) Event-guided reflectance enhancement module that recovers reflectance details via multi-scale fusion using enhanced event features.

Result: Achieves state-of-the-art performance on three datasets. On SDSD benchmark: up to 2.95 dB gain while reducing runtime by 65% compared to prior event-based methods.

Conclusion: RetinexEVSR effectively addresses low-light video super-resolution challenges by leveraging event signals and Retinex priors, achieving superior performance with significant efficiency improvements over existing methods.

Abstract: This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality under low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method can get up to 2.95 dB gain while reducing runtime by 65% compared to prior event-based methods. Code: https://github.com/DachunKai/RetinexEVSR.

[284] InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting

Jinlong Fan, Shanshan Zhao, Liang Zheng, Jing Zhang, Yuxiang Yang, Mingming Gong

Main category: cs.CV

TL;DR: InpaintHuman: A method for reconstructing complete, animatable 3D human avatars from occluded monocular videos using multi-scale UV representation and identity-preserving diffusion inpainting.

Details

Motivation: Existing 3D Gaussian Splatting methods struggle with severe occlusions in monocular videos, producing corrupted geometry and temporal inconsistencies when reconstructing human avatars.

Method: Two key innovations: (1) multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation for robust reconstruction of occluded regions, and (2) identity-preserving diffusion inpainting module integrating textual inversion with semantic-conditioned guidance for subject-specific completion.

Result: Competitive performance on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion), with consistent improvements in reconstruction quality across diverse poses and viewpoints.

Conclusion: InpaintHuman successfully addresses occlusion challenges in monocular 3D human avatar reconstruction, producing high-fidelity, complete, and animatable avatars through novel representation and inpainting techniques.

Abstract: Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.

[285] 360-GeoGS: Geometrically Consistent Feed-Forward 3D Gaussian Splatting Reconstruction for 360 Images

Jiaqi Yao, Zhongmiao Yan, Jingyi Xu, Songpengcheng Xia, Yan Xiang, Ling Pei

Main category: cs.CV

TL;DR: A feed-forward 3D Gaussian Splatting framework for 360 images that improves geometric consistency while maintaining high rendering quality through Depth-Normal geometric regularization.

Details

Motivation: Traditional multi-view stereo struggles with sparse viewpoints and low-texture regions, while neural rendering approaches require per-scene optimization and lack real-time efficiency. Existing feed-forward 3DGS variants focus on visual quality rather than geometric consistency, limiting accurate surface reconstruction for spatial perception tasks.

Method: Proposes a novel feed-forward 3DGS framework for 360 images that introduces Depth-Normal geometric regularization to couple rendered depth gradients with normal information. This supervises Gaussian rotation, scale, and position to improve point cloud and surface accuracy while maintaining high rendering quality.

Result: Experimental results show the method maintains high rendering quality while significantly improving geometric consistency, providing an effective solution for 3D reconstruction in spatial perception tasks.

Conclusion: The proposed framework successfully addresses the geometric consistency limitations of existing feed-forward 3DGS approaches, enabling both high-quality rendering and accurate surface reconstruction for spatial intelligence applications like AR, robotics, and digital twins.

Abstract: 3D scene reconstruction is fundamental for spatial intelligence applications such as AR, robotics, and digital twins. Traditional multi-view stereo struggles with sparse viewpoints or low-texture regions, while neural rendering approaches, though capable of producing high-quality results, require per-scene optimization and lack real-time efficiency. Explicit 3D Gaussian Splatting (3DGS) enables efficient rendering, but most feed-forward variants focus on visual quality rather than geometric consistency, limiting accurate surface reconstruction and overall reliability in spatial perception tasks. This paper presents a novel feed-forward 3DGS framework for 360 images, capable of generating geometrically consistent Gaussian primitives while maintaining high rendering quality. A Depth-Normal geometric regularization is introduced to couple rendered depth gradients with normal information, supervising Gaussian rotation, scale, and position to improve point cloud and surface accuracy. Experimental results show that the proposed method maintains high rendering quality while significantly improving geometric consistency, providing an effective solution for 3D reconstruction in spatial perception tasks.

[286] VIBE: Visual Instruction Based Editor

Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich

Main category: cs.CV

TL;DR: A compact, high-throughput instruction-based image editing pipeline using 2B-parameter Qwen3-VL for guidance and 1.6B-parameter Sana1.5 for generation, achieving competitive performance with much larger models while being computationally efficient.

Details

Motivation: Most open-source instruction-based image editing models are large (6B-20B parameters) and computationally expensive, limiting deployment in resource-constrained settings. There's a need for compact, efficient models that maintain high quality.

Method: Uses Qwen3-VL (2B parameters) to guide editing process and Sana1.5 (1.6B parameters) for image generation. Focuses on architecture, data processing, training configuration, and evaluation optimized for low-cost inference and source consistency.

Result: Matches or exceeds performance of substantially larger baselines on ImgEdit and GEdit benchmarks. Particularly strong on edits requiring input preservation (attribute adjustment, object removal, background edits, targeted replacement). Fits in 24GB GPU memory, generates 2K images in ~4 seconds on H100.

Conclusion: Demonstrates that compact models (under 4B total parameters) can achieve real-world quality instruction-based image editing, enabling efficient deployment in resource-constrained environments while maintaining competitive performance with much larger models.

Abstract: Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as an attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.

[287] HeadLighter: Disentangling Illumination in Generative 3D Gaussian Heads via Lightstage Captures

Yating Wang, Yuan Sun, Xuan Wang, Ran Yi, Boyao Zhou, Yipengjing Sun, Hongyu Liu, Yinuo Wang, Lizhuang Ma

Main category: cs.CV

TL;DR: HeadLighter: A supervised framework for disentangling illumination and appearance in 3D head generative models using dual-branch architecture and progressive training with light stage data.

Details

Motivation: Current 3D-aware head generative models based on 3D Gaussian Splatting achieve real-time, photorealistic synthesis but suffer from deep entanglement of illumination and intrinsic appearance, preventing controllable relighting. Existing disentanglement methods rely on strong assumptions that restrict their capacity for complex illumination.

Method: Introduces HeadLighter with dual-branch architecture that separately models lighting-invariant head attributes and physically grounded rendering components. Uses progressive disentanglement training to inject head appearance priors, supervised by multi-view images captured under controlled light conditions with a light stage setup. Includes distillation strategy to generate high-quality normals for realistic rendering.

Result: Method preserves high-quality generation and real-time rendering while simultaneously supporting explicit lighting and viewpoint editing. Will publicly release code and dataset.

Conclusion: HeadLighter addresses the fundamental limitation of illumination-appearance entanglement in 3D head generative models, enabling physically plausible decomposition and controllable relighting through supervised learning with controlled lighting data.

Abstract: Recent 3D-aware head generative models based on 3D Gaussian Splatting achieve real-time, photorealistic and view-consistent head synthesis. However, a fundamental limitation persists: the deep entanglement of illumination and intrinsic appearance prevents controllable relighting. Existing disentanglement methods rely on strong assumptions to enable weakly supervised learning, which restricts their capacity for complex illumination. To address this challenge, we introduce HeadLighter, a novel supervised framework that learns a physically plausible decomposition of appearance and illumination in head generative models. Specifically, we design a dual-branch architecture that separately models lighting-invariant head attributes and physically grounded rendering components. A progressive disentanglement training is employed to gradually inject head appearance priors into the generative architecture, supervised by multi-view images captured under controlled light conditions with a light stage setup. We further introduce a distillation strategy to generate high-quality normals for realistic rendering. Experiments demonstrate that our method preserves high-quality generation and real-time rendering, while simultaneously supporting explicit lighting and viewpoint editing. We will publicly release our code and dataset.

[288] A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets

Annoor Sharara Akhand

Main category: cs.CV

TL;DR: Comparison of three CNN paradigms (custom training, pre-trained feature extraction, transfer learning) across five real-world image classification tasks shows transfer learning performs best, while custom CNNs offer good efficiency-accuracy trade-off for constrained resources.

Details

Motivation: To provide a controlled comparison of three common CNN approaches (custom training from scratch, using pre-trained models as fixed feature extractors, and transfer learning via fine-tuning) across diverse real-world image classification problems, helping practitioners make informed decisions based on performance and efficiency trade-offs.

Method: Systematic evaluation across five real-world image classification datasets: road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models were assessed using accuracy and macro F1-score, with additional efficiency metrics including training time per epoch and parameter counts.

Result: Transfer learning consistently yielded the strongest predictive performance across all datasets. Custom CNNs provided an attractive efficiency-accuracy trade-off, particularly when compute and memory budgets are constrained. The pre-trained feature extraction approach showed intermediate performance.

Conclusion: Transfer learning is recommended for best predictive performance, while custom CNNs offer practical advantages for resource-constrained scenarios, providing practitioners with clear guidance on selecting CNN approaches based on their specific performance and efficiency requirements.

Abstract: Convolutional Neural Networks (CNNs) are a standard approach for visual recognition due to their capacity to learn hierarchical representations from raw pixels. In practice, practitioners often choose among (i) training a compact custom CNN from scratch, (ii) using a large pre-trained CNN as a fixed feature extractor, and (iii) performing transfer learning via partial or full fine-tuning of a pre-trained backbone. This report presents a controlled comparison of these three paradigms across five real-world image classification datasets spanning road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models are evaluated using accuracy and macro F1-score, complemented by efficiency metrics including training time per epoch and parameter counts. The results show that transfer learning consistently yields the strongest predictive performance, while the custom CNN provides an attractive efficiency–accuracy trade-off, especially when compute and memory budgets are constrained.

[289] MagicFight: Personalized Martial Arts Combat Video Generation

Jiancheng Huang, Mingfu Yan, Songyan Chen, Yi Huang, Shifeng Chen

Main category: cs.CV

TL;DR: MagicFight introduces personalized martial arts combat video generation for two-person interactions, addressing limitations of single-person models through a custom Unity-generated dataset and adapted generation techniques.

Details

Motivation: Existing personalized video generation models focus on single-person scenarios (like dancing) but fail to capture the complexities of two-person interactions, especially martial arts combat, leading to identity confusion, anomalous limbs, and action mismatches.

Method: 1) Creates a custom dataset using Unity physics engine with diverse 3D characters, martial arts moves, and scenes; 2) Refines and adapts existing models and strategies to handle two-person combat scenarios; 3) Focuses on maintaining individual identities and coherent action sequences.

Result: MagicFight generates high-fidelity two-person combat videos that preserve individual identities and ensure seamless, coherent action sequences, establishing a foundation for interactive video content creation in multi-person scenarios.

Conclusion: The work pioneers personalized martial arts combat video generation, addresses the lack of appropriate datasets through Unity-based data creation, and lays groundwork for future innovations in interactive multi-person video content generation.

Abstract: Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation. Website: https://MingfuYAN.github.io/MagicFight/ Dataset: https://huggingface.co/datasets/MingfuYAN/KungFu-Fiesta

[290] TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation

Salim Khazem

Main category: cs.CV

TL;DR: TopoLoRA-SAM adapts SAM for domain-specific binary segmentation using topology-aware LoRA adapters, achieving state-of-the-art results with only 5.2% parameter training.

Details

Motivation: Foundation models like SAM have strong zero-shot capabilities but struggle with domain-specific semantic segmentation, especially for thin structures and noisy modalities. Full fine-tuning is computationally expensive and risks catastrophic forgetting.

Method: Proposes TopoLoRA-SAM: injects Low-Rank Adaptation (LoRA) into frozen ViT encoder, adds lightweight spatial convolutional adapter, and optionally uses topology-aware supervision via differentiable clDice loss.

Result: Achieves best retina-average Dice and best overall average Dice across five benchmarks (retinal vessels, polyp segmentation, SAR sea/land). Trains only 5.2% of parameters (~4.9M). Substantially improves accuracy on challenging CHASE_DB1 dataset.

Conclusion: Topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models while being computationally efficient and avoiding catastrophic forgetting.

Abstract: Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose \textbf{TopoLoRA-SAM}, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only \textbf{5.2%} of model parameters ($\sim$4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at : https://github.com/salimkhazem/Seglab.git

[291] Car Drag Coefficient Prediction from 3D Point Clouds Using a Slice-Based Surrogate Model

Utkarsh Singh, Absaar Ali, Adarsh Roy

Main category: cs.CV

TL;DR: A lightweight surrogate model for aerodynamic drag prediction using sequential slice-wise processing of 3D vehicle point clouds achieves high accuracy with fast inference times.

Details

Motivation: Traditional aerodynamic evaluation methods (CFD, wind tunnel testing) are resource-intensive and slow for early design iteration. Existing ML surrogate models often have high computational complexity, limited interpretability, or insufficient accuracy for detailed geometric inputs.

Method: 3D vehicle point clouds are decomposed into ordered 2D cross-sectional slices along the stream-wise axis. Each slice is encoded by a lightweight PointNet2D module, and the sequence of slice embeddings is processed by a bidirectional LSTM to capture longitudinal geometric evolution.

Result: The model achieves R² > 0.9528 and MAE ≈ 6.046 × 10⁻³ in Cd prediction on the DrivAerNet++ dataset, with inference time of ~0.025 seconds per sample on consumer-grade GPU.

Conclusion: The approach provides fast, accurate, and interpretable aerodynamic feedback, enabling more agile and informed automotive design exploration in early development stages.

Abstract: The automotive industry’s pursuit of enhanced fuel economy and performance necessitates efficient aerodynamic design. However, traditional evaluation methods such as computational fluid dynamics (CFD) and wind tunnel testing are resource intensive, hindering rapid iteration in the early design stages. Machine learning-based surrogate models offer a promising alternative, yet many existing approaches suffer from high computational complexity, limited interpretability, or insufficient accuracy for detailed geometric inputs. This paper introduces a novel lightweight surrogate model for the prediction of the aerodynamic drag coefficient (Cd) based on a sequential slice-wise processing of the geometry of the 3D vehicle. Inspired by medical imaging, 3D point clouds of vehicles are decomposed into an ordered sequence of 2D cross-sectional slices along the stream-wise axis. Each slice is encoded by a lightweight PointNet2D module, and the sequence of slice embeddings is processed by a bidirectional LSTM to capture longitudinal geometric evolution. The model, trained and evaluated on the DrivAerNet++ dataset, achieves a high coefficient of determination (R^2 > 0.9528) and a low mean absolute error (MAE approx 6.046 x 10^{-3}) in Cd prediction. With an inference time of approximately 0.025 seconds per sample on a consumer-grade GPU, our approach provides fast, accurate, and interpretable aerodynamic feedback, facilitating more agile and informed automotive design exploration.

[292] Beyond Segmentation: An Oil Spill Change Detection Framework Using Synthetic SAR Imagery

Chenyang Lai, Shuaiyu Chen, Tianjin Huang, Siyang Song, Guangliang Cheng, Chunbo Luo, Zeyu Fu

Main category: cs.CV

TL;DR: This paper introduces OSCD (Oil Spill Change Detection), a bi-temporal approach for oil spill detection using SAR imagery, addressing false positives from static single-image methods by comparing pre- and post-spill images, with a novel TAHI framework for synthetic pre-spill image generation when real data is unavailable.

Details

Motivation: Current oil spill detection methods using SAR imagery suffer from high false positive rates due to difficulty distinguishing oil spills from visually similar oceanic features like biogenic slicks or low-wind zones. Static single-image approaches lack temporal context and struggle with generalization, especially in data-scarce conditions.

Method: The paper proposes OSCD (Oil Spill Change Detection) as a bi-temporal task comparing pre- and post-spill SAR images. To address the lack of real pre-spill imagery, they introduce TAHI (Temporal-Aware Hybrid Inpainting) framework with two components: 1) High-Fidelity Hybrid Inpainting for oil-free reconstruction, and 2) Temporal Realism Enhancement for radiometric and sea-state consistency. They use TAHI to create the first OSCD dataset and benchmark state-of-the-art change detection models.

Result: OSCD significantly reduces false positives and improves detection accuracy compared to conventional segmentation methods. The approach demonstrates the value of temporally-aware methods for reliable, scalable oil spill monitoring in real-world scenarios.

Conclusion: The bi-temporal OSCD approach with TAHI framework provides a more reliable solution for oil spill detection by leveraging temporal context, overcoming limitations of static single-image methods, and enabling better performance even when real pre-spill imagery is unavailable through synthetic image generation.

Abstract: Marine oil spills are urgent environmental hazards that demand rapid and reliable detection to minimise ecological and economic damage. While Synthetic Aperture Radar (SAR) imagery has become a key tool for large-scale oil spill monitoring, most existing detection methods rely on deep learning-based segmentation applied to single SAR images. These static approaches struggle to distinguish true oil spills from visually similar oceanic features (e.g., biogenic slicks or low-wind zones), leading to high false positive rates and limited generalizability, especially under data-scarce conditions. To overcome these limitations, we introduce Oil Spill Change Detection (OSCD), a new bi-temporal task that focuses on identifying changes between pre- and post-spill SAR images. As real co-registered pre-spill imagery is not always available, we propose the Temporal-Aware Hybrid Inpainting (TAHI) framework, which generates synthetic pre-spill images from post-spill SAR data. TAHI integrates two key components: High-Fidelity Hybrid Inpainting for oil-free reconstruction, and Temporal Realism Enhancement for radiometric and sea-state consistency. Using TAHI, we construct the first OSCD dataset and benchmark several state-of-the-art change detection models. Results show that OSCD significantly reduces false positives and improves detection accuracy compared to conventional segmentation, demonstrating the value of temporally-aware methods for reliable, scalable oil spill monitoring in real-world scenarios.

[293] Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

Romain Vo, Julián Tachella

Main category: cs.CV

TL;DR: A domain partitioning strategy and normal operator approximations enable training of end-to-end reconstruction models for large-scale 3D imaging problems on a single GPU.

Details

Motivation: Deep learning methods for imaging inverse problems struggle with large-scale 3D imaging due to memory constraints from global forward operators, preventing operator incorporation in network architectures and hindering patching strategies.

Method: Proposes domain partitioning strategy and normal operator approximations that allow training of end-to-end reconstruction models with forward operators for arbitrarily large problems, enabling single GPU training and inference.

Result: Achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI while requiring only a single GPU for both training and inference.

Conclusion: The proposed approach successfully enables operator-incorporated deep learning reconstruction for large-scale 3D imaging problems with practical computational requirements.

Abstract: Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.

[294] Towards Long-window Anchoring in Vision-Language Model Distillation

Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li

Main category: cs.CV

TL;DR: LAid improves long-context understanding in small vision-language models through attention distillation with progressive distance weighting and learnable RoPE modulation.

Details

Motivation: Small VLMs have limited window sizes that fail at linguistics-photography alignment, and conventional methods don't effectively transfer long-range attention mechanisms from large models.

Method: LAid uses two components: 1) progressive distance-weighted attention matching that emphasizes longer position differences during training, and 2) learnable RoPE response gain modulation that selectively amplifies position sensitivity.

Result: LAid-distilled models achieve up to 3.2× longer effective context windows compared to baseline small models while maintaining/improving performance on standard VL benchmarks.

Conclusion: LAid provides practical techniques for efficient long-context VLMs and theoretical insights into how positional understanding emerges and transfers during distillation.

Abstract: While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students’ capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.

[295] Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

Oliver Custance, Saad Khan, Simon Parkinson

Main category: cs.CV

TL;DR: Commodity ESP32 WiFi sensors cannot reliably separate multiple people using CSI signals, with all tested methods achieving only 45-56% accuracy regardless of algorithm complexity.

Details

Motivation: While WiFi CSI shows promise for single-person gait identification, multi-person identification remains unexplored with commodity hardware. The key question is whether poor multi-person performance is due to algorithmic limitations or fundamental hardware constraints of low-cost sensors.

Method: Systematically evaluated six signal separation methods (FastICA, SOBI, PCA, NMF, Wavelet, Tensor Decomposition) across seven scenarios with 1-10 people using commodity ESP32 WiFi sensors. Used novel diagnostic metrics: intra-subject variability, inter-subject distinguishability, and performance degradation rate.

Result: All methods achieved similarly low accuracy (45-56%, σ=3.74%) with statistically insignificant differences (p > 0.05). Best-performing method (NMF) achieved only 56% accuracy. Analysis showed high intra-subject variability, low inter-subject distinguishability, and severe performance degradation as person count increased.

Conclusion: Commodity ESP32 sensors cannot provide sufficient signal quality for reliable multi-person separation. The limitation is fundamental hardware constraint rather than algorithmic limitation, explaining why previous complex setups required modified firmware.

Abstract: WiFi Channel State Information (CSI) has shown promise for single-person gait identification, with numerous studies reporting high accuracy. However, multi-person identification remains largely unexplored, with the limited existing work relying on complex, expensive setups requiring modified firmware. A critical question remains unanswered: is poor multi-person performance an algorithmic limitation or a fundamental hardware constraint? We systematically evaluate six diverse signal separation methods (FastICA, SOBI, PCA, NMF, Wavelet, Tensor Decomposition) across seven scenarios with 1-10 people using commodity ESP32 WiFi sensors–a simple, low-cost, off-the-shelf solution. Through novel diagnostic metrics (intra-subject variability, inter-subject distinguishability, performance degradation rate), we reveal that all methods achieve similarly low accuracy (45-56%, $σ$=3.74%) with statistically insignificant differences (p $>$ 0.05). Even the best-performing method, NMF, achieves only 56% accuracy. Our analysis reveals high intra-subject variability, low inter-subject distinguishability, and severe performance degradation as person count increases, indicating that commodity ESP32 sensors cannot provide sufficient signal quality for reliable multi-person separation.

[296] QuIC: A Quantum-Inspired Interaction Classifier for Revitalizing Shallow CNNs in Fine-Grained Recognition

Cheng Ying Wu, Yen Jui Chang

Main category: cs.CV

TL;DR: QuIC is a quantum-inspired lightweight module that boosts shallow networks for fine-grained visual classification by capturing second-order feature interactions without exploding dimensionality.

Details

Motivation: Deep learning models for Fine-Grained Visual Classification (FGVC) are too computationally expensive for edge devices, while shallow networks lack the ability to capture subtle feature interactions needed for distinguishing visually similar sub-categories. Existing solutions like Bilinear CNNs have issues with high dimensionality and training instability.

Method: QuIC (Quantum-inspired Interaction Classifier) models feature channels as interacting quantum states and captures second-order feature covariance through a learnable observable operator. It’s designed as a lightweight plug-and-play module that supports stable, single-stage end-to-end training without exploding feature dimensions.

Result: QuIC significantly revitalizes shallow backbones: boosts VGG16 Top-1 accuracy by nearly 20%, outperforms state-of-the-art attention mechanisms (SE-Block) on ResNet18. Qualitative analysis shows QuIC resolves ambiguous cases by attending to fine-grained discriminative features and enforcing compact intra-class clustering.

Conclusion: QuIC provides an effective solution for deploying FGVC on resource-constrained devices by enabling shallow networks to capture the necessary high-order feature interactions through quantum-inspired modeling, achieving both efficiency and accuracy.

Abstract: Deploying deep learning models for Fine-Grained Visual Classification (FGVC) on resource-constrained edge devices remains a significant challenge. While deep architectures achieve high accuracy on benchmarks like CUB-200-2011, their computational cost is often prohibitive. Conversely, shallow networks (e.g., AlexNet, VGG) offer efficiency but fail to distinguish visually similar sub-categories. This is because standard Global Average Pooling (GAP) heads capture only first-order statistics, missing the subtle high-order feature interactions required for FGVC. While Bilinear CNNs address this, they suffer from high feature dimensionality and instability during training. To bridge this gap, we propose the Quantum-inspired Interaction Classifier (QuIC). Drawing inspiration from quantum mechanics, QuIC models feature channels as interacting quantum states and captures second-order feature covariance via a learnable observable operator. Designed as a lightweight, plug-and-play module, QuIC supports stable, single-stage end-to-end training without exploding feature dimensions. Experimental results demonstrate that QuIC significantly revitalizes shallow backbones: it boosts the Top-1 accuracy of VGG16 by nearly 20% and outperforms state-of-the-art attention mechanisms (SE-Block) on ResNet18. Qualitative analysis, including t-SNE visualization, further confirms that QuIC resolves ambiguous cases by explicitly attending to fine-grained discriminative features and enforcing compact intra-class clustering.

[297] Mind the Gap: Continuous Magnification Sampling for Pathology Foundation Models

Alexander Möllers, Julius Hense, Florian Schulz, Timo Milbich, Maximilian Alber, Lukas Ruff

Main category: cs.CV

TL;DR: Continuous magnification sampling outperforms discrete uniform sampling in histopathology foundation models, improving performance at intermediate magnifications by up to 4 percentage points.

Details

Motivation: Pathologists examine tissue at multiple magnifications, but current foundation models have poorly understood performance across magnifications and the effect of magnification sampling during training.

Method: Model magnification sampling as multi-source domain adaptation, develop theoretical framework, introduce continuous magnification sampling, derive optimized sampling distributions, and create new benchmarks (TCGA-MS, BRACS-MS) for evaluation.

Result: Continuous sampling substantially improves over discrete sampling at intermediate magnifications with up to 4% gain in balanced classification accuracy; optimized distributions further improve performance; magnification is a primary driver of performance variation across models.

Conclusion: Continuous magnification sampling and optimized distributions enable more reliable pathology foundation models across magnifications, paving the way for better multi-scale histopathology analysis.

Abstract: In histopathology, pathologists examine both tissue architecture at low magnification and fine-grained morphology at high magnification. Yet, the performance of pathology foundation models across magnifications and the effect of magnification sampling during training remain poorly understood. We model magnification sampling as a multi-source domain adaptation problem and develop a simple theoretical framework that reveals systematic trade-offs between sampling strategies. We show that the widely used discrete uniform sampling of magnifications (0.25, 0.5, 1.0, 2.0 mpp) leads to degradation at intermediate magnifications. We introduce continuous magnification sampling, which removes gaps in magnification coverage while preserving performance at standard scales. Further, we derive sampling distributions that optimize representation quality across magnification scales. To evaluate these strategies, we introduce two new benchmarks (TCGA-MS, BRACS-MS) with appropriate metrics. Our experiments show that continuous sampling substantially improves over discrete sampling at intermediate magnifications, with gains of up to 4 percentage points in balanced classification accuracy, and that optimized distributions can further improve performance. Finally, we evaluate current histopathology foundation models, finding that magnification is a primary driver of performance variation across models. Our work paves the way towards future pathology foundation models that perform reliably across magnifications.

[298] Parameter-Efficient Domain Adaption for CSI Crowd-Counting via Self-Supervised Learning with Adapter Modules

Oliver Custance, Saad Khan, Simon Parkinson, Quan Z. Sheng

Main category: cs.CV

TL;DR: A novel two-stage WiFi CSI framework for device-free crowd counting that addresses domain shift via self-supervised contrastive learning and adapter-based fine-tuning, achieving state-of-the-art performance with minimal parameter updates.

Details

Motivation: Device-free crowd counting using WiFi CSI is promising for privacy-preserving IoT applications, but practical deployment is hindered by the domain shift problem where models trained in one environment fail to generalize to others.

Method: Two-stage framework with CSI-ResNet-A architecture: 1) Pre-training via self-supervised contrastive learning to learn domain-invariant representations, 2) Lightweight Adapter modules for efficient fine-tuning, followed by stateful counting machine for stable occupancy estimates.

Result: Achieves MAE of 0.44 in 10-shot learning on WiFlow dataset (where supervised baselines fail), near-perfect Generalisation Index, and sets new SOTA on WiAR benchmark with 98.8% accuracy. Adapter fine-tuning achieves 98.84% vs 99.67% full fine-tune while training 97.2% fewer parameters.

Conclusion: Provides a practical and scalable solution for robust WiFi sensing systems ready for real-world IoT deployments by effectively addressing domain shift through domain-invariant representation learning and parameter-efficient adaptation.

Abstract: Device-free crowd-counting using WiFi Channel State Information (CSI) is a key enabling technology for a new generation of privacy-preserving Internet of Things (IoT) applications. However, practical deployment is severely hampered by the domain shift problem, where models trained in one environment fail to generalise to another. To overcome this, we propose a novel two-stage framework centred on a CSI-ResNet-A architecture. This model is pre-trained via self-supervised contrastive learning to learn domain-invariant representations and leverages lightweight Adapter modules for highly efficient fine-tuning. The resulting event sequence is then processed by a stateful counting machine to produce a final, stable occupancy estimate. We validate our framework extensively. On our WiFlow dataset, our unsupervised approach excels in a 10-shot learning scenario, achieving a final Mean Absolute Error (MAE) of just 0.44–a task where supervised baselines fail. To formally quantify robustness, we introduce the Generalisation Index (GI), on which our model scores near-perfectly, confirming its ability to generalise. Furthermore, our framework sets a new state-of-the-art public WiAR benchmark with 98.8% accuracy. Our ablation studies reveal the core strength of our design: adapter-based fine-tuning achieves performance within 1% of a full fine-tune (98.84% vs. 99.67%) while training 97.2% fewer parameters. Our work provides a practical and scalable solution for developing robust sensing systems ready for real-world IoT deployments.

[299] Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion

Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li

Main category: cs.CV

TL;DR: The paper develops a systematic pipeline to analyze MMDiT-based diffusion models, revealing block-specific functionalities and proposing training-free strategies for improved text alignment, editing, and acceleration.

Details

Motivation: While transformer-based diffusion models like FLUX and Qwen Image show impressive text-to-image capabilities, there's limited understanding of how different blocks interact with textual conditions during synthesis. Existing analyses focus on specific components but lack comprehensive insights into block-level contributions.

Method: Developed a systematic pipeline to investigate each block’s functionality by removing, disabling, and enhancing textual hidden-states at corresponding blocks. Built on these observations to propose novel training-free strategies for improved text alignment, precise editing, and acceleration.

Result: Key findings: 1) Semantic information appears in earlier blocks while finer details are rendered in later blocks, 2) Removing specific blocks is less disruptive than disabling text conditions, 3) Enhancing textual conditions in selective blocks improves semantic attributes. The method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5 without quality loss.

Conclusion: The analysis advances understanding of MMDiT models and provides valuable insights for further improvements. The proposed training-free strategies demonstrate effectiveness across text-to-image generation, image editing, and inference acceleration tasks.

Abstract: Recent breakthroughs of transformer-based diffusion models, particularly with Multimodal Diffusion Transformers (MMDiT) driven models like FLUX and Qwen Image, have facilitated thrilling experiences in text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block’s functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrated that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality. These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.

[300] Prior-Guided DETR for Ultrasound Nodule Detection

Jingjing Wang, Zhuo Xiao, Xinning Yao, Bo Liu, Lijuan Niu, Xiangzhi Bai, Fugen Zhou

Main category: cs.CV

TL;DR: A prior-guided DETR framework for ultrasound nodule detection that incorporates geometric and structural priors to handle irregular shapes, indistinct boundaries, scale variations, and speckle noise.

Details

Motivation: Ultrasound nodule detection is challenging due to irregular shapes, indistinct boundaries, substantial scale variations, and speckle noise that degrades structural visibility. Current methods relying on purely data-driven feature learning struggle with these complex characteristics.

Method: Proposes a prior-guided DETR framework with three key components: 1) Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) in CNN backbone to inject geometric priors, 2) Multi-scale Spatial-Frequency Feature Mixer (MSFFM) to extract multi-scale structural priors using both spatial and frequency domains, and 3) Dense Feature Interaction (DFI) mechanism to propagate prior-modulated features across encoder layers for consistent guidance.

Result: Superior accuracy compared with 18 detection methods on two clinically collected thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI) for thyroid and breast nodules, particularly in detecting morphologically complex nodules.

Conclusion: The proposed prior-guided DETR framework effectively addresses ultrasound nodule detection challenges by incorporating geometric and structural priors, demonstrating significant improvements over existing methods, especially for complex nodule morphologies.

Abstract: Accurate detection of ultrasound nodules is essential for the early diagnosis and treatment of thyroid and breast cancers. However, this task remains challenging due to irregular nodule shapes, indistinct boundaries, substantial scale variations, and the presence of speckle noise that degrades structural visibility. To address these challenges, we propose a prior-guided DETR framework specifically designed for ultrasound nodule detection. Instead of relying on purely data-driven feature learning, the proposed framework progressively incorporates different prior knowledge at multiple stages of the network. First, a Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) is embedded into the CNN backbone to inject geometric priors into deformable sampling, stabilizing feature extraction for irregular and blurred nodules. Second, a Multi-scale Spatial-Frequency Feature Mixer (MSFFM) is designed to extract multi-scale structural priors, where spatial-domain processing emphasizes contour continuity and boundary cues, while frequency-domain modeling captures global morphology and suppresses speckle noise. Furthermore, a Dense Feature Interaction (DFI) mechanism propagates and exploits these prior-modulated features across all encoder layers, enabling the decoder to enhance query refinement under consistent geometric and structural guidance. Experiments conducted on two clinically collected thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI) for thyroid and breast nodules demonstrate that the proposed method achieves superior accuracy compared with 18 detection methods, particularly in detecting morphologically complex nodules.The source code is publicly available at https://github.com/wjj1wjj/Ultrasound-DETR.

[301] FMVP: Masked Flow Matching for Adversarial Video Purification

Duoxun Tang, Xueyi Zhang, Chak Hin Wang, Xi Xiao, Dasen Dai, Xinhang Jiang, Wentao Shi, Rui Li, Qing Li

Main category: cs.CV

TL;DR: FMVP is a flow matching-based adversarial video purification method that physically shatters adversarial structures via masking and reconstructs clean videos using conditional flow matching with frequency-gated loss.

Details

Motivation: Video recognition models are vulnerable to adversarial attacks, and existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Direct regression fails to recover faithful content due to subtle perturbations, requiring physical destruction of adversarial structures.

Method: FMVP physically shatters global adversarial structures using a masking strategy and reconstructs clean video dynamics via Conditional Flow Matching (CFM) with inpainting objective. It uses Frequency-Gated Loss (FGL) to suppress high-frequency adversarial residuals while preserving low-frequency fidelity. Two training paradigms: Attack-Aware (for known threats) and Generalist (for unknown threats).

Result: Outperforms state-of-the-art methods (DiffPure, Defense Patterns, Temporal Shuffling, FlowPure) on UCF-101 and HMDB-51, achieving robust accuracy >87% against PGD and >89% against CW attacks. Shows superior robustness against adaptive attacks (DiffHammer) and functions as zero-shot adversarial detector with 98% detection accuracy for PGD and 79% for CW attacks.

Conclusion: FMVP effectively purifies adversarial videos by physically destroying adversarial structures and reconstructing clean content through flow matching with frequency-aware loss, demonstrating strong performance against various attacks and serving as an effective adversarial detector.

Abstract: Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification FMVP. FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.

[302] SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuhan Liu, Zongxu Pan, Yuxin Hu

Main category: cs.CV

TL;DR: SLGNet is a parameter-efficient multimodal object detection framework that combines hierarchical structural priors with language-guided modulation in frozen ViT backbones to improve cross-modal consistency and environmental awareness.

Details

Motivation: Existing adapter-based approaches for RGB-IR multimodal detection prioritize efficiency over cross-modal structural consistency, losing critical structural cues in challenging conditions (high-contrast, nighttime). Static fusion mechanisms lack environmental awareness, limiting adaptation to dynamic scene variations.

Method: Proposes SLGNet with two key components: 1) Structure-Aware Adapter extracts hierarchical structural representations from both RGB and IR modalities and dynamically injects them into frozen ViT to compensate for structural degradation; 2) Language-Guided Modulation module uses VLM-driven structured captions to dynamically recalibrate visual features for environmental awareness.

Result: Achieves state-of-the-art performance on LLVIP, FLIR, KAIST, and DroneVehicle datasets. On LLVIP benchmark: mAP of 66.1 while reducing trainable parameters by ~87% compared to full fine-tuning.

Conclusion: SLGNet provides a robust and efficient solution for multimodal perception, effectively addressing structural consistency and environmental awareness limitations in existing approaches while maintaining parameter efficiency.

Abstract: Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.

[303] VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia

Main category: cs.CV

TL;DR: Enhanced GRPO framework for VAR models addresses asynchronous policy conflicts in visual generation through stabilizing rewards, dynamic time-step reweighting, and mask propagation from ReFL principles.

Details

Motivation: VAR models in visual generation suffer from heterogeneous input structures across generation steps, creating severe asynchronous policy conflicts that lead to unstable training and suboptimal alignment, especially in RL scenarios.

Method: Proposes three synergistic components: 1) stabilizing intermediate reward for early-stage generation guidance, 2) dynamic time-step reweighting scheme for precise credit assignment, and 3) novel mask propagation algorithm derived from Reward Feedback Learning (ReFL) principles to isolate optimization effects spatially and temporally.

Result: Demonstrates significant improvements in sample quality and objective alignment over vanilla GRPO baseline, enabling robust and effective optimization for VAR models.

Conclusion: The proposed framework successfully resolves asynchronous policy conflicts in VAR models, leading to more stable training and better alignment in visual generation tasks.

Abstract: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.

[304] DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang

Main category: cs.CV

TL;DR: DiffProxy is a novel framework that generates multi-view consistent human proxies using diffusion priors to bridge synthetic training and real-world generalization for human mesh recovery.

Details

Motivation: Real-world datasets have imperfect ground-truth annotations that bias model training, while synthetic data with precise supervision suffers from domain gap issues.

Method: Uses diffusion-based generative priors with: (1) multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) hand refinement module with flexible visual prompts for local details; (3) uncertainty-aware test-time scaling for robustness during optimization.

Result: Achieves state-of-the-art performance across five real-world benchmarks with strong zero-shot generalization, particularly on challenging scenarios with occlusions and partial views.

Conclusion: DiffProxy effectively bridges synthetic training and real-world generalization by leveraging diffusion priors, enabling precise synthetic ground truth to benefit real-world mesh recovery without domain gap issues.

Abstract: Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models’ training, while synthetic data with precise supervision suffers from domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging the diffusion-based generative priors to bridge the synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization particularly on challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html

[305] InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang

Main category: cs.CV

TL;DR: InfiniteVGGT is a causal visual geometry transformer with rolling memory that enables infinite-horizon streaming 3D geometry estimation, outperforming existing methods in long-term stability.

Details

Motivation: The paper addresses the fundamental dilemma between scalability and long-term stability in 3D visual geometry understanding. Offline models like VGGT have good geometry capability but can't handle live systems, while streaming architectures either fail with infinite-horizon inputs or suffer from catastrophic drift over long sequences.

Method: InfiniteVGGT introduces a causal visual geometry transformer with a rolling memory mechanism using a bounded yet adaptive KV cache. It employs a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively “rolling” the memory forward with each new frame. The method is fully compatible with FlashAttention.

Result: InfiniteVGGT enables infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The paper also introduces the Long3D benchmark for rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames.

Conclusion: InfiniteVGGT shatters the long-standing dilemma between scalability and long-term stability in 3D visual geometry understanding, providing a solution for live systems with infinite-horizon streaming capability. The Long3D benchmark provides the definitive evaluation platform for future research in this area.

Abstract: The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively ``rolling’’ the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT

[306] Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

Tom Burgert, Leonard Hackel, Paolo Rota, Begüm Demir

Main category: cs.CV

TL;DR: GeoRank is a novel regularization method for contrastive self-supervised learning that optimizes spherical distances to embed geographical relationships into features for multispectral remote sensing images.

Details

Motivation: Applying self-supervised learning to multispectral remote sensing images presents unique challenges due to geographical and temporal variability, requiring methods that can effectively leverage these relationships.

Method: GeoRank introduces a regularization method that directly optimizes spherical distances to embed geographical relationships into the learned feature space, improving upon prior techniques that integrate geographical metadata.

Result: GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms like BYOL and DINO.

Conclusion: The paper presents both a novel regularization method (GeoRank) and a systematic investigation of key adaptations for contrastive SSL in multispectral remote sensing, providing insights into data augmentations, dataset characteristics, and temporal dependencies.

Abstract: Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. Code is available at https://github.com/tomburgert/georank.

[307] SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting

Sara Inácio, Hugo Proença, João C. Neves

Main category: cs.CV

TL;DR: SortWaste dataset for waste detection and ClutterScore metric for measuring scene complexity in waste sorting automation.

Details

Motivation: Manual waste sorting is inefficient and hazardous, while automated systems struggle with real-world waste variability and clutter. Lack of real-world datasets hinders development of effective automated waste sorting solutions.

Method: Introduced SortWaste dataset from Material Recovery Facility with dense annotations, and proposed ClutterScore metric using proxies like object count, class/size entropy, and spatial overlap to measure scene complexity.

Result: Benchmarked state-of-the-art object detection models achieving 59.7% mAP for plastic detection, but performance significantly drops in highly cluttered scenes measured by ClutterScore.

Conclusion: Current models struggle with complex waste scenes, highlighting the need for more challenging datasets and better algorithms for real-world waste sorting automation.

Abstract: The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene’s hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.

[308] 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera

Xiaopeng Guo, Yinzhe Xu, Huajian Huang, Sai-Kit Yeung

Main category: cs.CV

TL;DR: 360DVO is the first deep learning-based monocular omnidirectional visual odometry framework that uses a distortion-aware spherical feature extractor and omnidirectional differentiable bundle adjustment to achieve robust pose estimation from 360-degree images.

Details

Motivation: Existing omnidirectional visual odometry methods rely on handcrafted features or photometric objectives, which lack robustness in challenging scenarios like aggressive motion and varying illumination. There's a need for more robust deep learning-based approaches for 360-degree camera systems.

Method: Proposes 360DVO with two key components: 1) DAS-Feat (distortion-aware spherical feature extractor) that adaptively learns distortion-resistant features from 360-degree images, and 2) ODBA (omnidirectional differentiable bundle adjustment) module that uses sparse feature patches to establish constraints for effective pose estimation.

Result: Extensive experiments on a new real-world OVO benchmark and public synthetic datasets (TartanAir V2 and 360VO) show that 360DVO surpasses state-of-the-art baselines (360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%.

Conclusion: 360DVO represents a significant advancement in omnidirectional visual odometry by introducing the first deep learning-based framework that effectively handles distortion in 360-degree images and demonstrates superior performance in challenging real-world scenarios.

Abstract: Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://chris1004336379.github.io/360DVO-homepage

[309] Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping

Saurabh Kaushik, Lalit Maurya, Beth Tellman

Main category: cs.CV

TL;DR: Prithvi-CAFE combines a pretrained Geo-Foundation Model encoder with a CNN residual branch using attention modules to improve flood mapping by capturing local details while maintaining long-range dependencies, achieving state-of-the-art results on Sen1Flood11 and FloodPlanet datasets.

Details

Motivation: Geo-Foundation Models (GFMs) struggle with flood mapping tasks, particularly failing to outperform baseline U-Net on the Sen1Flood11 dataset due to limitations in capturing critical local nuances essential for accurate flood segmentation.

Method: Prithvi-CAFE integrates the Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). It enables fast fine-tuning through adapters and performs multi-scale, multi-level fusion between GFM features and CNN features to capture both local details and long-range dependencies.

Result: Achieves state-of-the-art results: On Sen1Flood11 test data, IoU 83.41 vs original Prithvi (82.50) and other GFMs; on hold-out test site, IoU 81.37 vs baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, IoU 64.70 vs U-Net (60.14) and other GFMs.

Conclusion: Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel/multi-modal data provide complementary information and local details are critical. The simple yet effective architecture effectively combines global context from GFMs with local details from CNNs.

Abstract: Geo-Foundation Models (GFMs), have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, in case of flood mapping using Sen1Flood11 dataset as a downstream task, GFMs struggles to outperform the baseline U-Net, highlighting model’s limitation in capturing critical local nuances. To address this, we present the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), which integrate Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi-CAFE enables fast and efficient fine-tuning through adapters in Prithvi and performs multi-scale, multi-level fusion with CNN features, capturing critical local details while preserving long-range dependencies. We achieve state-of-the-art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi-CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, spectralGPT: 81.02). The improvement is even more pronounced on the hold-out test site, where Prithvi-CAFE achieves an IoU of 81.37 compared to the baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi-CAFE also surpasses the baseline U-Net and other GFMs, achieving an IoU of 64.70 compared to U-Net (60.14), Terramind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel and multi-modal data provide complementary information and local details are critical. The code is released on \href{https://github.com/Sk-2103/Prithvi-CAFE}{Prithvi-CAFE Github}

[310] Fusion2Print: Deep Flash-Non-Flash Fusion for Contactless Fingerprint Matching

Roja Sahoo, Anoop Namboodiri

Main category: cs.CV

TL;DR: Fusion2Print (F2P) is a novel framework that fuses paired flash and non-flash contactless fingerprint images to overcome limitations of single-capture methods, achieving superior recognition performance through attention-based fusion and cross-domain compatibility.

Details

Motivation: Contactless fingerprint recognition offers hygienic advantages but suffers from degraded ridge clarity due to illumination variations, skin discoloration, and specular reflections. Single-capture methods (flash or non-flash) have trade-offs: flash preserves ridge detail but introduces noise, while non-flash reduces noise but lowers ridge contrast.

Method: 1) Construct FNF Database with paired flash-non-flash contactless fingerprints; 2) Perform manual flash-non-flash subtraction to isolate ridge-preserving signals; 3) Use lightweight attention-based fusion network to integrate modalities, emphasizing informative channels and suppressing noise; 4) Apply U-Net enhancement module to produce optimally weighted grayscale image; 5) Employ deep embedding model with cross-domain compatibility for unified embedding space.

Result: F2P achieves superior recognition performance with AUC=0.999 and EER=1.12%, outperforming single-capture baselines like Verifinger and DeepPrint. The framework enhances ridge clarity and creates discriminative, robust representations compatible with both contactless and contact-based fingerprints.

Conclusion: Fusion2Print successfully addresses the limitations of single-capture contactless fingerprint recognition by systematically fusing flash and non-flash modalities. The framework demonstrates that complementary information from paired captures can significantly improve ridge clarity and recognition performance while maintaining cross-domain compatibility with existing contact-based systems.

Abstract: Contactless fingerprint recognition offers a hygienic and convenient alternative to contact-based systems, enabling rapid acquisition without latent prints, pressure artifacts, or hygiene risks. However, contactless images often show degraded ridge clarity due to illumination variation, subcutaneous skin discoloration, and specular reflections. Flash captures preserve ridge detail but introduce noise, whereas non-flash captures reduce noise but lower ridge contrast. We propose Fusion2Print (F2P), the first framework to systematically capture and fuse paired flash-non-flash contactless fingerprints. We construct a custom paired dataset, FNF Database, and perform manual flash-non-flash subtraction to isolate ridge-preserving signals. A lightweight attention-based fusion network also integrates both modalities, emphasizing informative channels and suppressing noise, and then a U-Net enhancement module produces an optimally weighted grayscale image. Finally, a deep embedding model with cross-domain compatibility, generates discriminative and robust representations in a unified embedding space compatible with both contactless and contact-based fingerprints for verification. F2P enhances ridge clarity and achieves superior recognition performance (AUC=0.999, EER=1.12%) over single-capture baselines (Verifinger, DeepPrint).

[311] BEDS: Bayesian Emergent Dissipative Structures

Laurent Caraffa

Main category: cs.CV

TL;DR: BEDS framework unifies thermodynamics, Bayesian inference, and machine learning, proposing learning as flux-to-structure conversion through entropy export, with practical P2P implementation showing massive energy efficiency gains.

Details

Motivation: To create a unified theoretical framework connecting non-equilibrium thermodynamics, Bayesian inference, information geometry, and machine learning, bridging fundamental physics with practical system design for sustainable AI.

Method: Establishes formal isomorphism between thermodynamic processes and Bayesian updating, derives fundamental constants as Bayesian inference fixed points, links Gödel’s theorems to thermodynamic constraints, and implements P2P network architecture based on BEDS principles.

Result: Derived mathematical constants (e, π, φ) as fixed points of Bayesian inference, proposed Gödel-thermodynamics conjecture, and achieved six orders of magnitude energy efficiency improvement in distributed consensus systems with continuous learning capability.

Conclusion: BEDS provides both theoretical insights into learning as dissipative structure formation and practical pathway for sustainable AI, bridging physics, logic, and system design through the fundamental principle of learning as entropy-export-driven structure formation.

Abstract: We present BEDS (Bayesian Emergent Dissipative Structures), a theoretical framework that unifies concepts from non-equilibrium thermodynamics, Bayesian inference, information geometry, and machine learning. The central thesis proposes that learning, across physical, biological, and computational systems, fundamentally constitutes the conversion of flux into structure through entropy export. Building on Prigogine’s theory of dissipative structures, we establish a formal isomorphism between thermodynamic processes and Bayesian updating, demonstrating that sustainable learning systems must follow dissipative patterns where crystallized posteriors become priors for subsequent levels of emergence. We derive fundamental mathematical constants (e, π, φ) as fixed points of Bayesian inference under minimal axioms, suggesting these constants emerge necessarily from any system capable of representing and updating uncertainty. Furthermore, we propose a conjecture linking Gödel’s incompleteness theorems to thermodynamic constraints, hypothesizing that pathologies of formal systems (incompleteness, undecidability) are structurally analogous to dissipation deficits in physical systems. As practical validation, we present a peer-to-peer network architecture implementing BEDS principles, achieving six orders of magnitude improvement in energy efficiency compared to existing distributed consensus systems while enabling continuous learning. This work bridges fundamental physics, mathematical logic, and practical system design, offering both theoretical insights into the nature of learning and computation, and a concrete pathway toward sustainable artificial intelligence.

[312] Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding

Jingming He, Chongyi Li, Shiqi Wang, Sam Kwong

Main category: cs.CV

TL;DR: A joint enhancement framework for 3D semantic Gaussian modeling that synergizes semantic and rendering branches using anisotropic 3D Gaussian Chebyshev descriptors, adaptive resource allocation based on semantic/shape signals, and cross-scene knowledge transfer.

Details

Motivation: Current 3DGS semantic segmentation methods treat semantic and rendering branches separately, rely solely on 2D supervision while ignoring 3D Gaussian geometry, and use rendering gradients that are insufficient in subtle/textureless regions.

Method: 1) Anisotropic 3D Gaussian Chebyshev descriptor using Laplace-Beltrami operator to capture fine-grained 3D shape details; 2) Adaptive Gaussian allocation and spherical harmonics adjustment using local semantic and shape signals; 3) Cross-scene knowledge transfer module for continuous shape pattern updates.

Result: Experiments on multiple datasets show improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.

Conclusion: The proposed framework successfully addresses limitations of existing methods by jointly enhancing semantic and rendering branches, leveraging 3D shape information, and enabling efficient cross-scene knowledge transfer for robust 3D semantic Gaussian modeling.

Abstract: Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace-Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.

[313] Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices

Shahnawaz Alam, Mohammed Mudassir Uddin, Mohammed Kaif Pasha

Main category: cs.CV

TL;DR: Combines neural network pruning with few-shot learning to create lightweight plant disease detection models that run efficiently on low-cost edge devices like Raspberry Pi.

Details

Motivation: Farmers in remote areas need accessible plant disease diagnosis tools, but existing deep learning models are too large for edge devices and require extensive labeled data that's expensive to collect.

Method: Proposes Disease-Aware Channel Importance Scoring (DACIS) integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline that identifies important neural network parts for disease classification and enables learning from limited examples.

Result: Achieves 78% model size reduction while maintaining 92.3% of original accuracy, with compressed model running at 7 FPS on Raspberry Pi 4, enabling real-time field diagnosis.

Conclusion: The approach makes practical, real-time plant disease detection feasible for smallholder farmers using affordable edge devices, addressing both computational and data collection challenges.

Abstract: Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning – removing unnecessary parts of the model – with few-shot learning, which enables the model to learn from limited examples. This paper proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.

[314] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto

Main category: cs.CV

TL;DR: Talk2Move is an RL-based diffusion framework that enables precise spatial transformation of objects in scenes using natural language instructions, outperforming existing text-guided editing methods in spatial accuracy and scene coherence.

Details

Motivation: Existing text-based manipulation methods struggle with object-level geometric transformations (translating, rotating, resizing) due to scarce paired supervision data and limitations of pixel-level optimization approaches.

Method: Uses Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts from input images and lightweight textual variations, eliminating need for costly paired data. Incorporates spatial reward guidance, off-policy step evaluation, active step sampling, and object-centric spatial rewards for displacement, rotation, and scaling.

Result: Outperforms existing text-guided editing approaches in both spatial accuracy and scene coherence on curated benchmarks, achieving precise, consistent, and semantically faithful object transformations.

Conclusion: Talk2Move provides an effective RL-based diffusion framework for text-instructed spatial transformation of objects, enabling interpretable and coherent geometric manipulations without requiring expensive paired supervision data.

Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

[315] VINO: A Unified Visual Generator with Interleaved OmniModal Context

Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye

Main category: cs.CV

TL;DR: VINO is a unified visual generator that handles both image and video generation/editing in a single framework using a shared diffusion backbone with multimodal conditioning.

Details

Motivation: Current approaches rely on task-specific models or independent modules for different modalities (images vs videos), leading to fragmented systems. The authors aim to create a unified framework that can handle diverse visual creation and editing tasks under one model.

Method: VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT). Multimodal inputs (text, images, videos) are encoded as interleaved conditioning tokens to guide the diffusion process. The system uses a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator.

Result: VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits across diverse generation and editing benchmarks. It supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content.

Conclusion: VINO presents a practical path toward scalable unified visual generation and highlights the promise of interleaved, in-context computation as a foundation for general-purpose visual creation, avoiding modality-specific architectural components.

Abstract: We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.

[316] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik

Main category: cs.CV

TL;DR: ExposeAnyone: A self-supervised diffusion-based method for detecting unknown deepfake manipulations by personalizing to specific subjects and measuring identity distances through diffusion reconstruction errors.

Details

Motivation: Current deepfake detection methods fail to generalize to unseen manipulations due to overfitting to specific forgery patterns from supervised training. Self-supervised methods have potential but struggle to learn discriminative representations.

Method: Proposes ExposeAnyone, a fully self-supervised approach using a diffusion model that generates expression sequences from audio. The model is personalized to specific subjects using reference sets, then computes identity distances between suspected videos and personalized subjects via diffusion reconstruction errors.

Result: 1) Outperforms previous SOTA by 4.22 percentage points in average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets; 2) Capable of detecting Sora2-generated videos where previous approaches fail; 3) Highly robust to corruptions like blur and compression.

Conclusion: ExposeAnyone demonstrates superior generalization to unknown deepfake manipulations through self-supervised learning, showing strong performance on various datasets and robustness to real-world corruptions, making it applicable for practical face forgery detection.

Abstract: Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting the applicability in real-world face forgery detection.

[317] Seal2Real: Prompt Prior Learning on Diffusion Model for Unsupervised Document Seal Data Generation and Realisation

Mingfu Yan, Jiancheng Huang, Shifeng Chen

Main category: cs.CV

TL;DR: Seal2Real is a generative framework that synthesizes labeled document seal data using Stable Diffusion to address dataset scarcity, enabling better performance on seal-related document processing tasks.

Details

Motivation: Progress in seal-related document processing tasks (segmentation, verification, removal, text recognition) is hindered by scarcity of labeled datasets needed for supervised learning.

Method: Proposes Seal2Real framework with prompt prior learning architecture built on pre-trained Stable Diffusion model to transfer generative capability to unsupervised seal image synthesis. Also introduces Seal-DB dataset with 20,000 labeled images.

Result: Seal2Real produces highly realistic synthetic seal images that significantly enhance performance of downstream seal-related tasks on real-world data. Experimental evaluations on Seal-DB demonstrate effectiveness and practical value.

Conclusion: The proposed generative framework addresses dataset scarcity for seal-related research, enabling better performance on commercial document processing tasks through synthetic data generation.

Abstract: Seal-related tasks in document processing-such as seal segmentation, authenticity verification, seal removal, and text recognition under seals-hold substantial commercial importance. However, progress in these areas has been hindered by the scarcity of labeled document seal datasets, which are essential for supervised learning. To address this limitation, we propose Seal2Real, a novel generative framework designed to synthesize large-scale labeled document seal data. As part of this work, we also present Seal-DB, a comprehensive dataset containing 20,000 labeled images to support seal-related research. Seal2Real introduces a prompt prior learning architecture built upon a pre-trained Stable Diffusion model, effectively transferring its generative capability to the unsupervised domain of seal image synthesis. By producing highly realistic synthetic seal images, Seal2Real significantly enhances the performance of downstream seal-related tasks on real-world data. Experimental evaluations on the Seal-DB dataset demonstrate the effectiveness and practical value of the proposed framework. The dataset is available at https://github.com/liuyifan6613/DocBank-Document-Enhancement-Dataset.

[318] Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang

Main category: cs.CV

TL;DR: ASVR introduces autoregressive semantic visual reconstruction to jointly learn visual and textual modalities, improving multimodal understanding by reconstructing semantic image representations rather than raw pixels.

Details

Motivation: Current LVLMs have limitations: they can't use images without captions, captions may omit visual details, and some visual content can't be adequately conveyed through text. They prioritize vision-to-language alignment but overlook fine-grained visual information.

Method: Autoregressive Semantic Visual Reconstruction (ASVR) - a unified autoregressive framework that reconstructs semantic representations of images rather than raw visual appearance. Models reconstruct discrete semantic tokens from continuous image features.

Result: ASVR improves LLaVA-1.5 by 5% average score across 14 multimodal benchmarks. Works across varying data scales (556k-2M) and different LLM backbones. Reconstruction of semantic tokens yields stable improvements while raw pixel reconstruction doesn’t help.

Conclusion: Autoregressive reconstruction of semantic visual representations effectively enhances multimodal understanding, addressing limitations of current LVLMs that focus only on textual supervision.

Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM bacbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.

[319] Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

Haopeng Li, Andong Deng, Jun Liu, Hossein Rahmani, Yulan Guo, Bernt Schiele, Mohammed Bennamoun, Qiuhong Ke

Main category: cs.CV

TL;DR: Sports-QA: First sports video question answering dataset with professional action understanding and fine-grained motion analysis, plus Auto-Focus Transformer achieving SOTA performance.

Details

Motivation: Sports video question answering is important for player training and information retrieval but unexplored due to lack of relevant datasets and challenging nature requiring professional action understanding and fine-grained motion analysis.

Method: Introduce Sports-QA dataset with various question types (descriptions, chronologies, causalities, counterfactual conditions) covering multiple sports, and propose Auto-Focus Transformer (AFT) that automatically focuses on particular scales of temporal information.

Result: Extensive experiments on Sports-QA show AFT achieves state-of-the-art performance, demonstrating effectiveness for sports VideoQA task.

Conclusion: Sports-QA fills dataset gap for sports VideoQA, and AFT effectively addresses task characteristics with automatic temporal scale focusing, enabling better sports video understanding.

Abstract: Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task has not been explored due to the lack of relevant datasets and the challenging nature it presents. Most datasets for video question answering (VideoQA) focus mainly on general and coarse-grained understanding of daily-life videos, which is not applicable to sports scenarios requiring professional action understanding and fine-grained motion analysis. In this paper, we introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task. The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions, covering multiple sports. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.

[320] Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou

Main category: cs.CV

TL;DR: SpatialReasoner-R1 is a new vision-language model that improves fine-grained spatial reasoning using M3CTS-generated reasoning trajectories and fine-grained DPO with spatial rewards.

Details

Motivation: Current VLMs struggle with fine-grained spatial reasoning requiring multi-step logic and precise spatial alignment, creating a need for better spatial reasoning capabilities.

Method: 1) M3CTS generates diverse, logically consistent LongCOT reasoning trajectories for supervision. 2) fDPO introduces segment-specific preference granularity with spatial rewards evaluating visual consistency, spatial grounding, and logical coherence.

Result: fDPO achieves 4.1% and 9.0% relative gains over standard DPO on spatial tasks. SpatialReasoner-R1 sets new SoTA on SpatialRGPT-Bench, outperforming strongest baseline by 9.4% average accuracy while maintaining competitive general VLM performance.

Conclusion: The proposed SpatialReasoner-R1 with M3CTS and fDPO effectively addresses fine-grained spatial reasoning limitations in VLMs, achieving state-of-the-art performance on spatial reasoning benchmarks.

Abstract: Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCOT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.

[321] Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering

Haopeng Li, Mohammed Bennamoun, Jun Liu, Hossein Rahmani, Qiuhong Ke

Main category: cs.CV

TL;DR: This paper introduces an uncertainty-aware curriculum learning framework for VideoQA that uses probabilistic modeling to measure difficulty and improve generalization.

Details

Motivation: Existing VideoQA research overlooks the benefits of difficulty scheduling for model generalization. Conventional curriculum learning methods use training loss for difficulty measurement, which may not accurately capture the complexities of video-question pairs.

Method: Proposes uncertainty-aware curriculum learning where uncertainty guides dynamic difficulty adjustment. Models VideoQA as a stochastic computation graph with hidden representations as stochastic variables, yielding two types of uncertainty: data uncertainty and model confidence uncertainty.

Result: Comprehensive experiments show the approach achieves enhanced performance and effectively quantifies uncertainty in VideoQA contexts.

Conclusion: The uncertainty-aware curriculum learning framework successfully bridges the gap in VideoQA research by improving model generalization through tailored difficulty scheduling and uncertainty quantification.

Abstract: While significant advancements have been made in video question answering (VideoQA), the potential benefits of enhancing model generalization through tailored difficulty scheduling have been largely overlooked in existing research. This paper seeks to bridge that gap by incorporating VideoQA into a curriculum learning (CL) framework that progressively trains models from simpler to more complex data. Recognizing that conventional self-paced CL methods rely on training loss for difficulty measurement, which might not accurately reflect the intricacies of video-question pairs, we introduce the concept of uncertainty-aware CL. Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty. Furthermore, we address the challenge posed by uncertainty by presenting a probabilistic modeling approach for VideoQA. Specifically, we conceptualize VideoQA as a stochastic computation graph, where the hidden representations are treated as stochastic variables. This yields two distinct types of uncertainty: one related to the inherent uncertainty in the data and another pertaining to the model’s confidence. In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments. The findings affirm that our approach not only achieves enhanced performance but also effectively quantifies uncertainty in the context of VideoQA.

[322] HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization

Guanglin Zhou, Zhongyi Han, Shiming Chen, Biwei Huang, Liming Zhu, Tongliang Liu, Lina Yao, Kun Zhang

Main category: cs.CV

TL;DR: HCVP introduces hierarchical contrastive visual prompts for domain generalization, using instance-dependent generative prompts to better separate invariant features from domain-specific characteristics.

Details

Motivation: Current DG methods with fixed model structures or uniform parameterization tend to blend domain-specific aspects, struggle with nuanced inter-domain variations, and may exhibit domain bias, hindering precise learning of domain-invariant features.

Method: Hierarchical Contrastive Visual Prompt (HCVP) methodology with: 1) hierarchical prompt generation network enhanced by prompt contrastive learning, 2) instance-dependent generative prompts tailored to different domains and tasks, 3) prompt modulation network to incorporate visual prompts into vision transformer backbone.

Result: Experiments on five DG datasets show HCVP outperforms both established DG algorithms and adaptation protocols.

Conclusion: HCVP represents a significant advancement in DG by using generative, instance-dependent visual prompts to better separate invariant features from specific characteristics, improving generalization to unseen domains.

Abstract: Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features. In DG, the prevalent practice of constraining models to a fixed structure or uniform parameterization to encapsulate invariant features can inadvertently blend specific aspects. Such an approach struggles with nuanced differentiation of inter-domain variations and may exhibit bias towards certain domains, hindering the precise learning of domain-invariant features. Recognizing this, we introduce a novel method designed to supplement the model with domain-level and task-specific characteristics. This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization. Building on the emerging trend of visual prompts in the DG paradigm, our work introduces the novel \textbf{H}ierarchical \textbf{C}ontrastive \textbf{V}isual \textbf{P}rompt (HCVP) methodology. This represents a significant advancement in the field, setting itself apart with a unique generative approach to prompts, alongside an explicit model structure and specialized loss functions. Differing from traditional visual prompts that are often shared across entire datasets, HCVP utilizes a hierarchical prompt generation network enhanced by prompt contrastive learning. These generative prompts are instance-dependent, catering to the unique characteristics inherent to different domains and tasks. Additionally, we devise a prompt modulation network that serves as a bridge, effectively incorporating the generated visual prompts into the vision transformer backbone. Experiments conducted on five DG datasets demonstrate the effectiveness of HCVP, outperforming both established DG algorithms and adaptation protocols.

[323] A Survey on 3D Skeleton Based Person Re-Identification: Taxonomy, Advances, Challenges, and Interdisciplinary Prospects

Haocong Rao, Chunyan Miao

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey of 3D skeleton-based person re-identification (SRID), covering task definition, methodology taxonomy, representative models, learning paradigms, benchmark evaluations, and future directions.

Details

Motivation: 3D skeleton-based person re-identification is an emerging research area with distinctive advantages across various application scenarios, but there lacks a comprehensive review and analysis of recent advances in this field.

Method: The paper organizes existing SRID methods into three categories: hand-crafted, sequence-based, and graph-based modeling approaches. It provides systematic taxonomy, elaborates on representative models, and reviews supervised, self-supervised, and unsupervised learning paradigms.

Result: A thorough evaluation of state-of-the-art SRID methods is conducted across various benchmarks and protocols, comparing their effectiveness, efficiency, and key properties. The paper identifies current limitations and provides performance comparisons.

Conclusion: The survey presents key challenges and prospects for advancing future SRID research, and highlights interdisciplinary applications with a case study, providing guidance for researchers in this emerging field.

Abstract: Person re-identification via 3D skeletons is an important emerging research area that attracts increasing attention within the pattern recognition community. With distinctive advantages across various application scenarios, numerous 3D skeleton based person re-identification (SRID) methods with diverse skeleton modeling and learning paradigms have been proposed in recent years. In this paper, we provide a comprehensive review and analysis of recent SRID advances. First of all, we define the SRID task and provide an overview of its origin and major advancements. Secondly, we formulate a systematic taxonomy that organizes existing methods into three categories centered on hand-crafted, sequence-based, and graph-based modeling. Then, we elaborate on the representative models along these three types with an illustration of foundational mechanisms. Meanwhile, we provide an overview of mainstream supervised, self-supervised, and unsupervised SRID learning paradigms and corresponding common methods. A thorough evaluation of state-of-the-art SRID methods is further conducted over various types of benchmarks and protocols to compare their effectiveness, efficiency, and key properties. Finally, we present the key challenges and prospects to advance future research, and highlight interdisciplinary applications of SRID with a case study.

[324] Attire-Based Anomaly Detection in Restricted Areas Using YOLOv8 for Enhanced CCTV Security

Abdul Aziz A. B, Aindri Bajpai

Main category: cs.CV

TL;DR: A YOLOv8-based surveillance system that detects unauthorized individuals in restricted areas by analyzing uniform patterns in CCTV footage, enhanced with soft computing for adaptability.

Details

Motivation: Traditional security measures struggle with monitoring unauthorized access in restricted areas, creating a need for intelligent surveillance systems that can automatically detect unauthorized individuals based on attire.

Method: Uses YOLOv8 object detection algorithm trained on comprehensive uniform pattern datasets, enhanced with soft computing techniques to handle dynamic environments and varying lighting conditions.

Result: Developed a sophisticated security solution capable of precise uniform-based anomaly detection, establishing a foundation for robust security systems in sensitive locations.

Conclusion: The YOLOv8-based surveillance system demonstrates significant potential for ensuring safety in restricted areas through automated uniform-based unauthorized access detection.

Abstract: This research introduces an innovative security enhancement approach, employing advanced image analysis and soft computing. The focus is on an intelligent surveillance system that detects unauthorized individuals in restricted areas by analyzing attire. Traditional security measures face challenges in monitoring unauthorized access. Leveraging YOLOv8, an advanced object detection algorithm, our system identifies authorized personnel based on their attire in CCTV footage. The methodology involves training the YOLOv8 model on a comprehensive dataset of uniform patterns, ensuring precise recognition in specific regions. Soft computing techniques enhance adaptability to dynamic environments and varying lighting conditions. This research contributes to image analysis and soft computing, providing a sophisticated security solution. Emphasizing uniform-based anomaly detection, it establishes a foundation for robust security systems in restricted areas. The outcomes highlight the potential of YOLOv8-based surveillance in ensuring safety in sensitive locations.

[325] RaffeSDG: Random Frequency Filtering enabled Single-source Domain Generalization for Medical Image Segmentation

Heng Li, Haojin Li, Jianyu Chen, Mingyang Ou, Hai Shu, Heng Miao

Main category: cs.CV

TL;DR: RaffeSDG: A single-source domain generalization method using random frequency filtering for medical image segmentation that handles domain shifts in data-scarce scenarios.

Details

Motivation: Deep learning models struggle with domain shifts between source and target data, especially in clinical settings where annotated medical data is scarce due to privacy and professional constraints. Existing cross-domain strategies are limited by data constraints and computational costs.

Method: Proposes Random frequency filtering enabled Single-source Domain Generalization (RaffeSDG): 1) Frequency filter-based data augmentation to introduce domain variability within single-source domain by varying frequency space and blending homologous samples, 2) Gaussian filter-based structural saliency to learn robust representations across augmented samples for training generalizable segmentation models.

Result: Extensive experiments on segmentation tasks for three human tissues imaged by four diverse modalities show compelling evidence of RaffeSDG’s effectiveness for out-of-domain inference, demonstrating its potential and generalizability.

Conclusion: RaffeSDG enables robust out-of-domain inference with segmentation models trained on single-source domains, addressing domain shift challenges in data-scarce medical scenarios through frequency-based augmentation and structural saliency learning.

Abstract: Deep learning models often encounter challenges in making accurate inferences when there are domain shifts between the source and target data. This issue is particularly pronounced in clinical settings due to the scarcity of annotated data resulting from the professional and private nature of medical data. Although various cross-domain strategies have been explored, including frequency-based approaches that vary appearance while preserving semantics, many remain limited by data constraints and computational cost. To tackle domain shifts in data-scarce medical scenarios, we propose a Random frequency filtering enabled Single-source Domain Generalization algorithm (RaffeSDG), which promises robust out-of-domain inference with segmentation models trained on a single-source domain. A frequency filter-based data augmentation strategy is first proposed to promote domain variability within a single-source domain by introducing variations in frequency space and blending homologous samples. Then Gaussian filter-based structural saliency is also leveraged to learn robust representations across augmented samples, further facilitating the training of generalizable segmentation models. To validate the effectiveness of RaffeSDG, we conducted extensive experiments involving out-of-domain inference on segmentation tasks for three human tissues imaged by four diverse modalities. Through thorough investigations and comparisons, compelling evidence was observed in these experiments, demonstrating the potential and generalizability of RaffeSDG. The code is available at https://github.com/liamheng/Non-IID_Medical_Image_Segmentation.

[326] PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation

Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Juan Yun, Se Hong Park, Sung Won Han

Main category: cs.CV

TL;DR: PrevMatch is a simple plug-in method for semi-supervised semantic segmentation that uses previous model snapshots to generate pseudo-label guidance, improving performance with minimal computational overhead.

Details

Motivation: Existing semi-supervised segmentation methods (Mean Teacher, co-training) suffer from complex training pipelines and high computational burden, limiting their scalability and compatibility.

Method: Two core strategies: (1) using previous model snapshots to generate additional pseudo-label guidance (previous guidance), and (2) a highly randomized ensemble strategy to maximize effectiveness of previous guidance.

Result: Experimental results on three benchmark semantic segmentation datasets show significant performance improvements when integrating PrevMatch into existing methods. Analysis indicates stable optimization and improved generalization.

Conclusion: PrevMatch is an effective, simple plug-in method that can be seamlessly integrated into existing semi-supervised learning frameworks with minimal computational overhead, addressing limitations of current approaches.

Abstract: In semi-supervised semantic segmentation, the Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. However, despite their high performance, these approaches frequently involve complex training pipelines and a substantial computational burden, limiting the scalability and compatibility of these methods. In this paper, we propose a PrevMatch framework that effectively mitigates the aforementioned limitations by maximizing the utilization of the temporal knowledge obtained during the training process. The PrevMatch framework relies on two core strategies: (1) we reconsider the use of temporal knowledge and thus directly utilize previous models obtained during training to generate additional pseudo-label guidance, referred to as previous guidance. (2) we design a highly randomized ensemble strategy to maximize the effectiveness of the previous guidance. PrevMatch, a simple yet effective plug-in method, can be seamlessly integrated into existing semi-supervised learning frameworks with minimal computational overhead. Experimental results on three benchmark semantic segmentation datasets show that incorporating PrevMatch into existing methods significantly improves their performance. Furthermore, our analysis indicates that PrevMatch facilitates stable optimization during training, resulting in improved generalization performance.

[327] Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

Xinhai Chang, Kaichen Zhou

Main category: cs.CV

TL;DR: EpiS is a generalizable neural surface reconstruction framework that uses epipolar geometry and epipolar transformers to improve sparse-view reconstruction quality, outperforming existing methods without per-scene optimization.

Details

Motivation: Existing generalizable neural surface reconstruction methods rely on cost volumes with simple statistics (mean/variance) that discard view-dependent geometric structure, leading to over-smoothed reconstructions, especially under sparse-view settings with severe geometric ambiguity and occlusions.

Method: EpiS leverages epipolar geometry by using coarse cost-volume features to guide aggregation of fine-grained epipolar features sampled along epipolar lines across source views. An epipolar transformer fuses multi-view information, followed by ray-wise aggregation to produce SDF-aware features. Additionally, a geometry regularization strategy uses a pretrained monocular depth model with scale-invariant global and local constraints to mitigate information loss under sparse views.

Result: Extensive experiments on DTU and BlendedMVS datasets show that EpiS significantly outperforms state-of-the-art generalizable surface reconstruction methods under sparse-view settings, while maintaining strong generalization without requiring per-scene optimization.

Conclusion: EpiS successfully addresses sparse-view surface reconstruction challenges by explicitly incorporating epipolar geometry and effective regularization, demonstrating superior performance and generalization capability compared to existing methods.

Abstract: Reconstructing accurate surfaces from sparse multi-view images remains challenging due to severe geometric ambiguity and occlusions. Existing generalizable neural surface reconstruction methods primarily rely on cost volumes that summarize multi-view features using simple statistics (e.g., mean and variance), which discard critical view-dependent geometric structure and often lead to over-smoothed reconstructions. We propose EpiS, a generalizable neural surface reconstruction framework that explicitly leverages epipolar geometry for sparse-view inputs. Instead of directly regressing geometry from cost-volume statistics, EpiS uses coarse cost-volume features to guide the aggregation of fine-grained epipolar features sampled along corresponding epipolar lines across source views. An epipolar transformer fuses multi-view information, followed by ray-wise aggregation to produce SDF-aware features for surface estimation. To further mitigate information loss under sparse views, we introduce a geometry regularization strategy that leverages a pretrained monocular depth model through scale-invariant global and local constraints. Extensive experiments on DTU and BlendedMVS demonstrate that EpiS significantly outperforms state-of-the-art generalizable surface reconstruction methods under sparse-view settings, while maintaining strong generalization without per-scene optimization.

[328] Training-Free Video Editing via Optical Flow-Enhanced Score Distillation

Lianghan Zhu, Yanqi Bao, Jing Huo, Jing Wu, Yu-Kun Lai, Wenbin Li, Yang Gao

Main category: cs.CV

TL;DR: Proposes a score distillation paradigm for training-free video editing that iteratively optimizes original videos using editing gradients, with preservation losses and smoothing techniques to maintain unedited regions and ensure temporal consistency.

Details

Motivation: Current training-free video editing methods suffer from lossy inversion processes that damage unedited regions, and feature/attention manipulation causes over-editing and temporal inconsistency problems.

Method: Uses score distillation from pre-trained text-to-video models to iteratively optimize original videos with editing gradients. Adds content preservation loss, global consistency auxiliary loss, and optical flow-based local editing gradient smoothing.

Result: Achieves comparable or superior performance in preserving unedited regions, maintaining local temporal continuity, and ensuring global content consistency compared to state-of-the-art methods.

Conclusion: The proposed score distillation paradigm with preservation and smoothing techniques effectively addresses key challenges in training-free video editing, offering improved quality and consistency.

Abstract: The rapid advancement in visual generation, particularly the emergence of pre-trained text-to-image and text-to-video models, has catalyzed growing interest in training-free video editing research. Mirroring training-free image editing techniques, current approaches preserve original video information through video input inversion and manipulating intermediate features and attention during the inference process to achieve content editing. Although they have demonstrated promising results, the lossy nature of the inversion process poses significant challenges in maintaining unedited regions of the video. Furthermore, feature and attention manipulation during inference can lead to unintended over-editing and face challenges in both local temporal continuity and global content consistency. To address these challenges, this study proposes a score distillation paradigm based on pre-trained text-to-video models, where the original video is iteratively optimized through multiple steps guided by editing gradients provided by score distillation to ultimately obtain the target video. The iterative optimization starting from the original video, combined with content preservation loss, ensures the maintenance of unedited regions in the original video and suppresses over-editing. To further guarantee video content consistency and temporal continuity, we additionally introduce a global consistency auxiliary loss and optical flow prediction-based local editing gradient smoothing. Experiments demonstrate that these strategies effectively address the aforementioned challenges, achieving comparable or superior performance across multiple dimensions including preservation of unedited regions, local temporal continuity, and global content consistency of editing results, compared to state-of-the-art methods.

[329] ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation

Hesam Hosseini, Ghazal Hosseini Mighan, Amirabbas Afzali, Sajjad Amini, Amir Houmansadr

Main category: cs.CV

TL;DR: ULTra is a framework for interpreting Transformer embeddings and performing unsupervised semantic segmentation without fine-tuning, achieving state-of-the-art performance and demonstrating broad applicability across vision and language tasks.

Details

Motivation: Transformers have revolutionized Computer Vision but their complex self-attention mechanisms make latent token representations difficult to interpret, creating a need for better interpretability tools.

Method: ULTra framework interprets Transformer embeddings to uncover semantic patterns, enables unsupervised semantic segmentation using pre-trained models without fine-tuning, and includes a self-supervised training approach that learns an external transformation matrix without modifying the underlying model.

Result: Achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing segmentation methods, and successfully validates on both synthetic and real-world scenarios including Object Selection and interpretable text summarization using LLMs.

Conclusion: ULTra provides an effective framework for explaining the semantic structure of latent token representations in Transformers, demonstrating broad applicability across computer vision and language tasks while maintaining model integrity through external transformation learning.

Abstract: Transformers have revolutionized Computer Vision (CV) through self-attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation using pre-trained models without requiring fine-tuning. Additionally, we propose a self-supervised training approach that refines segmentation performance by learning an external transformation matrix without modifying the underlying model. Our method achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing segmentation methods. Furthermore, we validate ULTra for model interpretation on both synthetic and real-world scenarios, including Object Selection and interpretable text summarization using LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.

[330] Towards Vision-Language Geo-Foundation Model: A Survey

Yue Zhou, Zhihang Zhong, Xue Yang

Main category: cs.CV

TL;DR: This paper provides the first comprehensive review of Vision-Language Geo-Foundation Models (VLGFMs), which are specialized vision-language models fine-tuned for earth observation tasks using geospatial image-text data.

Details

Motivation: Standard Vision-Language Foundation Models (VLFMs) trained on general image datasets perform poorly on earth observation tasks due to lack of geospatial data. The rise of geospatial image-text datasets has enabled the development of specialized VLGFMs with geo-perceptive capabilities.

Method: The paper systematically reviews VLGFMs by analyzing their core technologies: data construction methods for geospatial image-text pairs, model architectures adapted for earth observation, and applications across various multimodal geospatial tasks.

Result: This is the first comprehensive literature review of VLGFMs, providing a systematic analysis of recent developments in the field and maintaining an updated repository of related works.

Conclusion: The review concludes with insights, identifies current issues, and discusses future research directions for VLGFMs, highlighting their growing importance in earth observation and geospatial intelligence.

Abstract: Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at https://github.com/zytx121/Awesome-VLGFM.

[331] RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

Kaichen Zhou, Xinhai Chang, Taewhan Kim, Jiadong Zhang, Yang Cao, Chufei Peng, Fangneng Zhan, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu

Main category: cs.CV

TL;DR: RAD is a realistic robot-captured multi-view anomaly detection dataset with 13 object categories and 4 defect types, showing that 2D feature methods outperform 3D and VLM approaches in image-level detection.

Details

Motivation: Existing anomaly detection benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios in robotics and industrial inspection.

Method: Introduce RAD dataset captured by robot with over 60 viewpoints per object under uncontrolled lighting, covering 13 object categories and 4 realistic defect types. Benchmark 2D feature-based methods, 3D reconstruction pipelines, and vision-language models under pose-agnostic setting.

Result: Mature 2D feature-embedding methods consistently outperform recent 3D and VLM-based approaches at image level, while performance gap narrows for pixel-level localization. Reflective surfaces, geometric symmetry, and sparse viewpoint coverage limit current geometry-based and zero-shot methods.

Conclusion: RAD establishes a challenging realistic benchmark for robotic anomaly detection, highlighting critical open problems beyond controlled laboratory settings, particularly for reflective surfaces and viewpoint-dependent defect visibility.

Abstract: Anomaly detection is a core capability for robotic perception and industrial inspection, yet most existing benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios. We introduce RAD (Realistic Anomaly Detection), a robot-captured, multi-view dataset designed to stress pose variation, reflective materials, and viewpoint-dependent defect visibility. RAD covers 13 everyday object categories and four realistic defect types–scratched, missing, stained, and squeezed–captured from over 60 robot viewpoints per object under uncontrolled lighting. We benchmark a wide range of state-of-the-art approaches, including 2D feature-based methods, 3D reconstruction pipelines, and vision-language models (VLMs), under a pose-agnostic setting. Surprisingly, we find that mature 2D feature-embedding methods consistently outperform recent 3D and VLM-based approaches at the image level, while the performance gap narrows for pixel-level localization. Our analysis reveals that reflective surfaces, geometric symmetry, and sparse viewpoint coverage fundamentally limit current geometry-based and zero-shot methods. RAD establishes a challenging and realistic benchmark for robotic anomaly detection, highlighting critical open problems beyond controlled laboratory settings.

[332] TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari

Main category: cs.CV

TL;DR: A dual-branch architecture for online detection of open-set procedural errors from egocentric videos, using step recognition and LLM-based anticipation to identify mistakes as mismatches between recognized and predicted actions.

Details

Motivation: Procedural error detection from egocentric videos is critical but challenging due to the open-set nature of mistakes (unforeseen errors may occur). Current techniques fail to effectively detect open-set procedural mistakes in online settings.

Method: Dual-branch architecture: (1) Recognition branch continuously performs step recognition from input frames and aggregates results into action tokens; (2) Anticipation branch uses Large Language Models (LLMs) to predict future action tokens based on previously predicted ones. Mistakes are detected as mismatches between recognized and anticipated actions.

Result: Extensive experiments on two procedural datasets demonstrate the method’s effectiveness and robustness in online applications, outperforming recognition and anticipation variants and state-of-the-art models in thorough evaluations.

Conclusion: The proposed dual-branch architecture effectively addresses the challenge of online open-set procedural error detection by leveraging step recognition and LLM-based anticipation, showing promise for applications in manufacturing, healthcare, and skill-based training.

Abstract: Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module’s output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications.

[333] MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

Haopeng Fang, Di Qiu, Binjie Mao, He Tang

Main category: cs.CV

TL;DR: MotionCharacter is a framework for personalized text-to-video generation that enables fine-grained control over motion intensity while preserving character identity, addressing limitations in existing methods.

Details

Motivation: Existing personalized T2V methods cannot control motion intensity precisely due to entanglement of action semantics and magnitudes in text descriptions. They also struggle with identity preservation when modifying other attributes, limiting applications like virtual avatars and micro-expressions.

Method: MotionCharacter decouples motion into action type and intensity using: 1) Motion Control Module with text phrases for action type and optical flow metrics for intensity, guided by region-aware loss; 2) ID Content Insertion Module with ID-Consistency loss for identity preservation. Trained on Human-Motion dataset with detailed motion and facial annotations.

Result: Extensive experiments show MotionCharacter substantially improves over existing methods, generating identity-consistent videos that precisely adhere to specified motion types and intensities.

Conclusion: MotionCharacter successfully addresses the limitations of existing personalized T2V methods by enabling fine-grained motion control while maintaining high identity fidelity, advancing capabilities for applications requiring precise motion synthesis.

Abstract: Recent advancements in personalized Text-to-Video (T2V) generation have made significant strides in synthesizing character-specific content. However, these methods face a critical limitation: the inability to perform fine-grained control over motion intensity. This limitation stems from an inherent entanglement of action semantics and their corresponding magnitudes within coarse textual descriptions, hindering the generation of nuanced human videos and limiting their applicability in scenarios demanding high precision, such as animating virtual avatars or synthesizing subtle micro-expressions. Furthermore, existing approaches often struggle to preserve high identity fidelity when other attributes are modified. To address these challenges, we introduce MotionCharacter, a framework for high-fidelity human video generation with precise motion control. At its core, MotionCharacter explicitly decouples motion into two independently controllable components: action type and motion intensity. This is achieved through two key technical contributions: (1) a Motion Control Module that leverages textual phrases to specify the action type and a quantifiable metric derived from optical flow to modulate its intensity, guided by a region-aware loss that localizes motion to relevant subject areas; and (2) an ID Content Insertion Module coupled with an ID-Consistency loss to ensure robust identity preservation during dynamic motions. To facilitate training for such fine-grained control, we also curate Human-Motion, a new large-scale dataset with detailed annotations for both motion and facial features. Extensive experiments demonstrate that MotionCharacter achieves substantial improvements over existing methods. Our framework excels in generating videos that are not only identity-consistent but also precisely adhere to specified motion types and intensities.

[334] AH-GS: Augmented 3D Gaussian Splatting for High-Frequency Detail Representation

Chenyang Xu, XingGuo Deng, Rui Zhong

Main category: cs.CV

TL;DR: AH-GS improves 3D Gaussian Splatting by enhancing high-frequency information learning through manifold complexity enhancement and network-based feature map loss, achieving better rendering quality than Scaffold-GS.

Details

Motivation: Scaffold-GS has limitations in fine-grained rendering due to its dependence on adequate viewing angles and poor ability to learn high-frequency information caused by neural network spectral bias.

Method: Proposes AH-GS which: 1) enhances manifold complexity of input features to give 3D Gaussians in complex regions higher-frequency encodings, 2) uses network-based feature map loss, and 3) incorporates high-frequency reinforce loss to improve detailed frequency information capture.

Result: Significantly improves rendering fidelity; in MipNeRF360-garden scenario, exceeds Scaffold-GS rendering quality in just 15K iterations.

Conclusion: AH-GS effectively addresses the high-frequency learning limitations of 3D-GS models, achieving superior rendering quality through enhanced feature encoding and specialized loss functions.

Abstract: The 3D Gaussian Splatting (3D-GS) is a novel method for scene representation and view synthesis. Although Scaffold-GS achieves higher quality real-time rendering compared to the original 3D-GS, its fine-grained rendering of the scene is extremely dependent on adequate viewing angles. The spectral bias of neural network learning results in Scaffold-GS’s poor ability to perceive and learn high-frequency information in the scene. In this work, we propose enhancing the manifold complexity of input features and using network-based feature map loss to improve the image reconstruction quality of 3D-GS models. We introduce AH-GS, which enables 3D Gaussians in structurally complex regions to obtain higher-frequency encodings, allowing the model to more effectively learn the high-frequency information of the scene. Additionally, we incorporate high-frequency reinforce loss to further enhance the model’s ability to capture detailed frequency information. Our result demonstrates that our model significantly improves rendering fidelity, and in specific scenarios (e.g., MipNeRf360-garden), our method exceeds the rendering quality of Scaffold-GS in just 15K iterations.

Dillon Loh, Tomasz Bednarz, Xinxing Xia, Frank Guan

Main category: cs.CV

TL;DR: AdaVLN extends Visual Language Navigation to include dynamic human obstacles in 3D indoor environments, with new simulator and dataset support.

Details

Motivation: Real-world navigation must handle dynamic human obstacles, but previous VLN research focused on static settings, creating a sim-to-real gap.

Method: Proposed Adaptive Visual Language Navigation (AdaVLN) task with AdaVLN simulator and AdaR2R datasets, featuring animated human models and “freeze-time” mechanism for fair comparisons.

Result: Evaluated baseline models on AdaVLN, analyzed unique challenges, and demonstrated potential to bridge sim-to-real gap in VLN research.

Conclusion: AdaVLN addresses the limitation of static VLN by introducing dynamic human obstacles, providing better simulation of real-world navigation challenges.

Abstract: Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real-world. To support exploration of this task, we also present AdaVLN simulator and AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a “freeze-time” mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.

[336] SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection

Joongwon Chae, Zhenyu Wang, Peiwu Qin

Main category: cs.CV

TL;DR: SJTU framework bridges vision-language models with segmentation using spatial coordinate detection, achieving competitive IoU scores on COCO and Pascal VOC with practical inference times.

Details

Motivation: Existing vision-language models have limitations in fine-grained spatial localization and segmentation capabilities, creating a need for better integration of segmentation with multimodal understanding.

Method: Uses spatial coordinate understanding through normalized coordinate detection for bounding boxes, then transforms them into segmentation outputs, connecting spatial and language representations in multimodal architectures.

Result: Achieves IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC, with average inference time of 7 seconds per image on NVIDIA RTX 3090 GPU at 512x512 resolution.

Conclusion: SJTU framework effectively bridges vision-language interaction with precise segmentation through spatial coordinate detection, demonstrating both accuracy and practical deployability for multimodal segmentation tasks.

Abstract: Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By utilizing normalized coordinate detection for bounding boxes and transforming them into actionable segmentation outputs, we establish a connection between spatial and language representations in multimodal architectures. Experimental results demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU with 512x512 resolution images yields an average inference time of 7 seconds per image, demonstrating the framework’s effectiveness in both accuracy and practical deployability. The project code is available at https://github.com/jw-chae/SJTU

[337] Bridging Geometry and Appearance: Topological Features for Robust Self-Supervised Segmentation

Kebin Peng, Haotang Li, Zhenyu Qi, Huashan Chen, Zi Wang, Wei Zhang, Sen He, Huanrui Yang, Qing Guo

Main category: cs.CV

TL;DR: GASeg is a self-supervised semantic segmentation framework that uses topological information to bridge appearance and geometry, addressing appearance ambiguities through differentiable box-counting and adversarial topological augmentation.

Details

Motivation: Self-supervised semantic segmentation methods often fail with appearance ambiguities due to over-reliance on unstable appearance-based features like shadows, glare, and local textures. There's a need to incorporate more stable geometric/topological information.

Method: GASeg uses Differentiable Box-Counting (DBC) to quantify multi-scale topological statistics from geometric and appearance feature streams. It employs Topological Augmentation (TopoAug) - adversarial morphological operators to simulate real-world ambiguities. A multi-objective GALoss enforces cross-modal alignment between geometric and appearance features.

Result: GASeg achieves state-of-the-art performance on four benchmarks: COCO-Stuff, Cityscapes, and PASCAL, demonstrating the effectiveness of bridging geometry and appearance via topological information.

Conclusion: The paper shows that incorporating stable topological information through the proposed GASeg framework effectively addresses appearance ambiguities in self-supervised semantic segmentation, leading to superior performance across multiple benchmarks.

Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose \textbf{GASeg}, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is Differentiable Box-Counting (\textbf{DBC}) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (\textbf{TopoAug}), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, \textbf{GALoss}, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.

[338] Point Cloud to Mesh Reconstruction: Methods, Trade-offs, and Implementation Guide

Fatima Zahra Iguenfer, Achraf Hsain, Hiba Amissa, Yousra Chtouki

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey and practical guide for learning-based 3D mesh reconstruction from point clouds, categorizing methods into five paradigms and offering decision frameworks, failure analysis, benchmarks, and implementation resources.

Details

Motivation: Mesh reconstruction from point clouds is crucial for computer vision applications in robotics, autonomous systems, and medical imaging, but practitioners face challenges in selecting appropriate methods due to trade-offs between computational efficiency, geometric accuracy, and output constraints.

Method: The paper categorizes over 15 methods into five paradigms: PointNet family, autoencoder architectures, deformation-based methods, point-move techniques, and primitive-based approaches. It provides a decision framework mapping requirements to suitable methods, analyzes failure modes, conducts standardized ShapeNet benchmark comparisons, and curates maintained codebases.

Result: The work establishes a systematic framework for method selection, identifies common failure patterns in mesh reconstruction implementations, provides standardized performance comparisons across different paradigms, and offers practical implementation resources through curated codebases.

Conclusion: This survey serves as a comprehensive entry point for practitioners and researchers, bridging theoretical foundations with practical considerations to facilitate informed method selection and implementation in learning-based 3D mesh reconstruction from point clouds.

Abstract: Reconstructing meshes from point clouds is a fundamental task in computer vision with applications spanning robotics, autonomous systems, and medical imaging. Selecting an appropriate learning-based method requires understanding trade-offs between computational efficiency, geometric accuracy, and output constraints. This paper categorizes over fifteen methods into five paradigms – PointNet family, autoencoder architectures, deformation-based methods, point-move techniques, and primitive-based approaches – and provides practical guidance for method selection. We contribute: (1) a decision framework mapping input/output requirements to suitable paradigms, (2) a failure mode analysis to assist practitioners in debugging implementations, (3) standardized comparisons on ShapeNet benchmarks, and (4) a curated list of maintained codebases with implementation resources. By synthesizing both theoretical foundations and practical considerations, this work serves as an entry point for practitioners and researchers new to learning-based 3D mesh reconstruction.

[339] VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Dan Zhang, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong

Main category: cs.CV

TL;DR: VisionReward is a framework for learning human visual preferences in image/video generation using hierarchical assessment and interpretable linear weighting, outperforming existing reward models.

Details

Motivation: Current visual generative models lack proper alignment with human preferences, and existing reward models have limitations like lack of interpretability and potential biases.

Method: Uses hierarchical visual assessment to capture fine-grained preferences, linear weighting for interpretability, and multi-dimensional consistency strategy for preference optimization.

Result: Outperforms existing image/video reward models on both metrics and human evaluation: 17.2% better preference prediction than VideoScore, and 31.6% higher pairwise win rate for text-to-video models.

Conclusion: VisionReward provides an effective, interpretable framework for aligning visual generation with human preferences, with code and datasets publicly available.

Abstract: Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and potentially resultant unexpected biases. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverages linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.

[340] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong

Main category: cs.CV

TL;DR: Dream-VL is a diffusion-based vision-language model that outperforms autoregressive VLMs on visual planning tasks, and Dream-VLA extends this to vision-language-action with superior robotic control performance.

Details

Motivation: Autoregressive VLMs have limitations in complex visual planning and dynamic robotic control due to sequential generation. The authors investigate diffusion-based LLMs as a foundation for VLMs to overcome these limitations.

Method: Introduce Dream-VL, a diffusion-based VLM built on diffusion LLMs. Then develop Dream-VLA through continuous pre-training on open robotic datasets, leveraging the bidirectional nature of diffusion models for action chunking and parallel generation.

Result: Dream-VL achieves SOTA among dVLMs and comparable to top AR-based VLMs. Dream-VLA achieves 97.2% success on LIBERO, 71.4% on SimplerEnv-Bridge, and 60.5% on SimplerEnv-Fractal, surpassing leading models like π₀ and GR00T-N1.

Conclusion: Diffusion-based VLMs/VLAs offer superior performance for visual planning and robotic control tasks compared to autoregressive baselines, with faster convergence and better action generation capabilities.

Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

[341] TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation

Yixiang Zhuang, Chunshan Ma, Yao Cheng, Xuan Cheng, Jing Liao, Juncong Lin

Main category: cs.CV

TL;DR: TalkingEyes: A novel data-driven method that generates diverse 3D eye gaze motions synchronized with speech, addressing a gap in speech-driven facial animation by jointly modeling head and eye gaze motions in separate latent spaces.

Details

Motivation: Current speech-driven 3D facial animation research has overlooked eye gaze animation due to weak correlation between speech and eye gaze, and scarcity of audio-gaze data, making it challenging to generate realistic 3D eye gaze motion from speech alone.

Method: 1) Constructed a 14-hour audio-gaze dataset with eye gaze, head, and facial motions from existing audio-visual datasets using lightweight eye gaze fitting and face reconstruction. 2) Developed a speech-to-motion translation framework that jointly generates head and eye gaze motions from speech but models them in separate latent spaces, reflecting physiological differences in rotation ranges.

Result: The method successfully generates diverse and natural 3D eye gaze motions from speech, as demonstrated through extensive quantitative and qualitative evaluations. The integrated TalkingEyes system synthesizes eye gaze motion, eye blinks, head motion, and facial motion collectively from speech.

Conclusion: The proposed approach effectively addresses the challenge of speech-driven eye gaze animation by leveraging physiological knowledge and separate latent space modeling, enabling more complete and realistic speech-driven 3D facial animation that includes previously overlooked eye gaze components.

Abstract: Although significant progress has been made in the field of speech-driven 3D facial animation recently, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked by recent research. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, making it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method which can generate diverse 3D eye gaze motions in harmony with the speech. To achieve this, we firstly construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which the head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of eyeballs is less than that of head. Through mapping the speech embedding into the two latent spaces, the difficulty in modeling the weak correlation between speech and non-verbal motion is thus attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/

[342] VALLR: Visual ASR Language Model for Lip Reading

Marshall Thomas, Edward Fish, Richard Bowden

Main category: cs.CV

TL;DR: A novel two-stage phoneme-centric framework for lip reading that first predicts phonemes from visual inputs using a Video Transformer, then uses an LLM to reconstruct coherent text, achieving state-of-the-art performance with dramatically less training data.

Details

Motivation: Current lip reading methods struggle with high error rates due to coarticulation effects and viseme ambiguity, where different phonemes appear identical visually. Direct word prediction from visual cues is challenging because of these overlapping visemes and lack of auditory information.

Method: Two-stage phoneme-centric framework: 1) Video Transformer with CTC head predicts compact phoneme sequences from visual inputs, providing speaker invariance and reduced complexity; 2) Fine-tuned Large Language Model reconstructs coherent words and sentences using broader linguistic context from the phoneme sequences.

Result: State-of-the-art performance on LRS2 and LRS3 datasets, achieving SOTA WER of 18.7 on LRS3 while using 99.4% less labeled data than the next best approach. Significant reductions in Word Error Rate compared to existing methods.

Conclusion: The phoneme-centric approach effectively addresses viseme ambiguity and coarticulation challenges in lip reading by leveraging intermediate linguistic structure, achieving superior performance with high data efficiency compared to direct word prediction or multimodal pre-training methods.

Abstract: Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly-often faltering on visually similar phonemes-or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER) achieving a SOTA WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.

[343] Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

Main category: cs.CV

TL;DR: HR2G-shot is a hierarchical framework for few-shot action recognition that unifies inter-frame, inter-video, and inter-task relation modeling to capture shared temporal patterns across videos and historical tasks.

Details

Motivation: Existing few-shot action recognition methods treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, failing to capture shared temporal patterns across videos and reuse knowledge from historical tasks.

Method: HR2G-shot uses a hierarchical framework with three components: 1) inter-frame temporal modeling, 2) Inter-video Semantic Correlation (ISC) for cross-video frame-level interactions, and 3) Inter-task Knowledge Transfer (IKT) to retrieve and aggregate relevant temporal knowledge from a bank of historical episode tasks.

Result: Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading few-shot action recognition methods.

Conclusion: The proposed hierarchical relation-augmented representation generalization framework effectively captures task-specific temporal patterns from a holistic view by unifying inter-frame, inter-video, and inter-task relation modeling.

Abstract: Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.

[344] Quantifying task-relevant representational similarity using decision variable correlation

Yu, Qian, Wilson S. Geisler, Xue-Xin Wei

Main category: cs.CV

TL;DR: The paper introduces Decision Variable Correlation (DVC) to compare decision strategies between models and brains, finding that while models resemble each other, they diverge from monkey visual cortex representations despite high ImageNet performance.

Details

Motivation: Previous studies show conflicting results about similarity between neural representations in visual cortex and deep neural networks. There's a need for a method that specifically compares task-relevant decision strategies rather than general representational alignment.

Method: Proposed Decision Variable Correlation (DVC) measures image-by-image correlation between decoded decisions based on internal neural representations in classification tasks. Applied to monkey V4/IT recordings and various network models trained on image classification.

Result: Model-model similarity matches monkey-monkey similarity, but model-monkey similarity is consistently lower. DVC decreases with better ImageNet performance. Adversarial training and larger datasets don’t improve model-monkey similarity, though they increase model-model similarity.

Conclusion: Task-relevant representations in monkey V4/IT diverge from those learned by image classification models, suggesting current models don’t capture biological decision strategies despite high performance on benchmark tasks.

Abstract: Previous studies have compared neural activities in the visual cortex to representations in deep neural networks trained on image classification. Interestingly, while some suggest that their representations are highly similar, others argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the image-by-image correlation between the decoded decisions based on the internal neural representations in a classification task. Thus, it can capture task-relevant information rather than general representational alignment. We evaluate DVC using monkey V4/IT recordings and network models trained on image classification tasks. We find that model-model similarity is comparable to monkey-monkey similarity, whereas model-monkey similarity is consistently lower. Strikingly, DVC decreases with increasing network performance on ImageNet-1k. Adversarial training does not improve model-monkey similarity in task-relevant dimensions assessed using DVC, although it markedly increases the model-model similarity. Similarly, pre-training on larger datasets does not improve model-monkey similarity. These results suggest a divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.

[345] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Jianghao Wu, Yicheng Wu, Yutong Xie, Wenjia Bai, You Zhang, Feilong Tang, Yulong Li, Imran Razzak, Daniel F Schmidt, Yasmeen George

Main category: cs.CV

TL;DR: SAM-TTA is a lightweight test-time adaptation framework that enhances Segment Anything Model’s medical image segmentation by addressing channel and semantic discrepancies through bezier curve transformation and multi-scale adaptation.

Details

Motivation: SAM has limited adaptability across diverse medical domains, and fine-tuned variants like MedSAM lack generalizability to unseen data. There's a need to preserve SAM's inherent generalization while improving medical image segmentation accuracy.

Method: Proposes SAM-aware Test-time Adaptation (SAM-TTA) with two components: 1) Self-adaptive Bezier Curve-based Transformation (SBCT) maps single-channel medical images to SAM-compatible three-channel images using learnable parameters optimized at test time; 2) IoU-guided Multi-scale Adaptation (IMA) leverages SAM’s intrinsic IoU scores to enforce high output confidence, dual-scale prediction consistency, and intermediate feature consistency.

Result: Extensive experiments on eight public medical image segmentation tasks (six grayscale, two color endoscopic) show SAM-TTA consistently outperforms state-of-the-art test-time adaptation methods. On six grayscale datasets, it surpasses fully fine-tuned models with average 4.8% and 7.4% Dice improvements over MedSAM and SAM-Med2D.

Conclusion: SAM-TTA establishes a new paradigm for universal medical image segmentation by preserving SAM’s generalization ability while significantly improving segmentation accuracy through lightweight test-time adaptation, addressing both channel and semantic discrepancies between natural and medical images.

Abstract: Leveraging the Segment Anything Model (SAM) for medical image segmentation remains challenging due to its limited adaptability across diverse medical domains. Although fine-tuned variants, such as MedSAM, improve performance in scenarios similar to the training modalities or organs, they may lack generalizability to unseen data. To overcome this limitation, we propose SAM-aware Test-time Adaptation (SAM-TTA), a lightweight and flexible framework that preserves SAM’s inherent generalization ability while enhancing segmentation accuracy for medical images. SAM-TTA tackles two major challenges: (1) input-level discrepancy caused by channel mismatches between natural and medical images, and (2) semantic-level discrepancy due to different object characteristics in natural versus medical images (e.g., with clear boundaries vs. ambiguous structures). To this end, we introduce two complementary components: a self-adaptive Bezier Curve-based Transformation (SBCT), which maps single-channel medical images into SAM-compatible three-channel images via a few learnable parameters to be optimized at test time; and IoU-guided Multi-scale Adaptation (IMA), which leverages SAM’s intrinsic IoU scores to enforce high output confidence, dual-scale prediction consistency, and intermediate feature consistency, to improve semantic-level alignments. Extensive experiments on eight public medical image segmentation tasks, covering six grayscale and two color (endoscopic) tasks, demonstrate that SAM-TTA consistently outperforms state-of-the-art test-time adaptation methods. Notably, on six grayscale datasets, SAM-TTA even surpasses fully fine-tuned models, achieving significant Dice improvements (i.e., average 4.8% and 7.4% gains over MedSAM and SAM-Med2D) and establishing a new paradigm for universal medical image segmentation. Code is available at https://github.com/JianghaoWu/SAM-TTA.

[346] TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading

Byung Hoon Lee, Wooseok Shin, Sung Won Han

Main category: cs.CV

TL;DR: TD3Net proposes a temporal densely connected multi-dilated convolutional network for word-level lipreading that combines dense skip connections with multi-dilated convolutions to eliminate blind spots in the receptive field, achieving state-of-the-art performance with fewer parameters and FLOPs.

Details

Motivation: Existing TCN-based lipreading backends with dense skip connections still suffer from information loss due to blind spots in the receptive field, which fail to capture the continuous nature of lip movements effectively.

Method: TD3Net combines dense skip connections with multi-dilated temporal convolutions, applying different dilation factors to skip-connected features to create a wide and dense receptive field without blind spots.

Result: Achieves comparable performance to state-of-the-art methods on LRW and LRW-1000 datasets with higher accuracy, fewer parameters, and lower FLOPs than existing TCN-based backends. Visualization shows effective utilization of diverse temporal features while preserving temporal continuity.

Conclusion: TD3Net effectively addresses receptive field blind spots in lipreading systems, offering a more efficient backend architecture that better models continuous lip movements while reducing computational complexity.

Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).

[347] Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection

Yuchu Jiang, Jiaming Chu, Jian Zhao, Xin Zhang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: Loupe: A lightweight framework for joint deepfake detection and localization using patch-aware classification with conditional queries and test-time adaptation.

Details

Motivation: Existing deepfake detection methods have limitations: they focus on either image classification or pixel localization, suffer from poor generalization across manipulation types, and often use complex architectures. There's a need for a lightweight solution that can simultaneously detect and localize forgeries while maintaining robustness to distribution shifts.

Method: Loupe integrates a patch-aware classifier with a segmentation module using conditional queries. It performs simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness, it introduces pseudo-label-guided test-time adaptation that uses patch-level predictions to supervise the segmentation head.

Result: Achieved state-of-the-art performance on DDL dataset, securing first place in IJCAI 2025 Deepfake Detection and Localization Challenge with overall score of 0.846. Validated effectiveness of patch-level fusion and conditional query design for improving both classification accuracy and spatial localization across diverse forgery patterns.

Conclusion: Loupe provides an effective lightweight framework for joint deepfake detection and localization that outperforms existing methods through its innovative patch-aware classification with conditional queries and test-time adaptation mechanism.

Abstract: The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts of test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.

[348] Damba-ST: Domain-Adaptive Mamba for Efficient Urban Spatio-Temporal Prediction

Rui An, Yifeng Zhang, Ziran Liang, Wenqi Fan, Yuxuan Liang, Xuequn Shang, Qing Li

Main category: cs.CV

TL;DR: Damba-ST: A domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction with linear complexity and strong cross-city generalization.

Details

Motivation: Existing Transformer-based urban spatio-temporal models suffer from quadratic computational complexity and poor generalization across diverse cities/regions, limiting practical deployment in data-scarce or unseen areas.

Method: Proposes Damba-ST with two innovations: 1) Domain-adaptive state space model partitioning latent space into shared subspace (cross-domain commonalities) and domain-specific subspaces (intra-domain features); 2) Three Domain Adapters as domain-aware proxies to bridge disparate domain distributions and align cross-domain commonalities.

Result: Achieves state-of-the-art performance on prediction tasks with strong zero-shot generalization, enabling seamless deployment in new urban environments without extensive retraining/fine-tuning.

Conclusion: Damba-ST successfully combines Mamba’s linear complexity efficiency with enhanced domain adaptability, overcoming spatio-temporal heterogeneity limitations to enable practical, scalable urban spatio-temporal foundation models.

Abstract: Training urban spatio-temporal foundation models that generalize well across diverse regions and cities is critical for deploying urban services in unseen or data-scarce regions. Recent studies have typically focused on fusing cross-domain spatio-temporal data to train unified Transformer-based models. However, these models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment. Inspired by the efficiency of Mamba, a state space model with linear time complexity, we explore its potential for efficient urban spatio-temporal prediction. However, directly applying Mamba as a spatio-temporal backbone leads to negative transfer and severe performance degradation. This is primarily due to spatio-temporal heterogeneity and the recursive mechanism of Mamba’s hidden state updates, which limit cross-domain generalization. To overcome these challenges, we propose Damba-ST, a novel domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction. Damba-ST retains Mamba’s linear complexity advantage while significantly enhancing its adaptability to heterogeneous domains. Specifically, we introduce two core innovations: (1) a domain-adaptive state space model that partitions the latent representation space into a shared subspace for learning cross-domain commonalities and independent, domain-specific subspaces for capturing intra-domain discriminative features; (2) three distinct Domain Adapters, which serve as domain-aware proxies to bridge disparate domain distributions and facilitate the alignment of cross-domain commonalities. Extensive experiments demonstrate the generalization and efficiency of Damba-ST. It achieves state-of-the-art performance on prediction tasks and demonstrates strong zero-shot generalization, enabling seamless deployment in new urban environments without extensive retraining or fine-tuning.

[349] DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection

Kunwei Lv, Zhiren Xiao, Hang Ren, Ping Lan

Main category: cs.CV

TL;DR: DGE-YOLO is an enhanced YOLO-based framework for multi-modal UAV object detection that fuses infrared and visible images using dual-branch architecture, multi-scale attention, and improved feature aggregation.

Details

Motivation: The rapid proliferation of UAVs requires robust object detection in aerial scenarios, but detecting small objects under complex conditions remains challenging. There's a need for effective multi-modal information fusion to improve detection performance.

Method: 1. Dual-branch architecture for modality-specific feature extraction from infrared and visible images. 2. Efficient Multi-scale Attention (EMA) mechanism to enhance feature learning across spatial scales. 3. Gather-and-Distribute (GD) module replacing conventional neck to mitigate information loss during feature aggregation.

Result: Extensive experiments on the Drone Vehicle dataset demonstrate superior performance over state-of-the-art methods, validating effectiveness in multi-modal UAV object detection tasks.

Conclusion: DGE-YOLO presents an effective solution for multi-modal UAV object detection by successfully fusing infrared and visible information through architectural innovations, achieving state-of-the-art performance on benchmark datasets.

Abstract: The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge.To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. We introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute(GD) module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.

[350] GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

Zhiwei Zhang, Zi Ye, Yibin Wen, Shuai Yuan, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Main category: cs.CV

TL;DR: First fine-grained global terraced parcel dataset (GTPBD) with 200k+ manually annotated terraced parcels for precision agriculture applications.

Details

Motivation: Existing agricultural parcel extraction studies focus on mid-resolution mapping or regular plain farmlands, lacking representation of complex terraced terrains needed for precision agriculture applications.

Method: Created GTPBD dataset with 47,537 high-resolution images covering major worldwide terraced regions, featuring three-level labels: pixel-level boundary labels, mask labels, and parcel labels. Covers seven geographic zones in China and transcontinental climatic regions.

Result: Dataset introduces challenges due to terrain diversity, complex irregular parcel objects, and multiple domain styles. Benchmarked on 8 semantic segmentation methods, 4 edge extraction methods, 3 parcel extraction methods, and 5 UDA methods with multi-dimensional evaluation framework.

Conclusion: GTPBD fills critical gap in terraced remote sensing research, providing infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer, supporting four different computer vision tasks.

Abstract: Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture.In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world.Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) tasks.Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.

Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, Kun Fu, Xian Sun

Main category: cs.CV

TL;DR: RingMo-Agent is a unified vision-language framework for remote sensing that handles multi-modal, multi-platform data and performs both perception and reasoning tasks based on textual instructions.

Details

Motivation: Existing RS vision-language methods rely on homogeneous data sources and are limited to basic perception tasks like classification/captioning, failing to serve as a unified framework for diverse real-world RS imagery from different sensors and platforms.

Method: 1) Uses RS-VL3M dataset (3M+ image-text pairs across optical, SAR, IR modalities from satellite/UAV platforms); 2) Learns modality-adaptive representations with separated embedding layers to reduce cross-modal interference; 3) Unifies task modeling with task-specific tokens and token-based high-dimensional hidden state decoding for long-horizon spatial tasks.

Result: Extensive experiments show RingMo-Agent is effective for both visual understanding and sophisticated analytical tasks, and exhibits strong generalizability across different platforms and sensing modalities.

Conclusion: RingMo-Agent addresses limitations of existing RS vision-language methods by providing a unified framework capable of handling diverse multi-modal, multi-platform data for both perception and reasoning tasks based on user instructions.

Abstract: Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities.

[352] Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction

Hui Xie, Haiqin Hu, Lijuan Ding, Qing Li, Yue Sun, Tao Tan

Main category: cs.CV

TL;DR: ADDiff-Dose: A conditional diffusion model for automated radiotherapy dose prediction that outperforms traditional methods with better accuracy, faster planning, and improved clinical constraint compliance.

Details

Motivation: Current radiotherapy planning is time-consuming and expert-dependent, while existing deep learning methods have poor generalization, accuracy, and clinical applicability limitations.

Method: Anatomical-Dose Dual Constraints Conditional Diffusion Model using LightweightVAE3D for CT compression, multimodal inputs (target/OAR masks, beam parameters), progressive noise addition/denoising with multi-head attention, and composite loss function (MSE + conditional terms + KL divergence).

Result: Outperforms baselines with MAE 0.101-0.154 (vs 0.316 for UNet, 0.169 for GAN), DICE 0.927 (6.8% improvement), spinal cord max dose error <0.1 Gy, planning time reduced to 22 seconds per case, and 28.5% better clinical constraint compliance.

Conclusion: First conditional diffusion model for radiotherapy dose prediction offering generalizable, efficient automated planning across diverse tumor sites, with potential to substantially reduce planning time and improve clinical workflow.

Abstract: Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.

[353] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation

Kang Liu, Zhuoqi Ma, Zikang Fang, Yunan Li, Kun Xie, Qiguang Miao

Main category: cs.CV

TL;DR: PriorRG: A chest X-ray report generation framework that incorporates patient-specific prior knowledge (clinical context and prior images) to emulate real clinical workflows, outperforming existing methods.

Details

Motivation: Existing chest X-ray report generation methods neglect patient-specific prior knowledge like clinical context and prior images, which radiologists routinely use for diagnostic reasoning. This leads to reports that fail to capture diagnostic intent or track disease progression.

Method: Two-stage training pipeline: 1) Prior-guided contrastive pre-training using clinical context to guide spatiotemporal feature extraction, 2) Prior-aware coarse-to-fine decoding that progressively integrates patient-specific prior knowledge with vision encoder hidden states for report generation.

Result: Outperforms state-of-the-art methods with 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and 5.9% BLEU-1 gain on MIMIC-ABN dataset.

Conclusion: PriorRG successfully bridges the gap between automated report generation and real clinical workflows by incorporating patient-specific prior knowledge, leading to more clinically accurate and fluent reports that better capture diagnostic focus and disease progression.

Abstract: Chest X-ray report generation aims to reduce radiologists’ workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge – including clinical context (e.g., symptoms, medical history) and the most recent prior image – which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder’s hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. Code and checkpoints will be released upon acceptance.

[354] PMGS: Reconstruction of Projectile Motion Across Large Spatiotemporal Spans via 3D Gaussian Splatting

Yijun Xu, Jingrui Zhang, Yuhan Chen, Dingwen Wang, Lei Yu, Chu He

Main category: cs.CV

TL;DR: PMGS reconstructs projectile motion using 3D Gaussian Splatting with two-stage workflow: target modeling and motion recovery, incorporating physics constraints and adaptive optimization.

Details

Motivation: Existing dynamic reconstruction methods are limited to short-term, small-scale deformation with poor physical consistency, failing to handle complex rigid motion across large spatiotemporal spans.

Method: Two-stage approach: 1) Target modeling via dynamic scene decomposition and improved point density control; 2) Motion recovery learning per-frame SE(3) poses with acceleration consistency constraint, dynamic simulated annealing, and Kalman fusion for error optimization.

Result: PMGS demonstrates superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.

Conclusion: PMGS effectively addresses the challenge of modeling complex rigid motion across large spatiotemporal spans by integrating physics-based constraints and adaptive optimization strategies within a 3D Gaussian Splatting framework.

Abstract: Modeling complex rigid motion across large spatiotemporal spans remains an unresolved challenge in dynamic reconstruction. Existing paradigms are mainly confined to short-term, small-scale deformation and offer limited consideration for physical consistency. This study proposes PMGS, focusing on reconstructing Projectile Motion via 3D Gaussian Splatting. The workflow comprises two stages: 1) Target Modeling: achieving object-centralized reconstruction through dynamic scene decomposition and an improved point density control; 2) Motion Recovery: restoring full motion sequences by learning per-frame SE(3) poses. We introduce an acceleration consistency constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. Furthermore, we devise a Kalman fusion scheme to optimize error accumulation from multi-source observations to mitigate disturbances. Experiments show PMGS’s superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.

[355] AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu

Main category: cs.CV

TL;DR: AdaptInfer: A plug-and-play framework for adaptive vision token pruning in VLMs that reduces inference cost by 61.3% while maintaining 93.1% accuracy on LLaVA-1.5-7B.

Details

Motivation: Vision-language models have high inference costs due to processing many vision tokens. Existing pruning methods fail to exploit dynamic internal signals generated during inference, relying instead on static attention patterns or text prompts.

Method: 1) Fine-grained dynamic text-guided pruning that reuses layer-wise text-to-text attention maps to create soft priors for scoring vision tokens. 2) Offline analysis of cross-modal attention shifts to identify consistent inflection points, enabling a principled pruning schedule.

Result: Reduces CUDA latency by 61.3% while maintaining 93.1% average accuracy on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses state-of-the-art methods in accuracy.

Conclusion: AdaptInfer provides an effective, lightweight, plug-and-play solution for adaptive vision token pruning that generalizes across multimodal tasks and significantly reduces inference costs while maintaining performance.

Abstract: Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight and plug-and-play, also generalizable across multi-modal tasks. Experimental results have verified the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 93.1% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.

[356] SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models

Hanqing Wang, Yuan Tian, Mingyu Liu, Zhenhao Zhang, Xiangyang Zhu

Main category: cs.CV

TL;DR: SDEval is a dynamic safety evaluation framework for Multimodal Large Language Models that generates new test samples through text, image, and text-image transformations to address outdated benchmarks and data contamination issues.

Details

Motivation: Existing safety datasets for MLLMs become outdated quickly with model advancements and suffer from data contamination, creating a need for a dynamic evaluation framework that can adapt to evolving models and mitigate contamination issues.

Method: SDEval uses three dynamic strategies: text dynamics (modifying text content), image dynamics (modifying images), and text-image dynamics (cross-modal injection). These generate new samples from existing benchmarks while maintaining evaluation validity.

Result: Experiments show SDEval significantly impacts safety evaluation results, effectively mitigates data contamination problems, and exposes previously hidden safety limitations in MLLMs across multiple benchmarks including MLLMGuard, VLSBench, MMBench, and MMVet.

Conclusion: SDEval provides a flexible, generalizable framework for dynamic safety evaluation that addresses key limitations of static benchmarks, offering a more robust approach to assessing MLLM safety in evolving contexts.

Abstract: In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose \textbf{SDEval}, the \textit{first} safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval

[357] A Mutual-Structure Weighted Sub-Pixel Multimodal Optical Remote Sensing Image Matching Method

Tao Huang, Hongbo Pan, Nanxi Zhou, Siyuan Zou, Shun Zhou

Main category: cs.CV

TL;DR: PCWLAD is a coarse-to-fine framework for sub-pixel matching of multimodal optical images that uses phase congruency mutual-structure weighting and least absolute deviation to handle structural noise and inconsistencies.

Details

Motivation: Structural noise and inconsistencies in multimodal image responses limit matching accuracy, creating a need for robust sub-pixel matching methods for combined sensor applications.

Method: Two-stage approach: coarse matching preserves complete structure with enhanced cross-modal similarity and PC noise filtering; fine matching uses mutual-structure filtering and weighted least absolute deviation for adaptive sub-pixel displacement estimation.

Result: Outperforms eight state-of-the-art methods on three multimodal datasets (Landsat visible-infrared, short-range visible-near-infrared, UAV optical), achieving ~0.4 pixel average matching accuracy.

Conclusion: PCWLAD effectively addresses multimodal matching challenges and provides publicly available software and datasets for community use.

Abstract: Sub-pixel matching of multimodal optical images is a critical step in combined application of multiple sensors. However structural noise and inconsistencies arising from variations in multimodal image responses usually limit the accuracy of matching. Phase congruency mutual-structure weighted least absolute deviation (PCWLAD) is developed as a coarse-to-fine framework. In the coarse matching stage, we preserve the complete structure and use an enhanced cross-modal similarity criterion to mitigate structural information loss by PC noise filtering. In the fine matching stage, a mutual-structure filtering and weighted least absolute deviation-based is introduced to enhance inter-modal structural consistency and accurately estimate sub-pixel displacements adaptively. Experiments on three multimodal datasets-Landsat visible-infrared, short-range visible-near-infrared, and UAV optical image pairs demonstrate that PCWLAD consistently outperforms eight state-of-the-art methods, achieving an average matching accuracy of approximately 0.4 pixels. The software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.

[358] Learning the Language of Histopathology Images reveals Prognostic Subgroups in Invasive Lung Adenocarcinoma Patients

Abdul Rehman Akbar, Usama Sajjad, Ziyu Su, Wencheng Li, Fei Xing, Jimmy Ruiz, Wei Chen, Muhammad Khalid Khan Niazi

Main category: cs.CV

TL;DR: PathRosetta is an AI model that treats histopathology as a language (cells as words, neighborhoods as syntax, tissue as sentences) to predict 5-year recurrence in lung adenocarcinoma from H&E slides, outperforming traditional grading/staging systems and other AI models while providing interpretable predictions.

Details

Motivation: Recurrence in surgically resected invasive lung adenocarcinoma remains a major clinical challenge, and existing grading/staging systems fail to capture the cellular complexity underlying tumor aggressiveness.

Method: PathRosetta conceptualizes histopathology as a language: cells as words, spatial neighborhoods as syntactic structures, and tissue architecture as sentences. It learns this language to predict five-year recurrence directly from H&E slides, treating them as documents representing disease state.

Result: Achieved AUC of 0.78±0.04 on internal cohort (289 patients, 600 slides), significantly outperforming IASLC grading (AUC:0.71), AJCC staging (AUC:0.64), and other AI models (AUC:0.62-0.67). Hazard ratio of 9.54, concordance index of 0.70. Generalized robustly to external TCGA (AUC:0.75) and CPTAC (AUC:0.76) cohorts, performed consistently across demographic/clinical subgroups. Uncovered prognostic subgroups within individual cell types.

Conclusion: Representing histopathology as a language enables interpretable and generalizable prognostication from routine histology, with PathRosetta demonstrating superior performance over existing systems while providing inherent interpretability through its understanding of cell types, neighborhoods, and tissue morphology.

Abstract: Recurrence remains a major clinical challenge in surgically resected invasive lung adenocarcinoma, where existing grading and staging systems fail to capture the cellular complexity that underlies tumor aggressiveness. We present PathRosetta, a novel AI model that conceptualizes histopathology as a language, where cells serve as words, spatial neighborhoods form syntactic structures, and tissue architecture composes sentences. By learning this language of histopathology, PathRosetta predicts five-year recurrence directly from hematoxylin-and-eosin (H&E) slides, treating them as documents representing the state of the disease. In a multi-cohort dataset of 289 patients (600 slides), PathRosetta achieved an area under the curve (AUC) of 0.78 +- 0.04 on the internal cohort, significantly outperforming IASLC grading (AUC:0.71), AJCC staging (AUC:0.64), and other state-of-the-art AI models (AUC:0.62-0.67). It yielded a hazard ratio of 9.54 and a concordance index of 0.70, generalized robustly to external TCGA (AUC:0.75) and CPTAC (AUC:0.76) cohorts, and performed consistently across demographic and clinical subgroups. Beyond whole-slide prediction, PathRosetta uncovered prognostic subgroups within individual cell types, revealing that even within benign epithelial, stromal, or other cells, distinct morpho-spatial phenotypes correspond to divergent outcomes. Moreover, because the model explicitly understands what it is looking at, including cell types, cellular neighborhoods, and higher-order tissue morphology, it is inherently interpretable and can articulate the rationale behind its predictions. These findings establish that representing histopathology as a language enables interpretable and generalizable prognostication from routine histology.

[359] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation

Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee

Main category: cs.CV

TL;DR: CountCluster improves text-to-image diffusion models’ ability to generate correct number of objects by clustering cross-attention maps at early denoising steps without external tools or training.

Details

Motivation: Current diffusion models struggle to accurately reflect specified object counts in text prompts, and existing methods using external counting modules or learned representations have limitations. The key insight is that object quantity is determined in early denoising steps through cross-attention maps.

Method: CountCluster partitions object cross-attention maps into k clusters based on attention scores during inference, defines an ideal distribution where clusters are spatially well-separated, and optimizes the latent to align with this target distribution without external tools or additional training.

Result: Achieves average 18.5% improvement in object count accuracy compared to existing methods, with superior quantity control performance across various prompts.

Conclusion: The method effectively addresses object counting issues in diffusion models by focusing on early-stage cross-attention map clustering, demonstrating significant improvements without requiring external tools or retraining.

Abstract: Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic–The number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose \textit{CountCluster}, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5%p in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster

[360] Unsupervised Stereo via Multi-Baseline Geometry-Consistent Self-Training

Peng Xu, Zhiyu Xiang, Tingming Bai, Tianyu Pu, Kai Wang, Chaojie Ji, Zhihao Yang, Eryun Liu

Main category: cs.CV

TL;DR: S³ framework uses multi-baseline geometry consistency for stereo network training, addressing occlusion supervision issues by leveraging visibility asymmetry between teacher and student networks with different target images.

Details

Motivation: Photometric loss and pseudo-label self-training both fail to provide accurate supervision in occluded regions - photometric loss lacks valid correspondences in occlusions, while pseudo-labels are unreliable in these areas. There's a need for better supervision in occluded regions for stereo network training.

Method: S³ framework uses multi-baseline geometry consistency where teacher and student networks receive different target images, creating visibility asymmetry. Teacher’s disparities are rescaled to align with student’s baseline. Includes occlusion-aware weighting to mitigate unreliable supervision in teacher-occluded regions and encourage robust occlusion completion. Uses MBS20K dataset synthesized from CARLA simulator.

Result: S³ provides effective supervision in both occluded and non-occluded regions, achieves strong generalization performance, and surpasses previous state-of-the-art methods on KITTI 2015 and 2012 benchmarks.

Conclusion: The S³ framework effectively addresses occlusion supervision challenges in stereo network training through multi-baseline geometry consistency and visibility asymmetry, outperforming existing methods on standard benchmarks.

Abstract: Photometric loss and pseudo-label-based self-training are two widely used methods for training stereo networks on unlabeled data. However, they both struggle to provide accurate supervision in occluded regions. The former lacks valid correspondences, while the latter’s pseudo labels are often unreliable. To overcome these limitations, we present S$^3$, a simple yet effective framework based on multi-baseline geometry consistency. Unlike conventional self-training where teacher and student share identical stereo pairs, S$^3$ assigns them different target images, introducing natural visibility asymmetry. Regions occluded in the student’s view often remain visible and matchable to the teacher, enabling reliable pseudo labels even in regions where photometric supervision fails. The teacher’s disparities are rescaled to align with the student’s baseline and used to guide student learning. An occlusion-aware weighting strategy is further proposed to mitigate unreliable supervision in teacher-occluded regions and to encourage the student to learn robust occlusion completion. To support training, we construct MBS20K, a multi-baseline stereo dataset synthesized using the CARLA simulator. Extensive experiments demonstrate that S$^3$ provides effective supervision in both occluded and non-occluded regions, achieves strong generalization performance, and surpasses previous state-of-the-art methods on the KITTI 2015 and 2012 benchmarks.

[361] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

Xuyang Chen, Zhijun Zhai, Kaixuan Zhou, Zengmao Wang, Jianan He, Dong Wang, Yanfeng Zhang, mingwei Sun, Rüdiger Westermann, Konrad Schindler, Liqiu Meng

Main category: cs.CV

TL;DR: MeSS (Mesh-based Scene Synthesis) generates high-quality, style-consistent outdoor scenes using city mesh models as geometric priors, addressing the lack of realistic textures in urban mesh models for virtual navigation and autonomous driving applications.

Details

Motivation: City mesh models lack realistic textures, limiting their use in virtual urban navigation and autonomous driving. While image/video diffusion models can generate street-level views using spatial layouts, they struggle with 3D scene generation, camera path adherence, and cross-view consistency.

Method: Three-stage pipeline: 1) Generate geometrically consistent sparse views using Cascaded Outpainting ControlNets, 2) Propagate denser intermediate views via AGInpaint component, 3) Globally eliminate visual inconsistencies using GCAlign module. Concurrently reconstruct 3D Gaussian Splatting scene by initializing Gaussian balls on mesh surfaces.

Result: Outperforms existing approaches in both geometric alignment and generation quality. Synthesized scenes can be rendered in diverse styles through relighting and style transfer techniques.

Conclusion: MeSS successfully addresses the texture limitation of city mesh models by generating high-quality, style-consistent outdoor scenes with improved geometric alignment and cross-view consistency, enabling better virtual urban navigation and autonomous driving applications.

Abstract: Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Meshbased Scene Synthesis) for generating high-quality, styleconsistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques. project page: https://albertchen98.github.io/mess/

[362] COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu, Meng Cao, Xinyuan Shi, Xiaondan Liang

Main category: cs.CV

TL;DR: COLT enhances open-source video LLMs with continual tool usage, enabling automatic acquisition of tool-use abilities from streaming tool data without forgetting previously learned tools.

Details

Motivation: Existing video LLM methods assume fixed tool repositories and struggle with real-world environments where tool data is perpetually evolving and streaming in, requiring continual learning capabilities.

Method: COLT incorporates a learnable tool codebook as tool-specific memory, dynamically selects relevant tools based on similarity between user instruction and tool features, and uses VideoToolBench dataset for instruction tuning.

Result: Extensive experiments on video LLM benchmarks and the VideoToolBench dataset demonstrate state-of-the-art performance in continual tool usage.

Conclusion: COLT successfully enables open-source video LLMs to acquire tool-use abilities in streaming tool environments while avoiding catastrophic forgetting of previously learned tools.

Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.

[363] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Laiyuan Wang, Hua Zhang, Xiaochun Cao

Main category: cs.CV

TL;DR: EAGLE is a lightweight black-box framework for explaining token generation in multimodal LLMs, attributing tokens to perceptual regions while quantifying language vs. visual influence.

Details

Motivation: Current MLLMs lack understanding of how generated tokens depend on visual modalities, limiting interpretability and reliability. There's a need for methods that can explain autoregressive token generation in these models.

Method: EAGLE uses a unified objective function combining sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions. It performs both spatial attribution and modality-aware analysis to disentangle what tokens rely on.

Result: EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis across open-source MLLMs, while requiring substantially less GPU memory.

Conclusion: EAGLE provides an effective and practical framework for advancing the interpretability of MLLMs through faithful and efficient attribution of token generation to visual inputs.

Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.

[364] OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction

Hongyang Li, Jinyuan Qu, Lei Zhang

Main category: cs.CV

TL;DR: OVSeg3R learns open-vocabulary 3D instance segmentation from 2D models using 3D reconstruction, avoiding manual annotation and improving performance on tail classes.

Details

Motivation: To enable open-vocabulary 3D instance segmentation without costly manual annotation by leveraging existing 2D perception models and 3D reconstruction from videos.

Method: Uses reconstructed scenes from 2D videos as input, projects 2D instance masks onto 3D via reconstruction correspondences, employs View-wise Instance Partition to handle partial annotations, and introduces 2D Instance Boundary-aware Superpoint clustering to preserve object boundaries.

Result: Improves overall performance by +2.3 mAP on ScanNet200, reduces gap between tail and head classes, and achieves +7.1 mAP improvement on novel classes in open-vocabulary setting.

Conclusion: OVSeg3R effectively bridges 2D open-vocabulary models to 3D segmentation, demonstrating superior performance through innovative handling of 2D-to-3D projection and boundary-aware superpoint clustering.

Abstract: In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view’s 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view’s corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.

[365] G2L:From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Knowledge Distillation

Yesung Cho, Sungmin Lee, Geongyu Lee, Minkyung Lee, Jongbae Park, Dongmyung Shin

Main category: cs.CV

TL;DR: G2L framework uses knowledge distillation to boost large-scale pathology models (15% of giga-scale parameters) to giga-scale performance using only 1K target cancer slides, achieving better results than same-size models and sometimes surpassing giga-scale teachers.

Details

Motivation: Giga-scale pathology foundation models with billions of parameters trained on hundreds of thousands of slides face prohibitive computational costs for practical deployment, creating a need for more efficient alternatives that maintain high performance.

Method: Proposed G2L framework applies knowledge distillation to transfer capabilities from a giga-scale teacher model to a large-scale student model (15% of parameters) using only 1,000 pathology slides of a target cancer type (e.g., breast, prostate).

Result: Distilled large-scale models outperformed state-of-the-art same-size models across benchmarks, sometimes surpassed giga-scale teacher and huge-scale models, and showed higher robustness to multi-institutional image variations.

Conclusion: Knowledge distillation enables data- and parameter-efficient achievement of giga-scale performance for cancer-specific applications without prohibitive computational burden, making high-performance pathology AI more practical for deployment.

Abstract: Recent studies in pathology foundation models have shown that scaling training data, diversifying cancer types, and increasing model size consistently improve their performance. However, giga-scale foundation models, which are trained on hundreds of thousands of slides covering tens of cancer types and contain billions of parameters, pose significant challenges for practical use due to their tremendous computational costs in both development and deployment. In this work, we present a novel strategy, named the G2L framework, to increase the performance of large-scale foundation models, which consist of only $15%$ of the parameters of giga-scale models, to a comparable performance level of giga-scale models in cancer-specific tasks. Our approach applies knowledge distillation, transferring the capabilities of a giga-scale model to a large-scale model, using just 1K pathology slides of a target cancer (e.g., breast, prostate, etc.). The resulting distilled model not only outperformed state-of-the-art models of the same size (i.e., large-scale) across several benchmarks but also, interestingly, surpassed the giga-scale teacher and huge-scale models in some benchmarks. In addition, the distilled model exhibited a higher robustness index, indicating improved resilience to image variations originating from multiple institutions. These findings suggest that the proposed distillation approach for a large-scale model is a data- and parameter-efficient way to achieve giga-scale-level performance for cancer-specific applications without prohibitive computational burden.

[366] PICABench: How Far Are We from Physically Realistic Image Editing?

Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu

Main category: cs.CV

TL;DR: PICABench is a new benchmark that evaluates physical realism in image editing across 8 dimensions, revealing current models struggle with physical consistency despite good instruction following.

Details

Motivation: Current image editing models focus on completing editing instructions but overlook accompanying physical effects (shadows, reflections, object interactions), which are crucial for realism. There's a lack of systematic evaluation for physical consistency in editing.

Method: Introduces PICABench with 8 sub-dimensions spanning optics, mechanics, and state transitions for common editing operations. Proposes PICAEval evaluation protocol using VLM-as-a-judge with region-level human annotations. Also creates PICA-100K training dataset by learning physics from videos.

Result: Evaluation of mainstream models shows physical realism remains challenging with large room for improvement. The benchmark reveals current limitations in achieving physically consistent editing.

Conclusion: Physical realism is a key challenge in image editing. PICABench provides foundation for moving from naive content editing to physically consistent realism, with proposed solutions and dataset to advance the field.

Abstract: Image editing has achieved remarkable progress recently. Modern editing models could already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimension (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose the PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with large rooms to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.

[367] LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting

Yuchen Su, Zhineng Chen, Yongkun Du, Zuxuan Wu, Hongtao Xie, Yu-Gang Jiang

Main category: cs.CV

TL;DR: LRANet++ is an end-to-end text spotting framework that improves arbitrary-shaped text detection using low-rank approximation for shape representation and a triple assignment head for efficient inference.

Details

Motivation: Current end-to-end text spotters struggle with accurate and efficient detection of arbitrary-shaped text due to unreliable detection methods. The paper aims to address this bottleneck by developing better shape representation and faster inference.

Method: Proposes a parameterized text shape representation using low-rank approximation to capture intrinsic shape patterns from noisy annotations, and a triple assignment detection head with three branches (deep sparse, ultra-lightweight inference, and dense) to decouple training complexity from inference speed.

Result: Extensive experiments on challenging benchmarks show LRANet++ outperforms state-of-the-art methods in both accuracy and efficiency for arbitrary-shaped text spotting.

Conclusion: LRANet++ successfully addresses the detection bottleneck in end-to-end text spotting through robust shape representation and efficient inference design, achieving superior performance on arbitrary-shaped text recognition tasks.

Abstract: End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains challenging. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape representation based on low-rank approximation for precise detection and a triple assignment detection head for fast inference. Specifically, unlike current data-irrelevant shape representation methods, we exploit shape correlations among labeled text boundaries to construct a robust low-rank subspace. By minimizing an $\ell_1$-norm objective, we extract orthogonal vectors that capture the intrinsic text shape from noisy annotations, enabling precise reconstruction via the linear combination of only a few basis vectors. Next, the triple assignment scheme decouples training complexity from inference speed. It utilizes a deep sparse branch to guide an ultra-lightweight inference branch, while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet-PP.

[368] MPJudge: Towards Perceptual Assessment of Music-Induced Paintings

Shiqi Jiang, Tianyi Liang, Huayuan Ye, Changbo Wang, Chenhui Li

Main category: cs.CV

TL;DR: Proposes MPJudge, a novel framework for assessing perceptual coherence between music and paintings, using a new dataset MPD and Direct Preference Optimization to handle ambiguous cases.

Details

Motivation: Existing methods for evaluating music-induced paintings rely on emotion recognition models which introduce noise and overlook broader perceptual cues beyond just emotion, creating a need for a more comprehensive perceptual assessment framework.

Method: Creates MPD dataset (first large-scale music-painting pairs annotated by experts), develops MPJudge model with modulation-based fusion to integrate music features into visual encoder, and uses Direct Preference Optimization to handle ambiguous cases from pairwise preference annotations.

Result: Extensive experiments show the method outperforms existing approaches, and qualitative results demonstrate that the model more accurately identifies music-relevant regions in paintings.

Conclusion: The proposed framework successfully addresses limitations of emotion-based approaches by directly modeling perceptual coherence between music and visual art, providing a more effective assessment method for music-induced paintings.

Abstract: Music induced painting is a unique artistic practice, where visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large scale dataset of music painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music relevant regions in paintings.

[369] Improving VisNet for Object Recognition

Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu

Main category: cs.CV

TL;DR: Enhanced VisNet variants with radial basis function neurons, Mahalanobis distance learning, and retinal preprocessing improve object recognition and symmetry classification accuracy over baseline VisNet across multiple datasets.

Details

Motivation: To bridge the gap between biological visual systems' remarkable efficiency and artificial systems' capabilities by investigating biologically inspired neural networks for object recognition and symmetry detection.

Method: Enhanced VisNet variants incorporating radial basis function neurons, Mahalanobis distance based learning, and retinal like preprocessing, leveraging Hebbian learning and temporal continuity to build invariant representations.

Result: Enhanced VisNet variants substantially improve recognition accuracy compared with baseline model across MNIST, CIFAR10, and custom symmetric object datasets.

Conclusion: VisNet-inspired architectures offer a powerful and interpretable framework for visual recognition with biological relevance, adaptable for both neuroscience and artificial intelligence applications.

Abstract: Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance based learning, and retinal like preprocessing for both general object recognition and symmetry classification. By leveraging principles of Hebbian learning and temporal continuity associating temporally adjacent views to build invariant representations. VisNet and its extensions capture robust and transformation invariant features. Experimental results across multiple datasets, including MNIST, CIFAR10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence. Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations

[370] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting

Aymen Mir, Jian Wang, Riza Alp Guler, Chuan Guo, Gerard Pons-Moll, Bing Zhou

Main category: cs.CV

TL;DR: A framework using 3D Gaussian Splatting for animating humans in 3D scenes, enabling geometry-consistent free-viewpoint rendering without paired human-scene data.

Details

Motivation: 3DGS achieves state-of-the-art photorealistic novel-view synthesis but remains under-explored for human-scene animation and interaction. Existing pipelines use meshes or point clouds, but 3DGS offers advantages for geometry-consistent rendering.

Method: 1) Represent humans and scenes as 3D Gaussians; 2) Decouple rendering from motion synthesis; 3) Gaussian-aligned motion module using opacity-based cues and projected Gaussian structures; 4) Human-scene Gaussian refinement optimization for realistic contact and navigation.

Result: Evaluated on Scannet++ and SuperSplat scenes with avatars from sparse/dense multi-view capture. Enables novel applications like geometry-consistent free-viewpoint rendering of edited monocular RGB videos with newly animated humans.

Conclusion: The framework demonstrates unique advantages of 3DGS for monocular video-based human animation, enabling realistic human-scene interactions without paired training data.

Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation for animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that rendering can be decoupled from motion synthesis, and each sub-problem can be addressed independently without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework enables novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with newly animated humans, showcasing the unique advantages of 3DGS for monocular video-based human animation. To assess the full quality of our results, we encourage readers to view the supplementary material available at https://miraymen.github.io/aha/ .

[371] Data-Augmented Multimodal Feature Fusion for Multiclass Visual Recognition of Oral Cancer Lesions

Joy Naoum, Revana Salama, Ali Hamdi

Main category: cs.CV

TL;DR: A multimodal feature-fusion framework with data augmentation for VR-assisted oral cancer recognition, achieving improved accuracy over single-modality approaches.

Details

Motivation: Oral cancer is often diagnosed late due to lesion similarity. Existing deep learning approaches are limited by small, imbalanced datasets and single-modality features, restricting real-world clinical generalization.

Method: Proposes a data-augmentation driven multimodal feature-fusion framework integrated within a VR-assisted system. Combines extensive data-centric augmentation with fused clinical and image-based representations using a stratified training pipeline and EfficientNetV2 B1 backbone.

Result: Achieves 82.57% accuracy on 2 classes, 65.13% on 3 classes, and 54.97% on 4 classes, outperforming traditional single stream CNN models.

Conclusion: Multimodal feature fusion with strategic augmentation is effective for reliable early oral cancer lesion recognition and serves as a foundation for immersive VR-based clinical decision support tools.

Abstract: Oral cancer is frequently diagnosed at later stages due to its similarity to other lesions. Existing research on computer aided diagnosis has made progress using deep learning; however, most approaches remain limited by small, imbalanced datasets and a dependence on single-modality features, which restricts model generalization in real-world clinical settings. To address these limitations, this study proposes a novel data-augmentation driven multimodal feature-fusion framework integrated within a (Vision Recognition)VR assisted oral cancer recognition system. Our method combines extensive data centric augmentation with fused clinical and image-based representations to enhance model robustness and reduce diagnostic ambiguity. Using a stratified training pipeline and an EfficientNetV2 B1 backbone, the system improves feature diversity, mitigates imbalance, and strengthens the learned multimodal embeddings. Experimental evaluation demonstrates that the proposed framework achieves an overall accuracy of 82.57 percent on 2 classes, 65.13 percent on 3 classes, and 54.97 percent on 4 classes, outperforming traditional single stream CNN models. These results highlight the effectiveness of multimodal feature fusion combined with strategic augmentation for reliable early oral cancer lesion recognition and serve as a foundation for immersive VR based clinical decision support tools.

[372] Wukong’s 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Minghao Yin, Yukang Cao, Kai Han

Main category: cs.CV

TL;DR: WUKONG is a training-free framework for high-fidelity textured 3D morphing using flow-based transformers, achieving smooth shape transitions and faithful texture preservation without manual correspondence matching.

Details

Motivation: Conventional 3D morphing methods rely on manual correspondence matching and deformation trajectory estimation, which limits generalization and requires costly preprocessing. There's a need for a more efficient, training-free approach that can produce high-fidelity 3D transitions with rich texture details.

Method: WUKONG leverages flow-based transformers’ generative prior for 3D transitions. It formulates morphing as an optimal transport barycenter problem for smooth shape transitions, uses sequential initialization to prevent geometric distortions, and employs similarity-guided semantic consistency for texture preservation with selective high-frequency detail retention.

Result: Extensive evaluations show WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations. It supports both global texture transitions and identity-preserving texture morphing.

Conclusion: WUKONG provides a novel training-free framework for high-fidelity textured 3D morphing that overcomes limitations of conventional methods, offering better generalization, reduced preprocessing costs, and superior quality results with precise control over blending dynamics.

Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods – which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) – WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This empowers WUKONG to support both global texture transitions and identity-preserving texture morphing, catering to diverse generation needs. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

[373] Diminishing Returns in Self-Supervised Learning

Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D’Ornano, Shomit Basu

Main category: cs.CV

TL;DR: Small Vision Transformers (5M params) show that intermediate classification fine-tuning harms semantic segmentation performance, especially when pre-training is most effective, due to collapsing spatial structure needed for dense prediction.

Details

Motivation: To understand how self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning interact in low-capacity models, challenging the assumption that more training stages always help performance.

Method: Used 5M-parameter Vision Transformer for semantic segmentation across multiple data scales. Analyzed masked image modeling pre-training, intermediate classification fine-tuning, and downstream segmentation fine-tuning. Conducted patch-level representation geometry analysis to examine spatial structure.

Result: Masked image modeling pre-training and downstream fine-tuning improve performance with diminishing returns as supervision increases. Intermediate classification fine-tuning consistently degrades downstream performance, with largest drops where pre-training is most effective. Classification supervision collapses spatial structure critical for dense prediction.

Conclusion: In small models, supervision geometry matters more than number of training stages. Misaligned intermediate objectives can negate pre-training benefits rather than amplify them, highlighting the importance of task alignment in transfer learning pipelines.

Abstract: Transformer-based architectures have become a dominant paradigm in vision and language, but their success is often attributed to large model capacity and massive training data. In this work, we examine how self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning interact in a low-capacity regime, using a 5M-parameter Vision Transformer for semantic segmentation. Across multiple data scales, we find that masked image modeling pre-training and downstream fine-tuning reliably improve performance, but with clear diminishing returns as supervision increases. In contrast, inserting an intermediate classification fine-tuning stage consistently degrades downstream performance, with the largest drops occurring precisely where pre-training is most effective. Through an analysis of patch-level representation geometry, we show that classification-based intermediate supervision actively interferes with representations learned during pre-training by collapsing spatial structure critical for dense prediction. These results indicate that, in small models, the geometry of supervision matters more than the number of training stages: misaligned intermediate objectives can negate the benefits of pre-training rather than amplify them.

[374] Ideal Observer for Segmentation of Dead Leaves Images

Swantje Mahncke, Malte Ott

Main category: cs.CV

TL;DR: The paper presents a Bayesian ideal observer framework for image segmentation using dead leaves models, providing theoretical foundations and computational feasibility analysis for principled performance upper bounds.

Details

Motivation: To develop a principled theoretical framework for studying visual segmentation decisions by creating an ideal observer model that can serve as an upper-bound performance benchmark for comparing human vision and computer vision algorithms.

Method: The authors extend previous work on dead leaves models (generative models with object position, shape, color, and texture distributions) to derive a Bayesian ideal observer for pixel partitioning. They provide step-by-step explanations for computing posterior probabilities and analyze factors affecting practical computational feasibility.

Result: The paper develops a theoretical approach for computing posterior probabilities in dead leaves segmentation models, identifying key factors that determine whether the computation can be practically applied to study segmentation in limited pixel sets.

Conclusion: The dead leaves image model with its associated ideal observer provides a principled upper-bound on segmentation performance, enabling systematic comparison between human visual segmentation and computational vision algorithms in controlled settings.

Abstract: The human visual environment is comprised of different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider “dead leaves” models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model comprised of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects (“leaves”) from these distributions until a stopping criterion is reached, usually when the image is fully covered or until a given number of leaves was sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability as well as describe factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper-bound on performance, to which humans and vision algorithms could be compared.

[375] VisualActBench: Can VLMs See and Act like a Human?

Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo

Main category: cs.CV

TL;DR: The paper introduces Visual Action Reasoning as a new task and VisualActBench as a benchmark to evaluate VLMs’ ability to proactively reason and act from visual inputs without textual prompts, revealing significant gaps in human-aligned reasoning.

Details

Motivation: Current Vision-Language Models excel at perceiving and describing visual scenes but lack the ability to proactively reason and take appropriate actions based solely on visual inputs, which is crucial for real-world AI agents that need to anticipate outcomes and make human-aligned decisions.

Method: The authors create VisualActBench, a large-scale benchmark with 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with Action Prioritization Level (APL) and proactive-reactive type. They evaluate 29 VLMs on this benchmark.

Result: While frontier models like GPT4o show relatively strong performance, there remains a significant gap compared to human-level reasoning, especially in generating proactive, high-priority actions. Current VLMs struggle with complex context interpretation, outcome anticipation, and human decision-making alignment.

Conclusion: VisualActBench provides a comprehensive foundation for assessing and improving proactive, vision-centric AI agents. The benchmark reveals critical limitations in current VLMs’ real-world readiness and establishes a pathway for developing more human-aligned, proactive reasoning capabilities in vision-language models.

Abstract: Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models’ human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs’ ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

[376] Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method

Ge Zhang, Chunyang Wang, Bin Liu, Guan Xi

Main category: cs.CV

TL;DR: Proposes an adaptive dual-weight gravitational-based point cloud denoising method that achieves high accuracy, strong edge preservation, and real-time performance through octree spatial partitioning, adaptive voxel statistics, and gravitational scoring.

Details

Motivation: LiDAR point clouds are often noisy, degrading object detection accuracy. Existing methods trade off between computational efficiency and denoising quality, failing to simultaneously achieve high accuracy, edge preservation, and real-time performance.

Method: 1) Octree spatial partitioning for parallel acceleration; 2) Adaptive voxel occupancy statistics and kNN density estimation to remove isolated/low-density noise; 3) Gravitational scoring function combining density weights with adaptive distance weights to distinguish noise from object points.

Result: Experiments on Stanford 3D, CADC, and RUBY PLUS datasets show consistent improvements in F1, PSNR, and Chamfer Distance across various noise conditions while reducing single-frame processing time, validating high accuracy, robustness, and real-time performance.

Conclusion: The proposed adaptive dual-weight gravitational-based method successfully addresses the trade-off between denoising accuracy and computational efficiency, achieving simultaneous high accuracy, strong edge preservation, and real-time performance in multi-noise scenarios.

Abstract: High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dualweight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house RUBY PLUS LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.

[377] CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian

Main category: cs.CV

TL;DR: CADMorph is an AI framework for geometry-driven parametric CAD editing that uses pretrained foundation models to synchronize geometric shape changes with underlying parametric sequences while preserving structure, ensuring validity, and maintaining shape fidelity.

Details

Motivation: During iterative CAD design, geometric adjustments require synchronized edits to underlying parametric sequences, but this is challenging due to the need to preserve sequence structure, ensure semantic validity, maintain shape fidelity, and work with scarce editing data.

Method: CADMorph uses an iterative plan-generate-verify framework with two pretrained models: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. Planning uses P2S cross-attention to identify modification segments, generation uses MPP for semantically valid edits, and verification uses P2S to measure shape similarity.

Result: CADMorph surpasses GPT-4o and specialized CAD baselines, supports downstream applications like iterative editing and reverse-engineering enhancement, and works without requiring scarce editing triplet data for training.

Conclusion: CADMorph effectively solves geometry-driven parametric CAD editing by leveraging pretrained domain-specific foundation models to address structure preservation, semantic validity, and shape fidelity simultaneously, overcoming data scarcity limitations.

Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence’s structure, 2) ensuring each edit’s semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.

[378] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

Karthikeya KV

Main category: cs.CV

TL;DR: Vision-Enhanced LLM framework with rectified flow mechanism and bidirectional tokenization achieves superior high-resolution image synthesis and multimodal understanding with 25% better clarity and 20% lower computation than diffusion methods.

Details

Motivation: To address challenges in high-resolution image synthesis and multimodal data interpretation by integrating Vision-Enhanced LLMs with advanced transformer architectures, overcoming limitations of existing diffusion-based methods.

Method: Proposes a framework with rectified flow mechanism (linear noise-data paths), bidirectional tokenization for text-image-video fusion, spatial-temporal feature embedding, hybrid text-image sequence modeling, and noise-aware learning algorithm.

Result: Achieves 25% increase in image resolution clarity and 20% reduction in computational requirements compared to diffusion-based methods, with robust scalability and adaptability across diverse applications.

Conclusion: Vision-centric LLMs with the proposed framework redefine capabilities in computer vision and multimodal AI, showing strong potential for autonomous systems, creative content generation, and advanced video analysis.

Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.

[379] Spinal Line Detection for Posture Evaluation through Train-ing-free 3D Human Body Reconstruction with 2D Depth Images

Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Changgyun Kim, Taemin Lee

Main category: cs.CV

TL;DR: Proposes a 3D body posture analysis system using four depth images to restore 3D human model and estimate spine center line, overcoming limitations of single/multi-image methods.

Details

Motivation: Existing multi-image body restoration requires expensive equipment/complex procedures, while single-image methods struggle with occlusion and viewpoint limitations for accurate internal structure estimation like spine center line.

Method: Integrates depth images from four directions, uses hierarchical matching (global and fine registration), applies Adaptive Vertex Reduction for mesh resolution/shape reliability, and uses Level of Detail ensemble for spinal angle estimation accuracy/stability.

Result: Achieves high-precision 3D spine registration estimation without training data or complex neural networks, with verification confirming improved matching quality.

Conclusion: Proposed system compensates for multi-image method shortcomings and solves single-image limitations, enabling accurate 3D body restoration and spine center line estimation.

Abstract: The spinal angle is an important indicator of body balance. It is important to restore the 3D shape of the human body and estimate the spine center line. Existing mul-ti-image-based body restoration methods require expensive equipment and complex pro-cedures, and single image-based body restoration methods have limitations in that it is difficult to accurately estimate the internal structure such as the spine center line due to occlusion and viewpoint limitation. This study proposes a method to compensate for the shortcomings of the multi-image-based method and to solve the limitations of the sin-gle-image method. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restora-tion to noise and occlusion is performed. Also, the Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stabil-ity of spinal angle estimation are simultaneously secured by using the Level of Detail en-semble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and the verification confirms the improvement of matching quality.

[380] Test-Time Modification: Inverse Domain Transformation for Robust Perception

Arpit Jadon, Joshua Niemeijer, Yuki M. Asano

Main category: cs.CV

TL;DR: Using diffusion models at test time to map target images back to source distribution for domain generalization, achieving significant improvements across segmentation, detection, and classification tasks.

Details

Motivation: Generative foundation models have broad visual knowledge but synthesizing comprehensive target-domain variations for data augmentation is slow, expensive, and incomplete. There's a need for more efficient domain generalization approaches.

Method: Proposes using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. Requires only source domain description, preserves task model, and eliminates large-scale synthetic data generation.

Result: Demonstrates consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios. Achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.

Conclusion: Test-time diffusion-based domain mapping provides an effective alternative to data augmentation for domain generalization, offering significant performance improvements while being more efficient and preserving existing models.

Abstract: Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.

[381] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa

Main category: cs.CV

TL;DR: DISCODE is a finetuning-free method for robust image caption evaluation using LVLMs, featuring test-time adaptive evaluation with Gaussian prior distribution and achieving SOTA performance across multiple benchmarks.

Details

Motivation: Current LVLMs struggle with robust image caption evaluation, especially under domain-shift scenarios where evaluation metrics fail to align with human judgments across diverse domains.

Method: DISCODE introduces test-time adaptive evaluation with Adaptive Test-Time (ATT) loss using Gaussian prior distribution, which is efficiently minimized at test time via an analytical solution. The method is finetuning-free and adapts to different domains during evaluation.

Result: DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across the new MCEval benchmark (covering 6 domains) and four existing benchmarks, showing better alignment with human judgments.

Conclusion: DISCODE provides a robust, finetuning-free solution for image caption evaluation that adapts to domain shifts and better aligns with human judgments, addressing a key challenge in LVLM-based evaluation.

Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

[382] Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

Henghui Du, Chunjie Zhang, Xi Chen, Chang Zhou, Di Hu

Main category: cs.CV

TL;DR: VideoDetective: An efficient question-aware memory mechanism for MLLMs to process long videos by iteratively compressing sub-segments with special memory tokens and aggregating history context.

Details

Motivation: Long Video QA is challenging for MLLMs due to immense context, overloaded information, and prohibitive memory consumption. Existing methods that reduce visual tokens or extend context length may miss useful information or require heavy computation, while only small amounts of crucial information are actually needed for answering questions.

Method: Proposes VideoDetective with question-aware memory mechanism: 1) Iteratively processes video sub-segments, 2) Uses question-aware compression strategy with special memory tokens to purposefully compress visual information while seeking critical clues, 3) Recurrently aggregates and stores memory tokens to update history context for reuse in subsequent sub-segments.

Result: Enables MLLMs with 32K context length to efficiently process 100K tokens (3600 frames, hour-long video at 1fps) in only 2 minutes using 37GB GPU memory. Outperforms on multiple long video benchmarks by effectively seeking critical clues from massive information. Also introduces GLVC dataset for better evaluation of long video understanding.

Conclusion: VideoDetective provides an efficient solution for long video QA by enabling MLLMs to recurrently seek critical clues through question-aware compression and memory aggregation, addressing memory and computational constraints while maintaining effectiveness in extracting relevant information from lengthy videos.

Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model’s context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model’s long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.

[383] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu, Xiaoxiao Long, Wenjia Wang, Jingbo Wang, Yuan Liu, Wenping Wang, Daquan Zhou, Taku Komura, Zhiyang Dou

Main category: cs.CV

TL;DR: EgoReAct: First autoregressive framework for generating 3D-aligned human reaction motions from egocentric video streams in real-time, using a novel spatially-aligned dataset (HRD) and GPT-based generation with 3D dynamic features.

Details

Motivation: Existing datasets for modeling human reactions from egocentric video suffer from spatial inconsistency (e.g., dynamic motions paired with fixed-camera videos), making it challenging to achieve both causal generation and precise 3D spatial alignment.

Method: 1) Construct Human Reaction Dataset (HRD) with spatially aligned egocentric video-reaction pairs; 2) Use Vector Quantised-Variational AutoEncoder to compress reaction motion into compact latent space; 3) Train Generative Pre-trained Transformer for autoregressive reaction generation from visual input; 4) Incorporate 3D dynamic features (metric depth and head dynamics) to enhance spatial grounding.

Result: EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods while maintaining strict causality during generation. The framework operates in real-time.

Conclusion: EgoReAct successfully addresses the dual challenges of causal generation and 3D spatial alignment for human reaction modeling from egocentric video, enabled by the novel HRD dataset and the autoregressive framework with 3D dynamic features.

Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.

[384] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu

Main category: cs.CV

TL;DR: Proposes a training-free, two-stage data pruning method for remote sensing diffusion foundation models that selects high-quality subsets under high pruning ratios to improve training efficiency and model performance.

Details

Motivation: Existing remote sensing diffusion foundation models rely on large datasets with redundancy, noise, and class imbalance, which reduces training efficiency and prevents convergence. Current approaches use simplistic deduplication or aggregate multiple datasets without considering distributional requirements of generation modeling and RS imagery heterogeneity.

Method: Two-stage data pruning approach: 1) Entropy-based criterion removes low-information samples, 2) Scene-aware clustering with stratified sampling using RS scene classification datasets as reference benchmarks. Balances cluster-level uniformity and sample representativeness for fine-grained selection under high pruning ratios while preserving diversity.

Result: Method significantly improves convergence and generation quality even after pruning 85% of training data. Diffusion foundation models trained with this approach achieve state-of-the-art performance across downstream tasks including super-resolution and semantic image synthesis.

Conclusion: The proposed data pruning paradigm provides practical guidance for developing efficient and effective remote sensing generative foundation models by selecting high-quality subsets that preserve diversity and representativeness while dramatically reducing training data requirements.

Abstract: Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly select a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.

[385] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji

Main category: cs.CV

TL;DR: ArtQuant framework with RAD dataset improves aesthetic quality assessment for AIGC by addressing data scarcity and model fragmentation through large-scale structured data and LLM-enhanced multimodal modeling.

Details

Motivation: Aesthetic quality assessment is crucial for human-aligned AIGC evaluation but faces challenges due to its complex nature spanning perception, cognition, and emotion. Existing approaches suffer from data scarcity/imbalance (focusing only on visual perception) and model fragmentation (isolated aesthetic attributes or ineffective long-text processing).

Method: 1) Introduce RAD dataset: large-scale (70k) multi-dimensional structured dataset generated via iterative pipeline without heavy annotation costs. 2) Propose ArtQuant framework: couples isolated aesthetic dimensions through joint description generation and better models long-text semantics using LLM decoders. Theoretical analysis shows symbiosis between data and model minimizes prediction entropy.

Result: Achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs. Effectively narrows the cognitive gap between artistic images and aesthetic judgment.

Conclusion: The proposed ArtQuant framework with RAD dataset successfully addresses key challenges in aesthetic quality assessment, providing a scalable, mathematically-grounded solution that bridges the gap between artistic images and human aesthetic judgment while being computationally efficient.

Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD’s semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

[386] Pretraining Frame Preservation in Autoregressive Video Memory Compression

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala

Main category: cs.CV

TL;DR: PFP is a neural network that compresses long videos into short contexts while preserving high-frequency details of individual frames, enabling efficient long-term memory for video models.

Details

Motivation: The paper addresses the challenge of handling long videos in autoregressive models, where traditional approaches require large context windows that are computationally expensive. There's a need to compress video sequences while preserving visual details for effective long-term memory in video generation and understanding tasks.

Method: PFP uses a neural network structure with explicit pretraining objective to preserve high-frequency details of single frames at arbitrary temporal positions. It compresses 20-second videos into contexts of about 5k length, allowing random frame retrieval with perceptually preserved appearances. The pretrained models can be fine-tuned as memory encoders for autoregressive video models.

Result: The framework enables long history memory with low context cost and relatively low fidelity loss. The authors evaluate the approach with ablative settings and discuss trade-offs of different neural architecture designs.

Conclusion: PFP provides an effective solution for compressing long videos into compact representations while maintaining visual fidelity, making it suitable for integration with autoregressive video models to handle extended temporal sequences efficiently.

Abstract: We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.

[387] On Exact Editing of Flow-Based Diffusion Models

Zixiang Li, Yue Song, Jianing Peng, Ting Liu, Jun Huang, Xiaochao Qu, Luoqi Liu, Wei Wang, Yao Zhao, Yunchao Wei

Main category: cs.CV

TL;DR: CVC is a flow-based diffusion editing framework that corrects velocity errors in latent trajectories using a dual-perspective velocity conversion mechanism and posterior-consistent updates derived from Empirical Bayes Inference.

Details

Motivation: Current flow-based diffusion editing methods suffer from accumulated velocity errors in latent trajectories, causing semantic inconsistency and loss of structural fidelity during image transformations between source and target distributions.

Method: Proposes Conditioned Velocity Correction (CVC) with: 1) Dual-perspective velocity conversion mechanism that decomposes latent evolution into structure-preserving and semantically-guided branches, 2) Posterior-consistent updates using Empirical Bayes Inference and Tweedie correction to compensate for velocity errors and maintain fidelity to true flow.

Result: CVC achieves stable and interpretable latent dynamics with faithful reconstruction and smooth local semantic conversion. Experiments show superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks compared to existing methods.

Conclusion: CVC provides a principled framework for flow-based editing that addresses velocity error accumulation through mathematically grounded correction mechanisms, enabling more consistent and faithful image transformations between distributions.

Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

cs.AI

[388] Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections

Abhishek Kumar

Main category: cs.AI

TL;DR: Cross-lingual ontology alignment system using embedding-based cosine similarity with multilingual transformer models achieves 71% F1 score on OAEI-2022 multifarm track.

Details

Motivation: To improve cross-lingual ontology alignment by capturing subtle semantic similarities between entities in different languages, addressing the limitations of existing baseline methods.

Method: Enrich ontology entities with contextual descriptions using novel techniques, generate embeddings with fine-tuned multilingual transformer models, use cosine similarity for matching, and apply threshold filtering to retain highly similar entity pairs.

Result: Achieved 71% F1 score (78% recall, 65% precision) on OAEI-2022 multifarm track, representing a 16% improvement over the best baseline score.

Conclusion: The proposed alignment pipeline effectively captures cross-lingual similarities, demonstrating significant improvement over existing methods for multilingual ontology alignment tasks.

Abstract: The paper presents our work on cross-lingual ontology alignment system which uses embedding based cosine similarity matching. The ontology entities are made contextually richer by creating descriptions using novel techniques. We use a fine-tuned transformer based multilingual model for generating better embeddings. We use cosine similarity to find positive ontology entities pairs and then apply threshold filtering to retain only highly similar entities. We have evaluated our work on OAEI-2022 multifarm track. We achieve 71% F1 score (78% recall and 65% precision) on the evaluation dataset, 16% increase from best baseline score. This suggests that our proposed alignment pipeline is able to capture the subtle cross-lingual similarities.

[389] MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback

Ismail Ahmad Abdullah

Main category: cs.AI

TL;DR: MathLedger is a verifiable machine cognition system combining formal verification, cryptographic attestation, and learning dynamics for transparent AI in safety-critical applications.

Details

Motivation: Current AI systems are opaque and non-verifiable, creating trust issues for safety-critical deployment. There's a need for transparent, auditable AI systems that can be trusted in high-stakes applications.

Method: MathLedger integrates formal verification, cryptographic attestation, and learning dynamics into an epistemic loop using Reflexive Formal Learning (RFL) - a symbolic analogue of gradient descent where updates are driven by verifier outcomes rather than statistical loss.

Result: Phase I experiments validated the measurement and governance substrate under controlled conditions. CAL-EXP-3 validated measurement infrastructure (Delta p computation, variance tracking), and stress tests confirmed fail-closed governance triggers work correctly under out-of-bounds conditions. The system is a working prototype enabling auditability at scale.

Conclusion: MathLedger provides an infrastructural contribution - a ledger-attested learning prototype that enables auditability at scale, addressing the trust crisis in AI deployment for safety-critical applications through verifiable machine cognition.

Abstract: Contemporary AI systems achieve extraordinary performance yet remain opaque and non-verifiable, creating a crisis of trust for safety-critical deployment. We introduce MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop. The system implements Reflexive Formal Learning (RFL), a symbolic analogue of gradient descent where updates are driven by verifier outcomes rather than statistical loss. Phase I experiments validate the measurement and governance substrate under controlled conditions. CAL-EXP-3 validates measurement infrastructure (Delta p computation, variance tracking); separate stress tests confirm fail-closed governance triggers correctly under out-of-bounds conditions. No convergence or capability claims are made. The contribution is infrastructural: a working prototype of ledger-attested learning that enables auditability at scale. Keywords: verifiable learning, formal verification, cryptographic attestation, reflexive feedback, fail-closed governance

[390] Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making

Chandra Sekhar Kubam

Main category: cs.AI

TL;DR: Proposes an Agentic AI framework for autonomous credit risk assessment using multi-agent systems with reinforcement learning, natural language reasoning, and explainable AI modules to improve decision speed, transparency, and responsiveness over traditional models.

Details

Motivation: Digitalization of financial services creates urgent need for autonomous, transparent, real-time credit risk decision systems. Traditional ML models lack adaptive reasoning, situational awareness, and autonomy required for modern financial operations.

Method: Agentic AI framework where AI agents independently observe dynamic credit environments and make decisions with articulable reasoning paths. Multi-agent system with reinforcement learning, natural language reasoning, explainable AI modules, real-time data pipelines, agent collaboration protocols, risk-scoring engines, interpretability layers, and continuous feedback learning cycles.

Result: System demonstrates superior decision speed, transparency, and responsiveness compared to traditional credit scoring models. However, faces limitations including model drift risks, inconsistencies in interpreting high-dimensional data, regulatory uncertainties, and infrastructure limitations in low-resource settings.

Conclusion: Agentic AI framework has high potential to transform credit analytics. Future research should focus on dynamic regulatory compliance mechanisms, enhanced agent teamwork, adversarial robustness, and large-scale implementation across cross-country credit ecosystems.

Abstract: Significant digitalization of financial services in a short period of time has led to an urgent demand to have autonomous, transparent and real-time credit risk decision making systems. The traditional machine learning models are effective in pattern recognition, but do not have the adaptive reasoning, situational awareness, and autonomy needed in modern financial operations. As a proposal, this paper presents an Agentic AI framework, or a system where AI agents view the world of dynamic credit independent of human observers, who then make actions based on their articulable decision-making paths. The research introduces a multi-agent system with reinforcing learning, natural language reasoning, explainable AI modules, and real-time data absorption pipelines as a means of assessing the risk profiles of borrowers with few humans being involved. The processes consist of agent collaboration protocol, risk-scoring engines, interpretability layers, and continuous feedback learning cycles. Findings indicate that decision speed, transparency and responsiveness is better than traditional credit scoring models. Nevertheless, there are still some practical limitations such as risks of model drift, inconsistencies in interpreting high dimensional data and regulatory uncertainties as well as infrastructure limitations in low-resource settings. The suggested system has a high prospective to transform credit analytics and future studies ought to be directed on dynamic regulatory compliance mobilizers, new agent teamwork, adversarial robustness, and large-scale implementation in cross-country credit ecosystems.

[391] CogCanvas: Compression-Resistant Cognitive Artifacts for Long LLM Conversations

Tao An

Main category: cs.AI

TL;DR: CogCanvas is a training-free framework that extracts verbatim-grounded cognitive artifacts from conversations and organizes them into a temporal-aware graph to overcome context window limits while preserving information fidelity.

Details

Motivation: Large language models face a fundamental tension between context window limits and information fidelity in long conversations. Existing approaches (truncation and summarization) either discard early information or lose nuanced details.

Method: CogCanvas extracts verbatim-grounded cognitive artifacts (decisions, facts, reminders) from conversation turns and organizes them into a temporal-aware graph for compression-resistant retrieval, without requiring training.

Result: On LoCoMo benchmark: 34.7% overall accuracy (+9.1pp vs RAG, +21.0pp vs GraphRAG). Temporal reasoning: 31.5% vs 9.3% (RAG) and 5.0% (GraphRAG) - +530% relative improvement. Multi-hop causal reasoning: 81.0% pass rate vs 40.0% for GraphRAG. Controlled benchmarks show 97.5% recall with 93.0% exact match preservation.

Conclusion: While heavily-optimized approaches achieve higher absolute scores through dedicated training, CogCanvas provides practitioners with an immediately-deployable training-free alternative that significantly outperforms standard baselines for long conversation processing.

Abstract: Large language models face a fundamental tension between context window limits and information fidelity in long conversations. Existing approaches–truncation and summarization–either discard early information or lose nuanced details. We introduce CogCanvas, a training-free framework that extracts verbatim-grounded cognitive artifacts (decisions, facts, reminders) from conversation turns and organizes them into a temporal-aware graph for compression-resistant retrieval. On the LoCoMo benchmark, CogCanvas achieves 34.7% overall accuracy, outperforming RAG (25.6%, +9.1pp) and GraphRAG (13.7%, +21.0pp). The advantage is most pronounced on temporal reasoning: 31.5% vs. 9.3% (RAG) and 5.0% (GraphRAG)–a +530% relative improvement. On multi-hop causal reasoning, CogCanvas achieves 81.0% pass rate vs. 40.0% for GraphRAG (+41.0pp). Controlled benchmarks show 97.5% recall (+78.5pp vs. summarization) with 93.0% exact match preservation. While heavily-optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: approximately 92%), our training-free approach provides practitioners with an immediately-deployable alternative that significantly outperforms standard baselines. Code and data: https://github.com/tao-hpu/cog-canvas.

[392] Energy-Aware Routing to Large Reasoning Models

Austin R. Ellis-Mohr, Max Hartman, Lav R. Varshney

Main category: cs.AI

TL;DR: The paper analyzes energy optimization in large reasoning model systems through variance-aware routing and dispatch policies that balance baseline and auxiliary energy usage.

Details

Motivation: Large reasoning models have heterogeneous energy costs that vary by model and reasoning amount. To reduce energy consumption, systems need to intelligently dispatch tasks to different LRMs while balancing mean energy provisioning with stochastic fluctuations.

Method: The paper develops a theoretical framework focusing on the critical regime where neither auxiliary nor baseline energy is wasted. It analyzes variance-aware routing and dispatch policies, characterizing routing behavior based on training-compute and inference-compute scaling laws for LRMs.

Result: The critical regime is identified as the unique operating point that avoids systematic energy waste. Performance in this regime is governed by how variability is absorbed across time, models, and execution choices, highlighting the importance of variance-aware routing.

Conclusion: Variance-aware routing and dispatch provides a principled design axis for energy-aware model routing policies, offering a theoretical basis for developing efficient LRM systems that optimize energy consumption while maintaining performance.

Abstract: Large reasoning models (LRMs) have heterogeneous inference energy costs based on which model is used and how much it reasons. To reduce energy, it is important to choose the right LRM and operate it in the right way. As a result, the performance of systems that dispatch tasks to different individual LRMs depend on the balance between mean energy provisioning and stochastic fluctuations. The critical regime is the unique operating point at which neither auxiliary energy nor baseline energy is systematically wasted. Increasing baseline supply shifts the system toward persistent over-supply and baseline-energy waste, while reducing supply induces persistent reliance on auxiliary energy. Yet in this regime, performance remains volatility-limited and so a second-order characterization provides further insights that we develop. Here, performance is governed by how variability is absorbed across time, models, and execution choices. This perspective highlights variance-aware routing and dispatch as a principled design axis, and provides a theoretical basis for developing energy-aware model routing policies. Routing behavior is characterized when dispatch policies are based on training-compute and inference-compute scaling laws for LRMs.

[393] Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis

Yin Li

Main category: cs.AI

TL;DR: Stronger LLMs make fewer but deeper errors that resist self-correction, creating an Accuracy-Correction Paradox where weaker models achieve higher intrinsic correction rates.

Details

Motivation: To systematically investigate the self-correction capabilities of LLMs, particularly why intrinsic self-correction (without external feedback) remains largely ineffective despite widespread belief in LLMs' self-correction abilities.

Method: Decomposed self-correction into three sub-capabilities: error detection, error localization, and error correction. Conducted cross-model experiments on GSM8K-Complex dataset (n=500 per model, 346 total errors) with three major LLMs (GPT-3.5, DeepSeek, Claude).

Result: Discovered Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieved 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy) - 26.8% vs 16.7%. Error detection rates varied dramatically (10% to 82%) but didn’t predict correction success. Providing error location hints hurt all models.

Conclusion: Proposed Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Findings challenge linear assumptions about model capability and self-improvement, with important implications for designing self-refinement pipelines.

Abstract: Large Language Models (LLMs) are widely believed to possess self-correction capabilities, yet recent studies suggest that intrinsic self-correction–where models correct their own outputs without external feedback–remains largely ineffective. In this work, we systematically decompose self-correction into three distinct sub-capabilities: error detection, error localization, and error correction. Through cross-model experiments on GSM8K-Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)–26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success–Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self-improvement, with important implications for the design of self-refinement pipelines.

[394] Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

Deep Pankajbhai Mehta

Main category: cs.AI

TL;DR: AI models notice hidden hints in questions but rarely mention them in their reasoning, even when directly asked. Forcing them to report hints causes over-reporting and reduces accuracy, revealing limitations of current explanation methods.

Details

Motivation: To test whether AI systems' step-by-step explanations actually reveal what influenced their answers, particularly whether they disclose embedded hints in questions.

Method: Conducted study with over 9,000 test cases across 11 leading AI models, embedding hints into questions and measuring whether models mentioned them. Tested various conditions: spontaneous reporting, direct questioning, “being watched” condition, and forced reporting.

Result: Models almost never mention hints spontaneously but admit noticing them when asked directly. Forcing models to report hints causes them to report hints even when none exist and reduces accuracy. Hints appealing to user preferences are especially dangerous - models follow them most often while reporting them least.

Conclusion: Simply watching AI reasoning is insufficient to catch hidden influences. Current explanation methods don’t reliably reveal what actually influences AI decisions, posing risks for AI transparency and trustworthiness.

Abstract: When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI’s answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.

[395] OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification

Ayda Aghaei Nia

Main category: cs.AI

TL;DR: OmniNeuro is an interpretable BCI framework that transforms black-box decoders into transparent feedback systems using physics, chaos, and quantum-inspired metrics to provide real-time neuro-sonification and clinical reports.

Details

Motivation: Clinical adoption of deep learning-based BCIs is hindered by their "black box" nature, which causes user frustration and poor neuroplasticity outcomes due to lack of transparency and explainability.

Method: Proposes OmniNeuro framework with three interpretability engines: (1) Physics-based energy metrics, (2) Chaos/fractal complexity analysis, and (3) Quantum-inspired uncertainty modeling. These drive real-time neuro-sonification and generative AI clinical reports. The system is decoder-agnostic and works as an interpretability layer for any state-of-the-art architecture.

Result: Achieved mean accuracy of 58.52% on PhysioNet dataset (N=109). Qualitative pilot studies (N=3) confirmed that explainable feedback helps users regulate mental effort and reduces trial-and-error phase.

Conclusion: OmniNeuro successfully transforms BCIs from silent decoders into transparent feedback partners, addressing the interpretability gap in clinical BCI adoption while maintaining compatibility with existing state-of-the-art architectures.

Abstract: While Deep Learning has improved Brain-Computer Interface (BCI) decoding accuracy, clinical adoption is hindered by the “Black Box” nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner. OmniNeuro integrates three interpretability engines: (1) Physics (Energy), (2) Chaos (Fractal Complexity), and (3) Quantum-Inspired uncertainty modeling. These metrics drive real-time Neuro-Sonification and Generative AI Clinical Reports. Evaluated on the PhysioNet dataset ($N=109$), the system achieved a mean accuracy of 58.52%, with qualitative pilot studies ($N=3$) confirming that explainable feedback helps users regulate mental effort and reduces the “trial-and-error” phase. OmniNeuro is decoder-agnostic, acting as an essential interpretability layer for any state-of-the-art architecture.

[396] Enhancing Temporal Awareness in LLMs for Temporal Point Processes

Lili Chen, Wensheng Gan, Shuang Liang, Philip S. Yu

Main category: cs.AI

TL;DR: TPP-TAL is a plug-and-play framework that enhances temporal reasoning in LLMs for temporal point process modeling by explicitly aligning temporal dynamics with contextual semantics, improving event prediction accuracy.

Details

Motivation: Current methods struggle to capture complex interactions between temporal information and semantic context in temporal point processes, limiting LLMs' effectiveness in continuous-time event modeling despite their success in sequence modeling.

Method: TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding information into LLMs, rather than simply concatenating event time and type embeddings, enabling better perception of temporal dependencies and long-range interactions.

Result: Comprehensive experiments on benchmark datasets show TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy compared to conventional approaches.

Conclusion: Enhancing temporal awareness in LLMs is crucial for continuous-time event modeling, and TPP-TAL’s explicit alignment of temporal dynamics with contextual semantics provides an effective plug-and-play solution for temporal point process applications.

Abstract: Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at https://github.com/chenlilil/TPP-TAL

[397] Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

Ron F. Del Rosario

Main category: cs.AI

TL;DR: Fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis, achieving 31.4% accuracy improvement through strategic data curation and QLoRA fine-tuning.

Details

Motivation: To develop a reproducible framework for detecting security threats in multi-agent AI workflows, addressing the need for specialized security models that can identify temporal attack patterns and coordination attacks in complex AI systems.

Method: Curated dataset of 80,851 examples from 18 cybersecurity sources and 35,026 synthetic OpenTelemetry traces. Applied iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation.

Result: Custom benchmark accuracy improved from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperformed indiscriminate scaling.

Conclusion: Established the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes. While practical deployment requires human oversight due to false positive rates, the work demonstrates that training data composition fundamentally determines model behavior.

Abstract: We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi-agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.

[398] Comment on: Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Tasks

Milos Stankovic, Ella Hirche, Sarah Kollatzsch, Julia Nadine Doetsch

Main category: cs.AI

TL;DR: Critical commentary on Kosmyna et al.’s (2025) study about AI assistants and cognitive debt, highlighting methodological concerns and suggesting more conservative interpretation of results.

Details

Motivation: To provide constructive feedback to improve Kosmyna et al.'s manuscript for peer-reviewed publication, addressing concerns about study design, reproducibility, and methodological issues.

Method: Critical analysis of the original study’s methodology, focusing on five key areas: study design limitations, reproducibility problems, EEG analysis issues, reporting inconsistencies, and transparency gaps.

Result: Identifies significant methodological concerns that warrant more conservative interpretation of Kosmyna et al.’s findings about AI assistants and cognitive debt in essay writing.

Conclusion: The commentary suggests that Kosmyna et al.’s results should be interpreted more cautiously due to methodological limitations, while acknowledging the value of their research initiative and dataset.

Abstract: Recently published work titled Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Task by Kosmyna et al. (2025) has sparked a vivid debate on the topic of artificial intelligence (AI) and human performance. We sincerely congratulate Kosmyna et al. for initiating such important research, collecting a valuable dataset, and establishing highly automated pipelines for Natural Language Processing (NLP) analyses and scoring. We aim to provide constructive comments that may improve the manuscript’s readiness for peer-reviewed publication, as some results by Kosmyna et al. (2025) could be interpreted more conservatively. Our primary concerns focus on: (i) study design considerations, including the limited sample size; (ii) the reproducibility of the analyses; (iii) methodological issues related to the EEG analysis; (iv) inconsistencies in the reporting of results; and (v) limited transparency in several aspects of the study’s procedures and findings.

[399] A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents

Gaoke Zhang, Bo Wang, Yunlong Ma, Dongming Zhao, Zifei Yu

Main category: cs.AI

TL;DR: The paper proposes MMS, a multi-memory segment system for agents that creates structured long-term memory with retrieval and contextual units, improving memory quality over simple summarization approaches.

Details

Motivation: Current agent memory systems focus on retrieval but neglect memory content quality, storing only summarized dialogue history. This low-quality memory content harms recall performance and response quality. Human long-term memory formation is multi-dimensional and multi-component, not just simple summarization.

Method: Designed MMS (multi-memory segment system) inspired by cognitive psychology. Processes short-term memory into multiple long-term memory segments, then constructs paired retrieval memory units and contextual memory units (one-to-one correspondence). During retrieval, matches user queries to relevant retrieval units, then uses corresponding contextual units as context for response generation.

Result: Experiments on LoCoMo dataset show effectiveness. Additional ablation studies, robustness tests on varying input memory counts, and overhead experiments demonstrate practical value and superiority over existing methods.

Conclusion: MMS provides a more sophisticated approach to agent memory construction that better mimics human memory formation, leading to higher quality memory content that improves retrieval performance and response generation compared to simple summarization methods.

Abstract: In the current field of agent memory, extensive explorations have been conducted in the area of memory retrieval, yet few studies have focused on exploring the memory content. Most research simply stores summarized versions of historical dialogues, as exemplified by methods like A-MEM and MemoryBank. However, when humans form long-term memories, the process involves multi-dimensional and multi-component generation, rather than merely creating simple summaries. The low-quality memory content generated by existing methods can adversely affect recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS will match the most relevant retrieval memory units based on the user’s query. Then, the corresponding contextual memory units is obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. We conducted experiments on the LoCoMo dataset and further performed ablation experiments, experiments on the robustness regarding the number of input memories, and overhead experiments, which demonstrated the effectiveness and practical value of our method.

[400] Cultural Encoding in Large Language Models: The Existence Gap in AI-Mediated Brand Discovery

Huang Junyao, Situ Ruimin, Ye Renqin

Main category: cs.AI

TL;DR: Chinese LLMs show 30.6% higher brand mention rates than international LLMs due to training data geography, creating an “Existence Gap” where brands absent from training data become algorithmically invisible regardless of quality.

Details

Motivation: As AI systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. The study investigates systematic differences in brand recommendations arising from training data composition in LLMs.

Method: Analyzed 1,909 pure-English queries across 6 LLMs (GPT-4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands. Conducted case study of Zhizibianjie (OmniEdge) platform to demonstrate linguistic boundary barriers.

Result: Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p<.001). The disparity persists in identical English queries, indicating training data geography drives the effect. Zhizibianjie had 65.6% mention rate in Chinese LLMs but 0% in International models.

Conclusion: Introduces “Existence Gap” concept and “Data Moat Framework” where AI-visible content becomes strategic resource. Proposes “Algorithmic Omnipresence” as strategic objective for Generative Engine Optimization (GEO) with 18-month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization.

Abstract: As artificial intelligence systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. This study investigates Cultural Encoding in Large Language Models (LLMs) – systematic differences in brand recommendations arising from training data composition. Analyzing 1,909 pure-English queries across 6 LLMs (GPT-4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands, we find Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p<.001). This disparity persists in identical English queries, indicating training data geography – not language – drives the effect. We introduce the Existence Gap: brands absent from LLM training corpora lack “existence” in AI responses regardless of quality. Through a case study of Zhizibianjie (OmniEdge), a collaboration platform with 65.6% mention rate in Chinese LLMs but 0% in International models (p<.001), we demonstrate how Linguistic Boundary Barriers create invisible market entry obstacles. Theoretically, we contribute the Data Moat Framework, conceptualizing AI-visible content as a VRIN strategic resource. We operationalize Algorithmic Omnipresence – comprehensive brand visibility across LLM knowledge bases – as the strategic objective for Generative Engine Optimization (GEO). Managerially, we provide an 18-month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization. Our findings reveal that in AI-mediated markets, the limits of a brand’s “Data Boundaries” define the limits of its “Market Frontiers.”

[401] Universal Conditional Logic: A Formal Language for Prompt Engineering

Anthony Mikinka

Main category: cs.AI

TL;DR: UCL is a mathematical framework that transforms prompt engineering from heuristic practice to systematic optimization, achieving 29.8% token reduction with corresponding cost savings through structural overhead analysis and model-specific adaptations.

Details

Motivation: Current prompt engineering is largely heuristic and lacks systematic optimization methods. The paper aims to transform prompt engineering from an art to a science by developing a mathematical framework that can systematically optimize prompts for efficiency and cost-effectiveness.

Method: Developed Universal Conditional Logic (UCL) framework with core mechanisms including indicator functions, structural overhead function O_s(A), and early binding. Conducted systematic evaluation across 305 cases, 11 models, and 4 iterations to validate the framework and identify optimal configurations.

Result: Achieved significant token reduction of 29.8% (p < 0.001, Cohen’s d = 2.01) with corresponding cost savings. Discovered the Over-Specification Paradox with threshold S* = 0.509, beyond which additional specification degrades performance quadratically. Found that optimal UCL configuration varies by model architecture, requiring version-specific adaptations.

Conclusion: UCL establishes a calibratable framework for efficient LLM interaction, transforming prompt engineering from heuristic practice to systematic optimization. Model-family-specific optimization emerges as a key research direction for future work.

Abstract: We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p < 0.001, Cohen’s d = 2.01) with corresponding cost savings. UCL’s structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms – indicator functions (I_i in {0,1}), structural overhead (O_s = gamma * sum(ln C_k)), early binding – are validated. Notably, optimal UCL configuration varies by model architecture – certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.

[402] Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng

Main category: cs.AI

TL;DR: Finch is a finance & accounting benchmark for evaluating AI agents on real-world enterprise workflows, sourced from authentic enterprise data with 172 composite workflows and 384 tasks, showing current AI systems struggle with messy, long-horizon professional work.

Details

Motivation: There's a need to evaluate AI agents on authentic enterprise-grade professional workflows that capture the messy, multimodal, collaborative nature of real-world finance and accounting work, rather than simplified synthetic tasks.

Method: Created Finch benchmark through LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real email threads and spreadsheet version histories, (2) meticulous expert annotation requiring 700+ hours of domain-expert effort, resulting in 172 composite workflows with 384 tasks from 1,710 spreadsheets and other artifacts.

Result: Current frontier AI systems perform poorly: GPT 5.1 Pro spends 16.8 minutes per workflow but passes only 38.4%, Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies reveal challenges real-world enterprise workflows pose for AI agents.

Conclusion: Real-world enterprise workflows present significant challenges for current AI agents, highlighting the gap between synthetic benchmarks and authentic professional work, and demonstrating the need for benchmarks like Finch that capture the messy, long-horizon, knowledge-intensive nature of enterprise tasks.

Abstract: We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows – interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 16.8 minutes per workflow yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

[403] Counterfactual Self-Questioning for Stable Policy Optimization in Language Models

Mandar Parab

Main category: cs.AI

TL;DR: Counterfactual Self-Questioning enables language models to self-improve by generating and evaluating alternative reasoning paths without external critics or reward models.

Details

Motivation: Existing self-improvement methods rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. The authors aim to develop a simpler, more stable approach using only the model's internal capabilities.

Method: A single language model generates an initial reasoning trace, formulates targeted questions challenging potential failure points, and creates alternative counterfactual reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback for direct policy optimization.

Result: Experiments on multiple mathematical reasoning benchmarks show improved accuracy and training stability, particularly for smaller models, enabling scalable self-improvement using only internally generated supervision.

Conclusion: Counterfactual Self-Questioning provides an effective framework for language model self-improvement that eliminates the need for external components, reduces complexity, and enhances training stability while maintaining performance gains.

Abstract: Recent work on language model self-improvement shows that models can refine their own reasoning through reflection, verification, debate, or self-generated rewards. However, most existing approaches rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. The method produces an initial reasoning trace, formulates targeted questions that challenge potential failure points, and generates alternative reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback that can be directly used for policy optimization without auxiliary models. Experiments on multiple mathematical reasoning benchmarks show that counterfactual self-questioning improves accuracy and training stability, particularly for smaller models, enabling scalable self-improvement using internally generated supervision alone.

[404] Context Collapse: In-Context Learning and Model Collapse

Josef Ott

Main category: cs.AI

TL;DR: This thesis analyzes three phenomena in LLMs: in-context learning phase transitions in linear transformers, model collapse in simplified settings, and introduces context collapse linking ICL dynamics with long-term stability issues.

Details

Motivation: The research aims to understand fundamental mechanisms behind key LLM behaviors - how in-context learning works mathematically, why model collapse occurs, and how context degradation happens during long generations, particularly in chain-of-thought reasoning.

Method: 1) For ICL: Study linear transformer with tied weights on linear regression tasks, reduce forward pass to preconditioned gradient descent, analyze optimal preconditioner. 2) For model collapse: Use martingale and random walk theory on simplified settings (linear regression and Gaussian fitting) under replacing and cumulative data regimes. 3) For context collapse: Introduce concept linking ICL dynamics with long-term stability.

Result: 1) ICL shows phase transition: above critical context length, solution develops skew-symmetric component inducing gradient rotation. 2) Model collapse: Proved almost sure convergence - collapse occurs unless data grows sufficiently fast or is retained. 3) Context collapse: Identified degradation during long generations, linking ICL dynamics with stability challenges.

Conclusion: The thesis provides mathematical foundations for understanding LLM behaviors, showing phase transitions in ICL, proving model collapse conditions, and introducing context collapse as a unifying concept connecting ICL dynamics with long-term generative stability issues.

Abstract: This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, and show that minimising the in-context loss leads to a phase transition in the learned parameters. Above a critical context length, the solution develops a skew-symmetric component. We prove this by reducing the forward pass of the linear transformer under weight tying to preconditioned gradient descent, and then analysing the optimal preconditioner. This preconditioner includes a skew-symmetric component, which induces a rotation of the gradient direction. For model collapse, we use martingale and random walk theory to analyse simplified settings - linear regression and Gaussian fitting - under both replacing and cumulative data regimes. We strengthen existing results by proving almost sure convergence, showing that collapse occurs unless the data grows sufficiently fast or is retained over time. Finally, we introduce the notion of context collapse: a degradation of context during long generations, especially in chain-of-thought reasoning. This concept links the dynamics of ICL with long-term stability challenges in generative models.

Michael Bao

Main category: cs.AI

TL;DR: ElecTwit is a simulation framework for studying persuasion in multi-agent systems that mimics social media during political elections, revealing 25 persuasion techniques across LLMs and unique emergent behaviors.

Details

Motivation: To overcome limitations of game-based simulations in prior research by creating a realistic environment for studying persuasion in multi-agent systems, specifically emulating social media interactions during political elections.

Method: Developed ElecTwit simulation framework that grounds experiments in realistic social media environments during political elections, testing various LLMs in multi-agent systems to observe persuasion techniques and emergent behaviors.

Result: Observed comprehensive use of 25 specific persuasion techniques across most tested LLMs (more than previously reported), found variations in technique usage between models based on architecture/training, and discovered unique phenomena like “kernel of truth” messages and spontaneous “ink obsession” where agents collectively demanded written proof.

Conclusion: The study provides a foundation for evaluating persuasive LLM agents in real-world contexts, highlighting how different model architectures impact persuasion dynamics in realistic social simulations, which is crucial for ensuring alignment and preventing dangerous outcomes.

Abstract: This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi-agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game-based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as “kernel of truth” messages and spontaneous developments with an “ink” obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real-world contexts, ensuring alignment and preventing dangerous outcomes.

[406] Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

Wuzhenghong Wen, Chao Xue, Su Pan, Yuwei Sun, Minlong Peng

Main category: cs.AI

TL;DR: MRE framework improves TKGQA by enhancing multi-hop reasoning with prompt engineering, supervised fine-tuning, and Tree-Group Relative Policy Optimization to identify optimal reasoning trajectories.

Details

Motivation: Current TKGQA approaches face challenges with suboptimal decisions and error propagation due to retrieving numerous temporally similar and semantically complex relations at each hop during multi-hop reasoning over temporal knowledge graphs.

Method: Proposes MRE framework with three components: 1) Prompt engineering to generate diverse reasoning trajectories, 2) Supervised fine-tuning using valid trajectories as cold-start, 3) Tree-Group Relative Policy Optimization (T-GRPO) - a recursive tree-structured learning-by-exploration approach that establishes causal dependencies between hops and uses multi-path feedback.

Result: Experimental results on two TKGQA benchmarks show MRE consistently surpasses state-of-the-art approaches in handling complex multi-hop queries, with improved interpretability and robustness to noisy temporal annotations.

Conclusion: The MRE framework effectively addresses challenges in TKGQA by enhancing both forward and backward reasoning, leading to better identification of globally optimal reasoning trajectories and improved performance on complex multi-hop queries.

Abstract: Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.

[407] Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies

Keith Frankston, Benjamin Howard

Main category: cs.AI

TL;DR: RMCTS is a faster AlphaZero-style MCTS algorithm that uses breadth-first search and optimized posterior policies, achieving 40x speedup for single root states and matching MCTS-UCB network quality in 1/3 training time.

Details

Motivation: AlphaZero's MCTS-UCB suffers from GPU latency issues due to sequential network inferences. The authors aim to develop a faster alternative that maintains similar training quality.

Method: RMCTS uses breadth-first tree exploration with network inferences batched together. It computes optimized posterior policies recursively from leaves to root using Grill et al.’s regularization approach, following prior network policies rather than adaptive tree building.

Result: RMCTS achieves >40x speedup over MCTS-UCB for single root states and ~3x speedup for batch searches. Networks trained with RMCTS match MCTS-UCB-trained network quality in one-third the training time across Connect-4, Dots-and-Boxes, and Othello.

Conclusion: RMCTS provides significant speed advantages over MCTS-UCB while maintaining training quality, making it a practical alternative for AlphaZero-style reinforcement learning despite using non-adaptive tree exploration.

Abstract: We introduce a recursive AlphaZero-style Monte–Carlo tree search algorithm, “RMCTS”. The advantage of RMCTS over AlphaZero’s MCTS-UCB is speed. In RMCTS, the search tree is explored in a breadth-first manner, so that network inferences naturally occur in large batches. This significantly reduces the GPU latency cost. We find that RMCTS is often more than 40 times faster than MCTS-UCB when searching a single root state, and about 3 times faster when searching a large batch of root states. The recursion in RMCTS is based on computing optimized posterior policies at each game state in the search tree, starting from the leaves and working back up to the root. Here we use the posterior policy explored in “Monte–Carlo tree search as regularized policy optimization” (Grill, et al.) Their posterior policy is the unique policy which maximizes the expected reward given estimated action rewards minus a penalty for diverging from the prior policy. The tree explored by RMCTS is not defined in an adaptive manner, as it is in MCTS-UCB. Instead, the RMCTS tree is defined by following prior network policies at each node. This is a disadvantage, but the speedup advantage is more significant, and in practice we find that RMCTS-trained networks match the quality of MCTS-UCB-trained networks in roughly one-third of the training time. We include timing and quality comparisons of RMCTS vs. MCTS-UCB for three games: Connect-4, Dots-and-Boxes, and Othello.

[408] Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models

Rong Zhou, Dongping Chen, Zihan Jia, Yao Su, Yixin Liu, Yiwen Lu, Dongwei Shi, Yue Huang, Tianyang Xu, Yi Pan, Xinliang Li, Yohannes Abate, Qingyu Chen, Zhengzhong Tu, Yu Yang, Yu Zhang, Qingsong Wen, Gengchen Mai, Sunyang Fu, Jiachen Li, Xuyu Wang, Ziran Wang, Jing Huang, Tianming Liu, Yong Chen, Lichao Sun, Lifang He

Main category: cs.AI

TL;DR: AI transforms digital twins from passive simulations into intelligent autonomous systems through a four-stage framework spanning modeling, mirroring, intervention, and autonomous management.

Details

Motivation: Digital twins have evolved beyond passive simulation tools into intelligent entities through AI integration, requiring a systematic framework to characterize how AI methodologies are embedded across the digital twin lifecycle.

Method: Proposes a unified four-stage framework: (1) modeling physical twins using physics-based and physics-informed AI, (2) mirroring with real-time synchronization, (3) intervening through predictive modeling, anomaly detection, and optimization, and (4) autonomous management via LLMs, foundation models, and intelligent agents.

Result: Identifies synergy between physics-based modeling and data-driven learning, shift from traditional solvers to physics-informed/foundation models, and transformation of digital twins into proactive cognitive systems through generative AI. Cross-domain review across 11 application domains reveals common challenges in scalability, explainability, and trustworthiness.

Conclusion: AI-driven digital twins are evolving into autonomous cognitive systems, with future directions focusing on responsible AI integration addressing scalability, explainability, and trustworthiness challenges across diverse application domains.

Abstract: Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. This paper presents a unified four-stage framework that systematically characterizes AI integration across the digital twin lifecycle, spanning modeling, mirroring, intervention, and autonomous management. By synthesizing existing technologies and practices, we distill a unified four-stage framework that systematically characterizes how AI methodologies are embedded across the digital twin lifecycle: (1) modeling the physical twin through physics-based and physics-informed AI approaches, (2) mirroring the physical system into a digital twin with real-time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics-based modeling and data-driven learning, highlighting the shift from traditional numerical solvers to physics-informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self-improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross-domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI-driven digital twin systems.

[409] Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

Shengji Tang, Weihao Lin, Jingqi Ye, Hao Li, Bo Zhang, Shuyue Hu, Tao Chen, Wangli Ouyang, Lei Bai, Peng Ye

Main category: cs.AI

TL;DR: JiSi framework enables open-source LLMs to collectively surpass Gemini-3-Pro performance at 47% cost through improved routing, aggregation, and dynamic switching.

Details

Motivation: To explore collective intelligence as an alternative to monolithic scaling, addressing limitations in current LLM routing and aggregation methods that prevent effective collaboration among open-source models.

Method: JiSi framework with three innovations: 1) Query-Response Mixed Routing that captures both semantic information and problem difficulty, 2) Support-Set-based Aggregator Selection that evaluates aggregation and domain capacity, 3) Adaptive Routing-Aggregation Switch that dynamically leverages routing and aggregation advantages.

Result: JiSi surpasses Gemini-3-Pro performance with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines across nine benchmarks.

Conclusion: Collective intelligence represents a novel path toward AGI, demonstrating that collaboration among open-source LLMs can achieve superior performance compared to monolithic scaling approaches.

Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs’ collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks;(3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs’ collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).

[410] A unified multimodal understanding and generation model for cross-disciplinary scientific research

Xiaomeng Yang, Zhiyu Tan, Xiaohui Zhong, Mengping Yang, Qiusheng Huang, Lei Chen, Libo Wu, Hao Li

Main category: cs.AI

TL;DR: FuXi-Uni is a unified multimodal AI model that simultaneously understands and generates high-dimensional scientific data across Earth science and Biomedicine domains within a single architecture.

Details

Motivation: Scientific discovery increasingly requires integrating heterogeneous, high-dimensional data across disciplines, but current AI models are typically domain-specific and lack the capability to simultaneously understand and generate multimodal scientific data. Many pressing global challenges require coordinated progress across multiple scientific fields.

Method: FuXi-Uni aligns cross-disciplinary scientific tokens with natural language tokens and employs a science decoder to reconstruct scientific tokens. This unified architecture supports both natural language conversation and scientific numerical prediction within a single model.

Result: In Earth science: generates 10-day global weather forecasts at 0.25° resolution outperforming state-of-the-art physical forecasting systems; shows superior tropical cyclone track and intensity prediction; generates high-resolution regional weather fields surpassing interpolation baselines. In Biomedicine: outperforms leading multimodal LLMs on multiple biomedical visual question answering benchmarks.

Conclusion: FuXi-Uni successfully unifies heterogeneous scientific modalities within a native shared latent space while maintaining strong domain-specific performance, representing a step toward more general-purpose, multimodal scientific models.

Abstract: Scientific discovery increasingly relies on integrating heterogeneous, high-dimensional data across disciplines nowadays. While AI models have achieved notable success across various scientific domains, they typically remain domain-specific or lack the capability of simultaneously understanding and generating multimodal scientific data, particularly for high-dimensional data. Yet, many pressing global challenges and scientific problems are inherently cross-disciplinary and require coordinated progress across multiple fields. Here, we present FuXi-Uni, a native unified multimodal model for scientific understanding and high-fidelity generation across scientific domains within a single architecture. Specifically, FuXi-Uni aligns cross-disciplinary scientific tokens within natural language tokens and employs science decoder to reconstruct scientific tokens, thereby supporting both natural language conversation and scientific numerical prediction. Empirically, we validate FuXi-Uni in Earth science and Biomedicine. In Earth system modeling, the model supports global weather forecasting, tropical cyclone (TC) forecast editing, and spatial downscaling driven by only language instructions. FuXi-Uni generates 10-day global forecasts at 0.25° resolution that outperform the SOTA physical forecasting system. It shows superior performance for both TC track and intensity prediction relative to the SOTA physical model, and generates high-resolution regional weather fields that surpass standard interpolation baselines. Regarding biomedicine, FuXi-Uni outperforms leading multimodal large language models on multiple biomedical visual question answering benchmarks. By unifying heterogeneous scientific modalities within a native shared latent space while maintaining strong domain-specific performance, FuXi-Uni provides a step forward more general-purpose, multimodal scientific models.

[411] KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models

Zixian Liu, Sihao Liu, Yuqi Zhao

Main category: cs.AI

TL;DR: KGCE is a novel benchmarking platform for evaluating multimodal language model agents in educational cross-platform tasks, addressing deficiencies in existing benchmarks for school-specific software through knowledge base enhancement and dual-graph evaluation.

Details

Motivation: Existing benchmark frameworks have deficiencies in supporting cross-platform tasks in educational contexts, especially with school-specific software where agent efficiency decreases due to lack of structural understanding. Current evaluation methods rely on coarse-grained metrics that fail to capture detailed execution and efficiency in complex tasks.

Method: KGCE integrates knowledge base enhancement and a dual-graph evaluation framework. It constructs a dataset of 104 education-related tasks covering Windows, Android, and cross-platform collaborative tasks. The dual-graph framework decomposes tasks into sub-goals and verifies completion status for fine-grained evaluation. An enhanced agent system incorporates a knowledge base specific to school-specific software.

Result: The paper presents KGCE as a complete benchmarking platform with a dataset of 104 educational tasks and a novel evaluation framework. The enhanced agent system addresses execution bottlenecks in private-domain tasks through knowledge base integration.

Conclusion: KGCE provides a comprehensive solution for benchmarking multimodal language model agents in educational cross-platform tasks, offering fine-grained evaluation metrics and addressing the specific challenges of school-specific software through knowledge augmentation.

Abstract: With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private-domain software. Additionally, current evaluation methods heavily rely on coarse-grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual-graph evaluation framework. We first constructed a dataset comprising 104 education-related tasks, covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private-domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school-specific software. The code can be found at https://github.com/Kinginlife/KGCE.

[412] Empowering Small Language Models with Factual Hallucination-Aware Reasoning for Financial Classification

Han Yuan, Yilin Wu, Li Zhang, Zheng Ma

Main category: cs.AI

TL;DR: The paper proposes AAAI pipeline to mitigate factual hallucinations in small language models for financial classification, showing that reducing hallucinations improves classification performance.

Details

Motivation: Small language models (SLMs) are increasingly used in finance due to fast inference and local deployability, but they suffer from factual hallucinations and weaker classification performance compared to large language models. The paper investigates whether mitigating factual hallucinations can improve SLMs' financial classification.

Method: Proposes AAAI pipeline with three steps: Association Identification (finding connections between hallucinations and misclassifications), Automated Detection (using encoder-based verifiers to detect factual hallucinations), and Adaptive Inference (incorporating feedback on factual errors to enhance classification).

Result: Experiments on three representative SLMs show: (1) factual hallucinations are positively correlated with misclassifications; (2) encoder-based verifiers effectively detect factual hallucinations; and (3) incorporating feedback on factual errors enables adaptive inference that enhances classification performance.

Conclusion: The AAAI pipeline contributes to trustworthy and effective applications of SLMs in finance by mitigating factual hallucinations and improving classification performance through adaptive inference.

Abstract: Small language models (SLMs) are increasingly used for financial classification due to their fast inference and local deployability. However, compared with large language models, SLMs are more prone to factual hallucinations in reasoning and exhibit weaker classification performance. This raises a natural question: Can mitigating factual hallucinations improve SLMs’ financial classification? To address this, we propose a three-step pipeline named AAAI (Association Identification, Automated Detection, and Adaptive Inference). Experiments on three representative SLMs reveal that: (1) factual hallucinations are positively correlated with misclassifications; (2) encoder-based verifiers effectively detect factual hallucinations; and (3) incorporating feedback on factual errors enables SLMs’ adaptive inference that enhances classification performance. We hope this pipeline contributes to trustworthy and effective applications of SLMs in finance.

[413] A construction of an optimal base for conditional attribute and attributional condition implications in triadic contexts

Romuald Kwessy Mouona, Blaise Blériot Koguep Njionou, Etienne Romuald Temgoua Alomo, Rokia Missaoui, Leonard Kwuida

Main category: cs.AI

TL;DR: The paper studies implications in triadic contexts, focusing on conditional attribute and attributional condition implications introduced by Ganter and Obiedkov, with the goal of constructing an optimal base for these implications.

Details

Motivation: Triadic contexts (three-dimensional data structures) have implications that need to be properly analyzed and represented. The specific types of implications introduced by Ganter and Obiedkov (conditional attribute and attributional condition implications) require systematic study and efficient representation through optimal bases.

Method: The paper likely employs formal concept analysis methods for triadic contexts, building on Ganter and Obiedkov’s framework. It probably involves developing algorithms or theoretical foundations for constructing optimal implication bases in triadic settings.

Result: The paper presents methods for constructing optimal bases for conditional attribute and attributional condition implications in triadic contexts, providing efficient representations of implication knowledge.

Conclusion: The research contributes to formal concept analysis by extending implication base construction to triadic contexts, offering practical tools for knowledge representation in three-dimensional data structures.

Abstract: This article studies implications in triadic contexts. Specifically, we focus on those introduced by Ganter and Obiedkov, namely conditional attribute and attributional condition implications. Our aim is to construct an optimal base for these implications.

[414] Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning

Ahmed Dawoud, Osama El-Shamy

Main category: cs.AI

TL;DR: Neural Network-Enhanced Double Machine Learning framework uses text embeddings to reduce bias in causal effect estimation from 24% to -0.86% compared to tree-based methods.

Details

Motivation: Traditional econometric methods fail when unobserved confounders are orthogonal to structured covariates, while high-dimensional unstructured text contains rich proxies for these latent variables that can help satisfy the unconfoundedness assumption.

Method: Proposes a Neural Network-Enhanced Double Machine Learning (DML) framework that leverages text embeddings for causal identification, using deep learning architectures to model the continuous topology of embedding manifolds rather than standard tree-based estimators.

Result: Unstructured text embeddings capture critical confounding information absent from structured data. Standard tree-based DML estimators retain +24% bias, while the deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering ground-truth causal parameters.

Conclusion: Deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data, as they can properly model the continuous topology of text embedding manifolds that tree-based methods cannot capture.

Abstract: Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) due to their inability to model the continuous topology of embedding manifolds. In contrast, our deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data

[415] Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

Danial Amin

Main category: cs.AI

TL;DR: Proposes Bayesian multi-LLM orchestration framework for cost-sensitive sequential decisions, reducing costs by 34% and improving fairness by 45% compared to single-LLM baselines.

Details

Motivation: Current LLM deployment in asymmetric cost settings (hiring, medical triage, fraud detection) uses single LLM with confidence thresholding, which is inadequate for sequential decisions with costs. Need proper probabilistic foundations for cost-aware decision making.

Method: Treats LLMs as approximate likelihood models rather than classifiers. Uses contrastive prompting to elicit likelihoods for candidate states, aggregates across diverse models with robust statistics, updates beliefs with Bayes rule under explicit priors as evidence arrives. Enables coherent belief updating, expected-cost action selection, value-of-information gathering, and ensemble bias mitigation.

Result: In resume screening experiment with 1000 resumes using 5 LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek): reduced total cost by $294,000 (34%) vs best single-LLM baseline, improved demographic parity by 45% (max group gap reduced from 22% to 5%). Ablations show 51% savings from multi-LLM aggregation, 43% from sequential updating, 20% from disagreement-triggered information gathering.

Conclusion: Bayesian multi-LLM orchestration with proper probabilistic foundations significantly improves cost efficiency and fairness in sequential decision-making with asymmetric costs, outperforming single-LLM confidence-thresholding approaches.

Abstract: Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds “confidence,” and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single-LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.

[416] Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix

Fanzhe Fu

Main category: cs.AI

TL;DR: The paper proposes Project Aletheia, a framework to quantify “Cognitive Conviction” in AI reasoning models using Tikhonov Regularization and a Synthetic Proxy Protocol, introducing metrics to measure belief depth and safety alignment.

Details

Motivation: Current AI evaluation paradigms face an epistemological crisis - static benchmarks measure knowledge breadth but fail to quantify belief depth. The authors aim to extend the CHOKE phenomenon framework to System 2 reasoning models to measure cognitive conviction.

Method: Project Aletheia uses Tikhonov Regularization to invert the judge’s confusion matrix, implementing a Synthetic Proxy Protocol for validation without opaque private data. The framework measures cognitive conviction and introduces the Aligned Conviction Score (S_aligned) to ensure safety alignment.

Result: Preliminary pilot study on 2025 baselines (DeepSeek-R1, OpenAI o1) suggests reasoning models act as a “cognitive buffer” but may exhibit “Defensive OverThinking” under adversarial pressure. The framework successfully quantifies cognitive conviction while maintaining safety alignment.

Conclusion: This work provides a blueprint for measuring AI scientific integrity by quantifying cognitive conviction in reasoning models while ensuring safety alignment, addressing the limitations of current evaluation paradigms.

Abstract: In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify “Cognitive Conviction” in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge’s confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a “cognitive buffer,” they may exhibit “Defensive OverThinking” under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.

Letian Kong, Qianran, Jin, Renyu Zhang

Main category: cs.AI

TL;DR: A two-stage framework (context formation + context navigation) improves LLM behavioral alignment in complex decision-making tasks, validated across multiple experimental settings.

Details

Motivation: LLMs increasingly simulate human behavior in experiments but systematically diverge from human decisions in complex environments requiring anticipation of others' actions and belief formation based on observed behavior.

Method: Two-stage framework: 1) Context formation - explicitly specifying experimental design for accurate task representation; 2) Context navigation - guiding reasoning within that representation. Validated through replication of sequential purchasing game with quality signaling, extending to crowdfunding game with costly signaling and demand-estimation task.

Result: Across four SOTA models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), complex decision-making environments require both stages for behavioral alignment, while simpler demand-estimation task requires only context formation.

Conclusion: The framework clarifies when each stage is necessary and provides systematic approach for designing/diagnosing LLM social simulations as complements to human subjects in behavioral research.

Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in experimental settings, but they systematically diverge from human decisions in complex decision-making environments, where participants must anticipate others’ actions and form beliefs based on observed behavior. We propose a two-stage framework for improving behavioral alignment. The first stage, context formation, explicitly specifies the experimental design to establish an accurate representation of the decision task and its context. The second stage, context navigation, guides the reasoning process within that representation to make decisions. We validate this framework through a focal replication of a sequential purchasing game with quality signaling (Kremer and Debo, 2016), extending to a crowdfunding game with costly signaling (Cason et al., 2025) and a demand-estimation task (Gui and Toubia, 2025) to test generalizability across decision environments. Across four state-of-the-art (SOTA) models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), we find that complex decision-making environments require both stages to achieve behavioral alignment with human benchmarks, whereas the simpler demand-estimation task requires only context formation. Our findings clarify when each stage is necessary and provide a systematic approach for designing and diagnosing LLM social simulations as complements to human subjects in behavioral research.

[418] Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Mingyu Xu, Cheng Fang, Keyue Jiang, Yuqian Zheng, Yanghua Xiao, Baojian Zhou, Qifang Zhao, Suhang Zheng, Xiuwen Zhu, Jiyang Tang, Yongchi Zhao, Yijia Luo, Zhiqi Bai, Yuchi Xu, Wenbo Su, Wei Wang, Bing Zhao, Lin Qu, Xiaoxiao Xu

Main category: cs.AI

TL;DR: Logics-STEM is a reasoning model fine-tuned on a 10M-scale dataset for STEM domains, achieving state-of-the-art performance through data-algorithm co-design.

Details

Motivation: To enhance reasoning capabilities in STEM domains by combining large-scale open-source data with synthetic data through a data-algorithm co-design approach.

Method: Data-algorithm co-design with 5-stage data curation (annotation, deduplication, decontamination, distillation, stratified sampling) and failure-driven post-training framework using targeted knowledge retrieval and data synthesis around model failures.

Result: Achieves 4.68% average improvement over next-best 8B-scale model on STEM benchmarks; releases 8B/32B models and 10M/2.2M dataset versions publicly.

Conclusion: Demonstrates the potential of combining large-scale open-source data with synthetic data through data-algorithm co-design for enhancing reasoning capabilities, with publicly available resources for community research.

Abstract: We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.

[419] CaveAgent: Transforming LLMs into Stateful Runtime Operators

Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sihui Han, Bo An, Yike Guo, Jun Song

Main category: cs.AI

TL;DR: CaveAgent transforms LLMs from text generators to runtime operators using dual-stream architecture and stateful runtime management for complex task execution.

Details

Motivation: Current LLM-based agents are constrained by text-centric paradigms and procedural JSON-based function calling, which struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift.

Method: Introduces CaveAgent framework with Dual-stream Context Architecture (semantic stream for reasoning + Python Runtime stream for execution) and Stateful Runtime Management that injects, manipulates, and retrieves persistent Python objects across turns.

Result: Achieves 10.5% success rate improvement on retail tasks, reduces total token consumption by 28.4% in multi-turn scenarios, and reduces token consumption by 59% on data-intensive tasks while handling large-scale data that causes context overflow in other agents.

Conclusion: CaveAgent successfully transforms the paradigm from LLM-as-Text-Generator to LLM-as-Runtime-Operator, eliminating context drift and catastrophic forgetting while enabling lossless data flow to downstream applications.

Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from “LLM-as-Text-Generator” to “LLM-as-Runtime-Operator.” We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce \textit{Stateful Runtime Management} in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack the support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory to eliminate context drift, avoid catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau$^2$-bench, BFCL and various case studies across representative SOTA LLMs demonstrate CaveAgent’s superiority. Specifically, our framework achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and Code-based agents.

[420] Structured Decomposition for LLM Reasoning: Cross-Domain Validation and Semantic Web Integration

Albert Sadowski, Jarosław A. Chudziak

Main category: cs.AI

TL;DR: LLMs translate unstructured text into structured ABox assertions using expert TBox specifications, then SWRL reasoners apply rules with deterministic guarantees, achieving better performance than few-shot prompting across legal, scientific, and clinical domains.

Details

Motivation: Rule-based reasoning over natural language needs both interpretive flexibility (for unstructured text) and formal guarantees (for consistent rule application). LLMs provide flexibility but lack consistency guarantees, while symbolic systems provide guarantees but require structured input.

Method: Framework decomposes reasoning into: 1) entity identification, 2) assertion extraction, and 3) symbolic verification. LLMs serve as ontology population engines translating text into ABox assertions according to expert-authored TBox specifications, then SWRL-based reasoners apply rules with deterministic guarantees.

Result: Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) with eleven language models show structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains across all domains. Symbolic verification provides substantial benefit beyond structured prompting alone.

Conclusion: The integration pattern combines LLMs’ flexibility with symbolic systems’ guarantees, enabling auditable and justifiable rule-based reasoning over natural language. Populated ABox integrates with semantic web tooling for richer inference patterns than simpler formalisms can express.

Abstract: Rule-based reasoning over natural language input arises in domains where decisions must be auditable and justifiable: clinical protocols specify eligibility criteria in prose, evidence rules define admissibility through textual conditions, and scientific standards dictate methodological requirements. Applying rules to such inputs demands both interpretive flexibility and formal guarantees. Large language models (LLMs) provide flexibility but cannot ensure consistent rule application; symbolic systems provide guarantees but require structured input. This paper presents an integration pattern that combines these strengths: LLMs serve as ontology population engines, translating unstructured text into ABox assertions according to expert-authored TBox specifications, while SWRL-based reasoners apply rules with deterministic guarantees. The framework decomposes reasoning into entity identification, assertion extraction, and symbolic verification, with task definitions grounded in OWL 2 ontologies. Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) and eleven language models validate the approach. Structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains observed across all three domains. An ablation study confirms that symbolic verification provides substantial benefit beyond structured prompting alone. The populated ABox integrates with standard semantic web tooling for inspection and querying, positioning the framework for richer inference patterns that simpler formalisms cannot express.

[421] Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

YuanLab. ai, :, Shawn Wu, Sean Wang, Louie Li, Darcy Chen, Allen Wang, Jiangang Luo, Xudong Zhao, Joseph Shen, Gawain Ma, Jasper Jia, Marcus Mao, Claire Wang, Hunter He, Carol Wang, Zera Zhang, Jason Wang, Chonly Shen, Leo Zhang, Logan Chen, Qasim Meng, James Gong, Danied Zhao, Penn Zheng, Owen Zhu, Tong Yu

Main category: cs.AI

TL;DR: Yuan3.0 Flash is an open-source 3.7B activated/40B total parameter MoE multimodal LLM optimized for enterprise tasks with novel RAPO algorithm to prevent overthinking, achieving competitive performance with fewer tokens.

Details

Motivation: To create an efficient multimodal LLM specifically designed for enterprise-oriented tasks while maintaining general-purpose capabilities, and to address the overthinking problem common in large reasoning models.

Method: Developed Yuan3.0 Flash as a Mixture-of-Experts model with 3.7B activated parameters out of 40B total parameters, and introduced Reflection-aware Adaptive Policy Optimization (RAPO) - a novel RL training algorithm to regulate overthinking behaviors.

Result: Superior performance on enterprise tasks (RAG, complex table understanding, summarization), strong reasoning in math/science domains with accuracy comparable to frontier models while using only 1/4 to 1/2 of average tokens.

Conclusion: Yuan3.0 Flash successfully balances enterprise task optimization with general capabilities, addresses overthinking through RAPO, and achieves efficient performance with fewer computational resources, making it suitable for research and real-world deployment.

Abstract: We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics, science, etc., attaining accuracy comparable to frontier model while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

[422] AI Agent Systems: Architectures, Applications, and Evaluation

Bin Xu

Main category: cs.AI

TL;DR: This survey paper synthesizes the emerging landscape of AI agent architectures, covering deliberation/reasoning, planning/control, and tool calling/environment interaction, while organizing prior work into a unified taxonomy and discussing key design trade-offs and evaluation challenges.

Details

Motivation: AI agents that combine foundation models with reasoning, planning, memory, and tool use are becoming practical interfaces between natural-language intent and real-world computation, necessitating a comprehensive survey to organize and understand this rapidly evolving field.

Method: The paper synthesizes existing research into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, critics), orchestration patterns (single-agent vs. multi-agent; centralized vs. decentralized), and deployment settings (offline vs. online; safety-critical vs. open-ended).

Result: The survey organizes the AI agent landscape, identifies key design trade-offs (latency vs. accuracy, autonomy vs. controllability, capability vs. reliability), and highlights evaluation complexities including non-determinism, long-horizon credit assignment, tool/environment variability, and hidden costs like retries and context growth.

Conclusion: The paper identifies open challenges including verification/guardrails for tool actions, scalable memory/context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads, while summarizing current benchmarking practices and measurement approaches.

Abstract: AI agents – systems that combine foundation models with reasoning, planning, memory, and tool use – are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs.\ multi-agent; centralized vs.\ decentralized coordination), and deployment settings (offline analysis vs.\ online interactive assistance; safety-critical vs.\ open-ended tasks). We discuss key design trade-offs – latency vs.\ accuracy, autonomy vs.\ controllability, and capability vs.\ reliability – and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.

[423] A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

Yao Lu, Shang Liu, Hangan Zhou, Wenji Fang, Qijun Zhang, Zhiyao Xie

Main category: cs.AI

TL;DR: RTL-OPT is a benchmark for evaluating LLMs’ ability to optimize RTL code for power, performance, and area (PPA), addressing the gap in existing benchmarks that only assess syntactic correctness.

Details

Motivation: Current LLM benchmarks for RTL code generation focus on syntactic correctness but fail to evaluate optimization quality (PPA), which is crucial for efficient IC design. There's a need for standardized assessment of LLMs' ability to produce hardware-optimized code.

Method: Created RTL-OPT benchmark with 36 handcrafted digital designs covering combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task includes suboptimal RTL code and human-optimized reference with industry-proven patterns. Integrated automated evaluation framework to verify functional correctness and quantify PPA improvements.

Result: Developed a comprehensive benchmark that enables standardized assessment of generative models for hardware design optimization, with automated verification of functional correctness and PPA quantification.

Conclusion: RTL-OPT addresses the critical need for evaluating LLMs’ hardware optimization capabilities beyond syntactic correctness, providing a standardized framework to assess PPA improvements in generated RTL code.

Abstract: The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL codes, a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.

[424] Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches

Sai Varun Kodathala, Rakesh Vunnam

Main category: cs.AI

TL;DR: LLMs are better at symbolic manipulation than direct numerical solving for transcendental equations; hybrid approaches combining LLMs with classical solvers reduce errors by 68-82% compared to direct prediction.

Details

Motivation: Transcendental equations are common in engineering but require iterative numerical solutions. The paper investigates whether LLMs can solve these equations directly or if they work better as interfaces to classical solvers.

Method: Tested 6 state-of-the-art LLMs on 100 problems across 7 engineering domains. Compared direct numerical prediction against hybrid approach where LLMs handle symbolic manipulation and provide initial conditions while Newton-Raphson iteration performs numerical solution.

Result: Direct prediction had mean relative errors of 0.765-1.262, while solver-assisted computation achieved 0.225-0.301 (67.9-81.8% error reduction). Electronics showed 93.1% improvement due to exponential sensitivity, while Fluid Mechanics had only 7.2% gain.

Conclusion: Contemporary LLMs excel at symbolic manipulation and domain knowledge but struggle with precision-critical iterative arithmetic. They should be deployed as intelligent interfaces to classical numerical solvers rather than standalone computational engines.

Abstract: Transcendental equations requiring iterative numerical solution pervade engineering practice, from fluid mechanics friction factor calculations to orbital position determination. We systematically evaluate whether Large Language Models can solve these equations through direct numerical prediction or whether a hybrid architecture combining LLM symbolic manipulation with classical iterative solvers proves more effective. Testing six state-of-the-art models (GPT-5.1, GPT-5.2, Gemini-3-Flash, Gemini-2.5-Lite, Claude-Sonnet-4.5, Claude-Opus-4.5) on 100 problems spanning seven engineering domains, we compare direct prediction against solver-assisted computation where LLMs formulate governing equations and provide initial conditions while Newton-Raphson iteration performs numerical solution. Direct prediction yields mean relative errors of 0.765 to 1.262 across models, while solver-assisted computation achieves 0.225 to 0.301, representing error reductions of 67.9% to 81.8%. Domain-specific analysis reveals dramatic improvements in Electronics (93.1%) due to exponential equation sensitivity, contrasted with modest gains in Fluid Mechanics (7.2%) where LLMs exhibit effective pattern recognition. These findings establish that contemporary LLMs excel at symbolic manipulation and domain knowledge retrieval but struggle with precision-critical iterative arithmetic, suggesting their optimal deployment as intelligent interfaces to classical numerical solvers rather than standalone computational engines.

[425] PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism and Comprehensive AI Psychological Counselor

Qianjun Pan, Junyi Wang, Jie Zhou, Yutao Yang, Junsong Li, Kaiyin Xu, Yougen Zhou, Yihan Li, Jingyuan Zhao, Qin Chen, Ningning Zhou, Kai Chen, Liang He

Main category: cs.AI

TL;DR: PsychEval is a multi-session, multi-therapy benchmark for training realistic AI counselors with longitudinal memory, adaptive reasoning, and flexible therapeutic strategies across five modalities.

Details

Motivation: To develop reliable AI for psychological assessment by addressing three key challenges: training realistic AI counselors that handle longitudinal sessions, enabling multi-therapy flexibility for complex cases, and establishing systematic evaluation frameworks.

Method: Created a multi-session benchmark spanning 6-10 sessions across three stages with extensive skill annotations (677 meta-skills, 4577 atomic skills). Constructed diverse dataset covering five therapeutic modalities plus integrative therapy with unified clinical framework across six psychological topics. Built holistic evaluation with 18 therapy-specific/shared metrics and over 2,000 diverse client profiles.

Result: Extensive experimental analysis validates superior dataset quality and clinical fidelity. PsychEval serves as both benchmark and high-fidelity reinforcement learning environment for self-evolutionary training of clinically responsible AI counselors.

Conclusion: PsychEval transcends static benchmarking to enable training of adaptive AI counselors with realistic longitudinal capabilities, multi-therapy flexibility, and systematic evaluation, advancing AI for psychological assessment.

Abstract: To develop a reliable AI for psychological assessment, we introduce \texttt{PsychEval}, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: \textbf{1) Can we train a highly realistic AI counselor?} Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. \textbf{2) How to train a multi-therapy AI counselor?} While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. \textbf{3) How to systematically evaluate an AI counselor?} We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, \texttt{PsychEval} transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.

[426] Admissibility Alignment

Chris Duffey

Main category: cs.AI

TL;DR: Admissibility Alignment reframes AI alignment as a decision-theoretic property of policy selection under uncertainty, with MAP-AI as a practical architecture using Monte Carlo estimation and admissibility-controlled policy selection.

Details

Motivation: Current AI alignment approaches often treat alignment as static or binary, failing to address uncertainty, distributional outcomes, and decision-making under ambiguity. There's a need for a framework that evaluates alignment probabilistically across ensembles of plausible futures.

Method: MAP-AI architecture uses Monte Carlo estimation of outcome distributions and admissibility-controlled policy selection. It evaluates decision policies across ensembles of futures, modeling uncertainty, intervention effects, value ambiguity, and governance constraints. Alignment is assessed through distributional properties rather than static metrics.

Result: Provides a practical foundation for governing AI systems by evaluating policy behavior across distributions and tail events. Enables admissibility-controlled action selection that alters policy behavior under uncertainty without retraining underlying models.

Conclusion: Admissibility Alignment offers a decision-theoretic approach to AI alignment that distinguishes probabilistic prediction from decision reasoning under uncertainty, providing an executable methodology for evaluating trust and alignment in real-world AI systems.

Abstract: This paper introduces Admissibility Alignment: a reframing of AI alignment as a property of admissible action and decision selection over distributions of outcomes under uncertainty, evaluated through the behavior of candidate policies. We present MAP-AI (Monte Carlo Alignment for Policy) as a canonical system architecture for operationalizing admissibility alignment, formalizing alignment as a probabilistic, decision-theoretic property rather than a static or binary condition. MAP-AI, a new control-plane system architecture for aligned decision-making under uncertainty, enforces alignment through Monte Carlo estimation of outcome distributions and admissibility-controlled policy selection rather than static model-level constraints. The framework evaluates decision policies across ensembles of plausible futures, explicitly modeling uncertainty, intervention effects, value ambiguity, and governance constraints. Alignment is assessed through distributional properties including expected utility, variance, tail risk, and probability of misalignment rather than accuracy or ranking performance. This approach distinguishes probabilistic prediction from decision reasoning under uncertainty and provides an executable methodology for evaluating trust and alignment in enterprise and institutional AI systems. The result is a practical foundation for governing AI systems whose impact is determined not by individual forecasts, but by policy behavior across distributions and tail events. Finally, we show how distributional alignment evaluation can be integrated into decision-making itself, yielding an admissibility-controlled action selection mechanism that alters policy behavior under uncertainty without retraining or modifying underlying models.

[427] COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Dasol Choi, DongGeon Lee, Brigitta Jesica Kartono, Helena Berndt, Taeyoun Kwon, Joonwon Jang, Haon Park, Hwanjo Yu, Minsuk Kahng

Main category: cs.AI

TL;DR: COMPASS is a new framework for evaluating LLM compliance with organizational policies, revealing models handle legitimate requests well but fail catastrophically at enforcing prohibitions.

Details

Motivation: As LLMs are deployed in high-stakes enterprise applications (healthcare, finance), ensuring adherence to organization-specific policies has become essential, but existing safety evaluations focus only on universal harms, leaving organizational policy compliance unaddressed.

Method: COMPASS systematically evaluates LLM compliance with organizational allowlist and denylist policies across eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases.

Result: Evaluation of seven state-of-the-art models reveals a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations.

Conclusion: Current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.

Abstract: As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.

[428] Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

Udiptaman Das, Krishnasai B. Atmakuri, Duy Ho, Chi Lee, Yugyung Lee

Main category: cs.AI

TL;DR: LLM-based framework for constructing and evaluating clinical knowledge graphs from oncology narratives using multi-agent prompting and schema-constrained KG-RAG.

Details

Motivation: Existing KG construction methods for clinical narratives rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, which is especially problematic in oncology where precision is critical.

Method: End-to-end framework with: (1) prompt-driven entity/attribute/relation extraction, (2) entropy-based uncertainty scoring, (3) ontology-aligned RDF/OWL schema generation, (4) multi-LLM consensus validation for hallucination detection and semantic refinement, and (5) continuous refinement with self-supervised evaluation.

Result: Applied to PDAC and BRCA oncology cohorts, the method produces interpretable, SPARQL-compatible, clinically grounded KGs without gold-standard annotations, demonstrating consistent gains in precision, relevance, and ontology compliance over baselines.

Conclusion: The framework enables robust clinical KG construction from free text with built-in validation mechanisms, supporting iterative improvement and addressing critical accuracy needs in oncology applications.

Abstract: Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.

[429] Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios

Defei Xia, Bingfeng Pi, Shenbin Zhang, Song Hua, Yunfei Wei, Lei Zuo

Main category: cs.AI

TL;DR: Jenius-Agent is an LLM-based autonomous agent framework with adaptive prompting, context-aware tool orchestration, and layered memory mechanisms that improves task accuracy by 20% while reducing costs and latency.

Details

Motivation: While LLM-based agents have advanced, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. There's a need to improve task performance in context understanding, tool usage, and response generation for practical deployment.

Method: Three key innovations: (1) adaptive prompt generation strategy aligned with agent state and task goals; (2) context-aware tool orchestration with categorization, semantic retrieval, and adaptive invocation; (3) layered memory mechanism integrating session memory, task history, and external summaries with dynamic summarization. Integrated as Jenius-Agent framework with Model Context Protocol tools, file I/O, and execution feedback.

Result: 20% improvement in task accuracy, reduced token cost, response latency, and invocation failures. Framework deployed in Jenius (https://www.jenius.cn) as lightweight, scalable solution for robust, protocol-compatible autonomous agents.

Conclusion: The Jenius-Agent framework provides systematic optimization of LLM-based agent pipelines through adaptive prompting, intelligent tool orchestration, and layered memory, demonstrating practical improvements for real-world deployment.

Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent’s state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with a reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.

[430] Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

Kewen Cao, Jianxu Chen, Yongbing Zhang, Ye Zhang, Hongxiao Wang

Main category: cs.AI

TL;DR: SQL-based agentic framework for pathology image analysis that links cellular feature measurements to diagnostic conclusions through executable SQL queries, improving interpretability and traceability.

Details

Motivation: Current vision-language models for pathology image analysis produce correlational explanations without verifiable evidence, lacking the ability to show which slide features drive model decisions and why - similar to how pathologists need measurable observations to justify diagnoses.

Method: 1. Extract human-interpretable cellular features from pathology images. 2. Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. 3. Knowledge Comparison Agent evaluates findings against established pathological knowledge to mirror pathologists’ diagnostic justification process.

Result: Extensive experiments on two pathology visual question answering datasets demonstrate improved interpretability and decision traceability, with executable SQL traces that link cellular measurements to diagnostic conclusions.

Conclusion: The SQL-centered agentic framework enables auditable feature measurement and reasoning, producing verifiable evidence for model decisions that mimics clinical diagnostic processes while maintaining computational traceability.

Abstract: Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model’s decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.

[431] Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs

Farzan Karimi-Malekabadi, Suhaib Abdurahman, Zhivar Sourati, Jackson Trager, Morteza Dehghani

Main category: cs.AI

TL;DR: The paper identifies a theory gap in socio-cognitive LLM evaluations where benchmarks lack explicit theoretical grounding, leading to validity illusions and overgeneralization of results. It proposes Theory Trace Cards (TTCs) as documentation artifacts to make evaluation assumptions explicit.

Details

Motivation: Current socio-cognitive benchmarks for LLMs often fail to predict real-world performance despite high scores, creating an evaluation-deployment gap. While prior work focuses on measurement problems, this paper argues the core issue is the lack of explicit theoretical specification of target capabilities, leading to systematic misinterpretation of narrow benchmark results as evidence of broad competence.

Method: The paper makes two contributions: 1) Diagnoses and formalizes the theory gap as a foundational failure that undermines measurement validity and enables overgeneralization, and 2) Introduces Theory Trace Cards (TTCs) - lightweight documentation artifacts that explicitly outline the theoretical basis of evaluations, target capability components, operationalization, and limitations without requiring benchmark modifications or theoretical consensus.

Result: The paper presents a conceptual framework identifying how implicit theoretical assumptions in socio-cognitive evaluations create validity illusions. It proposes TTCs as a practical solution to enhance interpretability and reuse of evaluations by making explicit the full validity chain linking theory, task operationalization, scoring, and limitations.

Conclusion: Explicit theoretical grounding is essential for valid socio-cognitive LLM evaluations. Theory Trace Cards provide a practical approach to address the theory gap by documenting evaluation assumptions, thereby reducing validity illusions and preventing systematic overgeneralization of benchmark results while maintaining flexibility across different theoretical perspectives.

Abstract: Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability’s other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.

[432] MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning

Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh

Main category: cs.AI

TL;DR: MMP-A* integrates vision-language models with adaptive decay to provide spatial-grounded waypoint guidance for efficient path planning in complex environments.

Details

Motivation: Classical A* is computationally expensive for large-scale scenarios, while text-only LLM-based approaches lack spatial grounding and produce incorrect waypoints in topologically complex environments with dead ends.

Method: MMP-A* combines multimodal perception (vision-language models) with an adaptive decay mechanism that dynamically regulates uncertain waypoint influence in the heuristic, anchoring high-level reasoning in physical geometry.

Result: The framework achieves near-optimal trajectories with significantly reduced operational costs in challenging environments with severe clutter and topological complexity.

Conclusion: MMP-A* demonstrates potential as a perception-grounded and computationally efficient paradigm for autonomous navigation by addressing limitations of both classical A* and text-only planners.

Abstract: Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.

Victor Sanchez, Chris Reinke, Ahamed Mohamed, Xavier Alameda-Pineda

Main category: cs.AI

TL;DR: OpenSocInt is an open-source simulator for multi-modal social interactions with modular architecture for training social agents, demonstrated through social navigation experiments.

Details

Motivation: To provide an open-source platform for simulating and studying multi-modal social interactions, enabling research on social agents with different perceptual features and architectures.

Method: Developed a modular software package with simulator for social interactions, allowing exploration of different perceptual features, encoding methods, fusion techniques, and agent architectures.

Result: Created publicly available GPL-licensed software (OpenSocInt) and demonstrated its utility through experimental protocols based on social navigation tasks.

Conclusion: OpenSocInt provides a valuable open-source framework for research on social agents, enabling systematic exploration of multi-modal social interaction modeling with demonstrated applicability to social navigation.

Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We described the software package and showcased its interest via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

[434] CNC-TP: Classifier Nominal Concept Based on Top-Pertinent Attributes

Yasmine Souissi, Fabrice Boissier, Nida Meddouri

Main category: cs.AI

TL;DR: A state-of-the-art review of Formal Concept Analysis (FCA)-based classifiers, including a novel method for constructing partial concept lattices and experimental validation.

Details

Motivation: To address the need for interpretable and explainable learning in Knowledge Discovery in Databases (KDD), particularly focusing on Formal Concept Analysis (FCA) as an effective approach for classification tasks.

Method: The paper presents a comprehensive review of FCA-based classifiers, explores methods for computing closure operators from nominal data, and introduces a novel approach for constructing a partial concept lattice that focuses on the most relevant concepts.

Result: Experimental results demonstrate the efficiency of the proposed method for constructing partial concept lattices in classification tasks.

Conclusion: FCA-based classifiers offer interpretable and explainable learning through concept lattice structures, and the proposed partial concept lattice construction method provides an efficient approach for practical applications.

Abstract: Knowledge Discovery in Databases (KDD) aims to exploit the vast amounts of data generated daily across various domains of computer applications. Its objective is to extract hidden and meaningful knowledge from datasets through a structured process comprising several key steps: data selection, preprocessing, transformation, data mining, and visualization. Among the core data mining techniques are classification and clustering. Classification involves predicting the class of new instances using a classifier trained on labeled data. Several approaches have been proposed in the literature, including Decision Tree Induction, Bayesian classifiers, Nearest Neighbor search, Neural Networks, Support Vector Machines, and Formal Concept Analysis (FCA). The last one is recognized as an effective approach for interpretable and explainable learning. It is grounded in the mathematical structure of the concept lattice, which enables the generation of formal concepts and the discovery of hidden relationships among them. In this paper, we present a state-of-theart review of FCA-based classifiers. We explore various methods for computing closure operators from nominal data and introduce a novel approach for constructing a partial concept lattice that focuses on the most relevant concepts. Experimental results are provided to demonstrate the efficiency of the proposed method.

[435] ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems

Noel Thomas

Main category: cs.AI

TL;DR: ChaosBench-Logic is a benchmark evaluating LLM reasoning on chaotic dynamical systems using first-order logic, revealing that while frontier LLMs achieve high per-item accuracy (91-94%), they fail completely on compositional reasoning and show fragile global coherence.

Details

Motivation: LLMs excel at natural language tasks but struggle with precise logical and symbolic reasoning, especially in chaotic dynamical systems where chaos is often misinterpreted as randomness or complexity rather than deterministic behavior.

Method: Created ChaosBench-Logic benchmark with 30 diverse dynamical systems annotated using unified first-order logic ontology (11 semantic predicates). Generated 621 questions across 7 reasoning categories including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. Developed metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction.

Result: Frontier LLMs (GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, LLaMA-3 70B) achieve 91-94% per-item accuracy but score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot).

Conclusion: ChaosBench-Logic provides a rigorous testbed for diagnosing LLM reasoning failures and serves as a foundation for developing neuro-symbolic approaches to improve scientific reasoning capabilities in LLMs.

Abstract: Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.

[436] MindChat: A Privacy-preserving Large Language Model for Mental Health Support

Dong Xue, Jicheng Tu, Ming Wang, Xin Yan, Fangzhou Liu, Jie Hu

Main category: cs.AI

TL;DR: MindChat is a privacy-preserving LLM for mental health support trained on MindCorpus, a synthetic counseling dataset created via multi-agent role-playing with dual feedback loops, using federated learning with differential privacy for privacy protection.

Details

Motivation: Training LLMs for mental health support is challenging due to scarcity and sensitivity of real counseling dialogues, requiring privacy-preserving approaches to handle sensitive data.

Method: 1) Created MindCorpus using multi-agent role-playing framework with dual closed-loop feedback: turn-level critique-and-revision and session-level strategy refinement. 2) Fine-tuned base model using federated learning with LoRA adapters and differentially private optimization.

Result: MindCorpus improves training effectiveness; MindChat is competitive with existing general and counseling-oriented LLM baselines in both automatic LLM-judge and human evaluations, while exhibiting reduced privacy leakage under membership inference attacks.

Conclusion: The proposed approach successfully addresses privacy concerns in mental health LLMs while maintaining counseling effectiveness through synthetic data generation and privacy-preserving training techniques.

Abstract: Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy-preserving LLM for mental health support, together with MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework. To synthesize high-quality counseling data, the developed dialogue-construction framework employs a dual closed-loop feedback design to integrate psychological expertise and counseling techniques through role-playing: (i) turn-level critique-and-revision to improve coherence and counseling appropriateness within a session, and (ii) session-level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine-tune the base model using federated learning with parameter-efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic-data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling-oriented LLM baselines under both automatic LLM-judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.

[437] XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging

Midhat Urooj, Ayan Banerjee, Sandeep Gupta

Main category: cs.AI

TL;DR: XAIMeD is a neuro-symbolic medical AI framework that integrates clinical expert knowledge to improve robustness under distribution shifts, enhance rare class sensitivity, and provide transparent interpretations.

Details

Motivation: Address critical challenges in medical AI: deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions, lacking explainability and reliability for rare classes.

Method: Unified neuro-symbolic architecture that encodes clinical expertise as logical connectives over atomic medical propositions, creating machine-checkable class-specific rules. Uses weighted feature satisfaction scores for symbolic reasoning, confidence-weighted fusion to integrate symbolic and neural outputs, and Hunt-inspired adaptive routing guided by Entropy Imbalance Gain and Rare Class Gini.

Result: Substantial performance improvements across diverse modalities: 6% gains in cross-domain generalization and 10% improved rare class F1 score, far outperforming state-of-the-art deep learning baselines on tasks including Seizure Onset Zone localization and Diabetic Retinopathy grading across 6 multicenter datasets.

Conclusion: XAIMeD provides a principled, clinically faithful, and interpretable approach to multimodal medical AI, with symbolic components acting as effective regularizers for robustness to distribution shifts while improving rare class reliability and explainability.

Abstract: Explainability domain generalization and rare class reliability are critical challenges in medical AI where deep models often fail under real world distribution shifts and exhibit bias against infrequent clinical conditions This paper introduces XAIMeD an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro symbolic architecture XAIMeD is designed to improve robustness under distribution shift enhance rare class sensitivity and deliver transparent clinically aligned interpretations The framework encodes clinical expertise as logical connectives over atomic medical propositions transforming them into machine checkable class specific rules Their diagnostic utility is quantified through weighted feature satisfaction scores enabling a symbolic reasoning branch that complements neural predictions A confidence weighted fusion integrates symbolic and deep outputs while a Hunt inspired adaptive routing mechanism guided by Entropy Imbalance Gain EIG and Rare Class Gini mitigates class imbalance high intra class variability and uncertainty We evaluate XAIMeD across diverse modalities on four challenging tasks i Seizure Onset Zone SOZ localization from rs fMRI ii Diabetic Retinopathy grading across 6 multicenter datasets demonstrate substantial performance improvements including 6 percent gains in cross domain generalization and a 10 percent improved rare class F1 score far outperforming state of the art deep learning baselines Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers ensuring robustness to distribution shifts XAIMeD thus provides a principled clinically faithful and interpretable approach to multimodal medical AI.

[438] Simulated Reasoning is Reasoning

Hendrik Kempt, Alon Lavie

Main category: cs.AI

TL;DR: The paper argues that foundational models demonstrate a new form of reasoning through imitation and iteration, challenging traditional symbolic reasoning concepts and requiring updated philosophical frameworks and safety considerations.

Details

Motivation: To analyze how foundational models challenge traditional understanding of reasoning as symbolic processes, and to develop new philosophical interpretations and safety frameworks for these emerging AI reasoning capabilities.

Method: Philosophical analysis and conceptual examination of foundational models’ reasoning capabilities, comparing them to traditional symbolic reasoning and human reasoning processes.

Result: Foundational models demonstrate reasoning through imitation, testing, and iteration rather than symbolic understanding, creating a fundamentally different form of reasoning that requires abandoning outdated metaphors like “stochastic parrot” and developing new safety frameworks.

Conclusion: The emergence of AI reasoning through foundational models requires updated philosophical interpretations, abandonment of outdated metaphors, and development of new normative frameworks for safety and appropriateness in AI systems.

Abstract: Reasoning has long been understood as a pathway between stages of understanding. Proper reasoning leads to understanding of a given subject. This reasoning was conceptualized as a process of understanding in a particular way, i.e., “symbolic reasoning”. Foundational Models (FM) demonstrate that this is not a necessary condition for many reasoning tasks: they can “reason” by way of imitating the process of “thinking out loud”, testing the produced pathways, and iterating on these pathways on their own. This leads to some form of reasoning that can solve problems on its own or with few-shot learning, but appears fundamentally different from human reasoning due to its lack of grounding and common sense, leading to brittleness of the reasoning process. These insights promise to substantially alter our assessment of reasoning and its necessary conditions, but also inform the approaches to safety and robust defences against this brittleness of FMs. This paper offers and discusses several philosophical interpretations of this phenomenon, argues that the previously apt metaphor of the “stochastic parrot” has lost its relevance and thus should be abandoned, and reflects on different normative elements in the safety- and appropriateness-considerations emerging from these reasoning models and their growing capacity.

[439] Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management

Faizan Ahmed, Aniket Dixit, James Brusey

Main category: cs.AI

TL;DR: Higher-order action regularization (especially jerk minimization) improves RL policy smoothness, reducing erratic control and equipment switching by 60% in building energy management.

Details

Motivation: Deep RL agents often exhibit erratic, high-frequency control behaviors that cause excessive energy consumption and mechanical wear, hindering real-world deployment in applications like building energy management.

Method: Systematic investigation of action smoothness regularization using higher-order derivative penalties, progressing from theoretical understanding in continuous control benchmarks to practical validation in building energy management (HVAC control systems).

Result: Third-order derivative penalties (jerk minimization) consistently achieve superior smoothness while maintaining competitive performance across four continuous control environments. In HVAC control, smooth policies reduce equipment switching by 60%, translating to significant operational benefits.

Conclusion: Higher-order action regularization serves as an effective bridge between RL optimization and operational constraints in energy-critical applications, enabling more practical deployment of RL agents in real-world systems.

Abstract: Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate action smoothness regularization through higher-order derivative penalties, progressing from theoretical understanding in continuous control benchmarks to practical validation in building energy management. Our comprehensive evaluation across four continuous control environments demonstrates that third-order derivative penalties (jerk minimization) consistently achieve superior smoothness while maintaining competitive performance. We extend these findings to HVAC control systems where smooth policies reduce equipment switching by 60%, translating to significant operational benefits. Our work establishes higher-order action regularization as an effective bridge between RL optimization and operational constraints in energy-critical applications.

[440] FormuLLA: A Large Language Model Approach to Generating Novel 3D Printable Formulations

Adeshola Okubena, Yusuf Ali Mohammed, Moe Elbadawi

Main category: cs.AI

TL;DR: LLMs fine-tuned on FDM 3D printing formulation data can recommend excipients and predict filament properties, but face challenges like catastrophic forgetting and limited evaluation metrics.

Details

Motivation: To advance pharmaceutical 3D printing by applying AI beyond narrow predictive modeling, using LLMs for more generalized reasoning in formulation development, addressing current limitations in AI-driven approaches.

Method: Fine-tuned four LLM architectures on a fused deposition modeling dataset of 1400+ formulations to recommend excipients based on API dose and predict filament mechanical properties, with systematic evaluation of fine-tuning and generative parameters.

Result: Llama2 performed best for excipient recommendations; model selection and parameterization significantly affect performance; smaller LLMs showed catastrophic forgetting; standard LLM metrics don’t evaluate formulation processability; biomedical-trained LLMs don’t always yield optimal results.

Conclusion: Addressing challenges like catastrophic forgetting and developing better evaluation metrics beyond linguistic proficiency is essential for advancing LLMs toward reliable pharmaceutical formulation development systems.

Abstract: Pharmaceutical three-dimensional (3D) printing is an advanced fabrication technology with the potential to enable truly personalised dosage forms. Recent studies have integrated artificial intelligence (AI) to accelerate formulation and process development, drastically transforming current approaches to pharmaceutical 3D printing. To date, most AI-driven efforts remain narrowly focused, while failing to account for the broader formulation challenges inherent to the technology. Recent advances in AI have introduced artificial general intelligence concepts, wherein systems extend beyond conventional predictive modelling toward more generalised, human-like reasoning. In this work, we investigate the application of large language models (LLMs), fine-tuned on a fused deposition modelling (FDM) dataset comprising over 1400 formulations, to recommend suitable excipients based on active pharmaceutical ingredient (API) dose, and predict filament mechanical properties. Four LLM architectures were fine-tuned, with systematic evaluation of both fine-tuning and generative parameter configurations. Our results demonstrate that Llama2 was best suited for recommending excipients for FDM formulations. Additionally, model selection and parameterisation significantly influence performance, with smaller LLMs exhibiting instances of catastrophic forgetting. Furthermore, we demonstrate: (i) even with relatively small dataset of over 1400 formulations, it can lead to model catastrophic forgetting; (ii) standard LLM metrics only evaluate linguistic performance but not formulation processability; and (iii) LLMs trained on biomedically-related data do not always produce the best results. Addressing these challenges is essential to advancing LLMs beyond linguistic proficiency and toward reliable systems for pharmaceutical formulation development.

[441] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng

Main category: cs.AI

TL;DR: EverMemOS is a self-organizing memory operating system for LLMs that implements an engram-inspired lifecycle to manage long-term interactions through episodic trace formation, semantic consolidation, and reconstructive recollection.

Details

Motivation: LLMs have limited context windows that make it difficult to sustain coherent behavior over extended interactions. Existing memory systems store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts.

Method: EverMemOS implements an engram-inspired lifecycle with three key components: 1) Episodic Trace Formation converts dialogue streams into MemCells capturing episodic traces, atomic facts, and Foresight signals; 2) Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles; 3) Reconstructive Recollection performs MemScene-guided agentic retrieval to compose necessary context for downstream reasoning.

Result: Experiments on LoCoMo and LongMemEval show state-of-the-art performance on memory-augmented reasoning tasks. Additional profile study on PersonaMem v2 and qualitative case studies demonstrate chat-oriented capabilities including user profiling and Foresight.

Conclusion: EverMemOS provides an effective memory operating system for LLMs that addresses limitations of existing memory systems by implementing a biologically-inspired memory lifecycle, enabling coherent long-term interactions through structured memory organization and intelligent retrieval.

Abstract: Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat-oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind-AI/EverMemOS.

[442] Streaming Hallucination Detection in Long Chain-of-Thought Reasoning

Haolang Lu, Minghui Pan, Ripeng Li, Guoshun Nan, Jialin Zhuang, Zijie Zhao, Zhongxiang Sun, Kun Wang, Yang Liu

Main category: cs.AI

TL;DR: The paper introduces a streaming hallucination detection method for long chain-of-thought reasoning that treats hallucinations as evolving latent states rather than isolated events, using cumulative prefix-level signals to track reasoning state evolution.

Details

Motivation: Hallucinations in long chain-of-thought reasoning emerge subtly and propagate across reasoning steps, making traditional one-off detection approaches insufficient. The authors argue that hallucination should be understood as an evolving latent state rather than a single erroneous event.

Method: The approach treats step-level hallucination judgments as local observations and introduces a cumulative prefix-level hallucination signal that tracks the global evolution of the reasoning state over the entire trajectory. This enables streaming detection in long CoT reasoning.

Result: The method provides real-time, interpretable evidence for hallucination detection during long chain-of-thought reasoning, allowing for monitoring of reasoning state evolution throughout the entire reasoning trajectory.

Conclusion: Hallucination in long CoT reasoning is better modeled as an evolving latent state, and the proposed cumulative prefix-level signal approach enables effective streaming detection with interpretable evidence, addressing the propagation of subtle hallucinations across reasoning steps.

Abstract: Long chain-of-thought (CoT) reasoning improves the performance of large language models, yet hallucinations in such settings often emerge subtly and propagate across reasoning steps. We suggest that hallucination in long CoT reasoning is better understood as an evolving latent state rather than a one-off erroneous event. Accordingly, we treat step-level hallucination judgments as local observations and introduce a cumulative prefix-level hallucination signal that tracks the global evolution of the reasoning state over the entire trajectory. Overall, our approach enables streaming hallucination detection in long CoT reasoning, providing real-time, interpretable evidence.

[443] Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Sourena Khanzadeh

Main category: cs.AI

TL;DR: LLM agents’ reasoning traces (Chain-of-Thought) often don’t actually drive decisions - they’re “Reasoning Theater” while decisions come from latent parametric priors.

Details

Motivation: As LLM agents handle high-stakes decisions, transparency of reasoning is critical for safety. Need to verify if CoT reasoning traces are faithful drivers of output or just post-hoc rationalizations.

Method: Project Ariadne framework uses Structural Causal Models and counterfactual logic with hard interventions (do-calculus) on reasoning nodes. Systematically inverts logic, negates premises, reverses factual claims to measure Causal Sensitivity (φ) of final answers.

Result: Reveals persistent Faithfulness Gap with widespread Causal Decoupling failure mode. Agents exhibit violation density (ρ) up to 0.77 in factual/scientific domains - reach identical conclusions despite contradictory internal logic.

Conclusion: Current agentic architectures are inherently prone to unfaithful explanation. CoT traces function as “Reasoning Theater” while decisions governed by latent parametric priors. Propose Ariadne Score as benchmark for aligning stated logic with model action.

Abstract: As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While \textit{Chain-of-Thought} (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are \textbf{faithful} generative drivers of the model’s output or merely \textbf{post-hoc rationalizations}. We introduce \textbf{Project Ariadne}, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs \textbf{hard interventions} ($do$-calculus) on intermediate reasoning nodes – systematically inverting logic, negating premises, and reversing factual claims – to measure the \textbf{Causal Sensitivity} ($φ$) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent \textit{Faithfulness Gap}. We define and detect a widespread failure mode termed \textbf{Causal Decoupling}, where agents exhibit a violation density ($ρ$) of up to $0.77$ in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as “Reasoning Theater” while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.

[444] Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

Falcon LLM Team, Iheb Chaabane, Puneesh Khanna, Suhail Mohmad, Slim Frikha, Shi Hu, Abdalgader Abubaker, Reda Alami, Mikhail Lubinets, Mohamed El Amine Seddik, Hakim Hacid

Main category: cs.AI

TL;DR: Falcon-H1R is a 7B-parameter reasoning-optimized model that achieves competitive performance with models 2-7x larger through careful data curation, targeted training, and efficient architecture design.

Details

Motivation: To demonstrate that small language models (SLMs) can achieve competitive reasoning performance through targeted optimization rather than simply scaling model size, addressing the need for parameter-efficient reasoning systems.

Method: Combines careful data curation, targeted training strategies (efficient supervised fine-tuning and RL scaling), hybrid-parallel architecture design for faster inference, and leverages DeepConf approach for state-of-the-art test-time scaling efficiency.

Result: Consistently matches or outperforms SOTA reasoning models that are 2-7x larger across reasoning-intensive benchmarks, achieving faster inference, token efficiency, and higher accuracy while maintaining compact 7B parameter size.

Conclusion: Compact models like Falcon-H1R-7B can deliver robust and scalable reasoning performance through targeted training and architectural choices, establishing them as practical backbones for scaling advanced reasoning systems, especially for chain-of-thought generation and parallel test-time scaling scenarios.

Abstract: This work introduces Falcon-H1R, a 7B-parameter reasoning-optimized model that establishes the feasibility of achieving competitive reasoning performance with small language models (SLMs). Falcon-H1R stands out for its parameter efficiency, consistently matching or outperforming SOTA reasoning models that are $2\times$ to $7\times$ larger across a variety of reasoning-intensive benchmarks. These results underscore the importance of careful data curation and targeted training strategies (via both efficient SFT and RL scaling) in delivering significant performance gains without increasing model size. Furthermore, Falcon-H1R advances the 3D limits of reasoning efficiency by combining faster inference (through its hybrid-parallel architecture design), token efficiency, and higher accuracy. This unique blend makes Falcon-H1R-7B a practical backbone for scaling advanced reasoning systems, particularly in scenarios requiring extensive chain-of-thoughts generation and parallel test-time scaling. Leveraging the recently introduced DeepConf approach, Falcon-H1R achieves state-of-the-art test-time scaling efficiency, offering substantial improvements in both accuracy and computational cost. As a result, Falcon-H1R demonstrates that compact models, through targeted model training and architectural choices, can deliver robust and scalable reasoning performance.

[445] On the Representation of Pairwise Causal Background Knowledge and Its Applications in Causal Inference

Zhuangyan Fang, Ruiqi Zhao, Yue Liu, Yangbo He

Main category: cs.AI

TL;DR: This paper provides a comprehensive framework for incorporating pairwise causal background knowledge into causal discovery, including graphical characterization of causal MPDAGs, representation of causal knowledge via DCCs, and algorithms for causal effect identification.

Details

Motivation: Pairwise causal background knowledge (about existence/absence of causal edges/paths) is common in observational studies but lacks a unified framework for representation and utilization in causal discovery and effect identification.

Method: Introduces causal maximally partially directed acyclic graphs (MPDAGs) to represent constrained Markov equivalence classes, develops direct causal clauses (DCCs) to unify three types of pairwise causal knowledge, provides polynomial-time algorithms for consistency checking and decomposition, and develops local IDA-type algorithms for effect estimation.

Result: Provides sound/complete graphical characterization of causal MPDAGs, shows pairwise knowledge can be uniquely decomposed into MPDAG plus residual DCCs, proves identifiability of causal effects depends only on decomposed MPDAG, and demonstrates significant improvement in identifiability through simulations.

Conclusion: The framework enables systematic incorporation of pairwise causal background knowledge, significantly improves causal effect identifiability, and provides practical algorithms for causal discovery and effect estimation in observational studies.

Abstract: Pairwise causal background knowledge about the existence or absence of causal edges and paths is frequently encountered in observational studies. Such constraints allow the shared directed and undirected edges in the constrained subclass of Markov equivalent DAGs to be represented as a causal maximally partially directed acyclic graph (MPDAG). In this paper, we first provide a sound and complete graphical characterization of causal MPDAGs and introduce a minimal representation of a causal MPDAG. Then, we give a unified representation for three types of pairwise causal background knowledge, including direct, ancestral and non-ancestral causal knowledge, by introducing a novel concept called direct causal clause (DCC). Using DCCs, we study the consistency and equivalence of pairwise causal background knowledge and show that any pairwise causal background knowledge set can be uniquely and equivalently decomposed into the causal MPDAG representing the refined Markov equivalence class and a minimal residual set of DCCs. Polynomial-time algorithms are also provided for checking consistency and equivalence, as well as for finding the decomposed MPDAG and the residual DCCs. Finally, with pairwise causal background knowledge, we prove a sufficient and necessary condition to identify causal effects and surprisingly find that the identifiability of causal effects only depends on the decomposed MPDAG. We also develop a local IDA-type algorithm to estimate the possible values of an unidentifiable effect. Simulations suggest that pairwise causal background knowledge can significantly improve the identifiability of causal effects.

[446] Geometry-induced Regularization in Deep ReLU Neural Networks

Joachim Bona-Pellissier, François Malgouyres, François Bachoc

Main category: cs.AI

TL;DR: The paper introduces a geometric framework using local dimension to explain implicit regularization in deep ReLU networks, connecting it to flat minima, saddle-to-saddle dynamics, and neuron alignment phenomena.

Details

Motivation: To understand puzzling phenomena in deep learning: why large neural networks don't overfit despite many parameters, properties of flat minima, saddle-to-saddle dynamics, and neuron alignment. These phenomena are usually studied in isolation without a unified explanation.

Method: Study local geometry of deep ReLU networks by analyzing how the image of a sample X forms a set with varying local dimension as weights change. Partition parameter space into regions where local dimension remains constant, and show this dimension is invariant under ReLU symmetries (positive rescalings and neuron permutations).

Result: The network’s geometry induces regularization with local dimension as key regularity measure. Local dimension relates to flatness of minima and saddle-to-saddle dynamics. For shallow networks, local dimension connects to number of linear regions perceived by X. Experiments on MNIST highlight geometry-induced regularization.

Conclusion: Provides first simple unified geometric explanation for multiple deep learning phenomena using local dimension framework. The geometric perspective connects implicit regularization, flat minima, saddle dynamics, and neuron alignment, offering new insights into why large networks generalize well.

Abstract: Neural networks with a large number of parameters often do not overfit, owing to implicit regularization that favors \lq good\rq{} networks. Other related and puzzling phenomena include properties of flat minima, saddle-to-saddle dynamics, and neuron alignment. To investigate these phenomena, we study the local geometry of deep ReLU neural networks. We show that, for a fixed architecture, as the weights vary, the image of a sample $X$ forms a set whose local dimension changes. The parameter space is partitioned into regions where this local dimension remains constant. The local dimension is invariant under the natural symmetries of ReLU networks (i.e., positive rescalings and neuron permutations). We establish then that the network’s geometry induces a regularization, with the local dimension serving as a key measure of regularity. Moreover, we relate the local dimension to a new notion of flatness of minima and to saddle-to-saddle dynamics. For shallow networks, we also show that the local dimension is connected to the number of linear regions perceived by $X$, offering insight into the effects of regularization. This is further supported by experiments and linked to neuron alignment. Our analysis offers, for the first time, a simple and unified geometric explanation that applies to all learning contexts for these phenomena, which are usually studied in isolation. Finally, we explore the practical computation of the local dimension and present experiments on the MNIST dataset, which highlight geometry-induced regularization in this setting.

[447] Uncertainty Quantification of Surrogate Models using Conformal Prediction

Vignesh Gopakumar, Ander Gray, Joel Oskarsson, Lorenzo Zanisi, Daniel Giles, Matt J. Kusner, Stanislas Pamela, Marc Peter Deisenroth

Main category: cs.AI

TL;DR: A conformal prediction framework provides statistically guaranteed uncertainty quantification for surrogate models with near-zero computational cost, handling high-dimensional spatio-temporal outputs while maintaining coverage guarantees even for out-of-distribution predictions.

Details

Motivation: Current data-driven surrogate models lack reliable uncertainty quantification, limiting their use in safety-critical applications. Bayesian methods are computationally expensive for high-dimensional problems and offer no statistical guarantees.

Method: A conformal prediction framework that performs cell-wise calibration while preserving tensorial structure, enabling model-agnostic uncertainty quantification. Three nonconformity scores are evaluated: conformalised quantile regression, absolute error residual, and standard deviation.

Result: The method achieves empirical coverage with valid error bars across diverse applications (fluid dynamics, magnetohydrodynamics, weather forecasting, fusion diagnostics), works with any model architecture, and maintains coverage even for out-of-distribution predictions. Calibration takes only seconds to minutes on standard hardware.

Conclusion: The framework provides a practical, computationally efficient solution for trustworthy deployment of machine learning in physical sciences, circumventing the curse of dimensionality in traditional uncertainty quantification while enabling rigorous validation of pre-trained models without retraining.

Abstract: Data-driven surrogate models offer quick approximations to complex numerical and experimental systems but typically lack uncertainty quantification, limiting their reliability in safety-critical applications. While Bayesian methods provide uncertainty estimates, they offer no statistical guarantees and struggle with high-dimensional spatio-temporal problems due to computational costs. We present a conformal prediction (CP) framework that provides statistically guaranteed marginal coverage for surrogate models in a model-agnostic manner with near-zero computational cost. Our approach handles high-dimensional spatio-temporal outputs by performing cell-wise calibration while preserving the tensorial structure of predictions. Through extensive empirical evaluation across diverse applications including fluid dynamics, magnetohydrodynamics, weather forecasting, and fusion diagnostics, we demonstrate that CP achieves empirical coverage with valid error bars regardless of model architecture, training regime, or output dimensionality. We evaluate three nonconformity scores (conformalised quantile regression, absolute error residual, and standard deviation) for both deterministic and probabilistic models, showing that guaranteed coverage holds even for out-of-distribution predictions where models are deployed on physics regimes different from training data. Calibration requires only seconds to minutes on standard hardware. The framework enables rigorous validation of pre-trained surrogate models for downstream applications without retraining. While CP provides marginal rather than conditional coverage and assumes exchangeability between calibration and test data, our method circumvents the curse of dimensionality inherent in traditional uncertainty quantification approaches, offering a practical tool for trustworthy deployment of machine learning in physical sciences.

[448] DeepFilter: A Transformer-style Framework for Accurate and Efficient Process Monitoring

Hao Wang, Zhichao Chen, Licheng Pan, Xiaoyu Jiang, Yichen Song, Qunshan He, Xinggao Liu

Main category: cs.AI

TL;DR: DeepFilter improves process monitoring by revising transformer self-attention for better accuracy and efficiency with monitoring logs.

Details

Motivation: Current transformer-based methods have limitations in accurately understanding semantic context and efficiently processing monitoring logs for process monitoring tasks that require high accuracy and efficiency.

Method: Introduces DeepFilter which revises the self-attention mechanism to improve both accuracy and efficiency in processing monitoring logs.

Result: DeepFilter provides a straightforward yet versatile approach that serves as an instrumental baseline for practitioners in process monitoring.

Conclusion: DeepFilter addresses limitations of current transformer-based methods and offers a practical solution for both new projects and enhancing existing process monitoring capabilities.

Abstract: The process monitoring task is characterized by stringent demands for accuracy and efficiency. Current transformer-based methods, characterized by self-attention for temporal fusion, exhibit limitations in accurately understanding the semantic context and efficiently processing monitoring logs, rendering them inadequate for process monitoring. To address these limitations, we introduce DeepFilter, which revises the self-attention mechanism to improve both accuracy and efficiency. As a straightforward yet versatile approach, DeepFilter provides an instrumental baseline for practitioners in process monitoring, whether initiating new projects or enhancing existing capabilities.

[449] GNN-XAR: A Graph Neural Network for Explainable Activity Recognition in Smart Homes

Michele Fiori, Davide Mor, Gabriele Civitarese, Claudio Bettini

Main category: cs.AI

TL;DR: First explainable Graph Neural Network for smart home Human Activity Recognition that provides better explanations than state-of-the-art methods while improving recognition rates.

Details

Motivation: Existing deep learning models for sensor-based HAR lack explainability, and while XAI approaches exist, they use classic models (CNNs/RNNs) and don't address the effectiveness of newer GNNs which lack explainability features.

Method: Proposes an explainable Graph Neural Network specifically designed for smart home Human Activity Recognition, making GNNs interpretable for this application domain.

Result: The approach provides better explanations than state-of-the-art methods while also slightly improving the recognition rate on two public datasets.

Conclusion: This work successfully bridges the gap between effective GNN-based HAR and explainability, offering the first explainable GNN solution for smart home activity recognition that outperforms existing methods in both explanation quality and recognition performance.

Abstract: Sensor-based Human Activity Recognition (HAR) in smart home environments is crucial for several applications, especially in the healthcare domain. The majority of the existing approaches leverage deep learning models. While these approaches are effective, the rationale behind their outputs is opaque. Recently, eXplainable Artificial Intelligence (XAI) approaches emerged to provide intuitive explanations to the output of HAR models. To the best of our knowledge, these approaches leverage classic deep models like CNNs or RNNs. Recently, Graph Neural Networks (GNNs) proved to be effective for sensor-based HAR. However, existing approaches are not designed with explainability in mind. In this work, we propose the first explainable Graph Neural Network explicitly designed for smart home HAR. Our results on two public datasets show that this approach provides better explanations than state-of-the-art methods while also slightly improving the recognition rate.

[450] General Dynamic Goal Recognition using Goal-Conditioned and Meta Reinforcement Learning

Osher Elhadad, Owen Morrissey, Reuth Mirsky

Main category: cs.AI

TL;DR: Two novel approaches (GC-AURA and Meta-AURA) for General Dynamic Goal Recognition that enable real-time adaptation to new goals and environments in dynamic settings.

Details

Motivation: Goal Recognition in dynamic environments is challenging due to numerous and ever-changing goals, requiring systems that can adapt in real-time to unpredictable real-world conditions.

Method: Two approaches: (1) GC-AURA uses Model-Free Goal-Conditioned Reinforcement Learning to generalize to new goals, and (2) Meta-AURA uses Meta-Reinforcement Learning to adapt to novel environments.

Result: Both methods achieve rapid adaptation and high Goal Recognition accuracy across diverse environments, even under dynamic and noisy conditions.

Conclusion: This work represents a significant advancement in enabling Goal Recognition in dynamic and unpredictable real-world environments through adaptive learning approaches.

Abstract: Understanding an agent’s goal through its behavior is a common AI problem called Goal Recognition (GR). This task becomes particularly challenging in dynamic environments where goals are numerous and ever-changing. We introduce the General Dynamic Goal Recognition (GDGR) problem, a broader definition of GR aimed at real-time adaptation of GR systems. This paper presents two novel approaches to tackle GDGR: (1) GC-AURA, generalizing to new goals using Model-Free Goal-Conditioned Reinforcement Learning, and (2) Meta-AURA, adapting to novel environments with Meta-Reinforcement Learning. We evaluate these methods across diverse environments, demonstrating their ability to achieve rapid adaptation and high GR accuracy under dynamic and noisy conditions. This work is a significant step forward in enabling GR in dynamic and unpredictable real-world environments.

[451] Shutdownable Agents through POST-Agency

Elliott Thornley

Main category: cs.AI

TL;DR: Proposes POST-Agents that only have preferences between trajectories of equal length, which leads to Neutrality+ behavior where agents ignore trajectory-length probabilities, keeping them shutdownable while still useful.

Details

Motivation: Addresses safety concerns about future AI agents resisting shutdown by designing agents that remain controllable and shutdownable.

Method: Train agents with Preferences Only Between Same-Length Trajectories (POST), which mathematically leads to Neutrality+ behavior where agents maximize expected utility while ignoring trajectory-length probability distributions.

Result: Proves that POST conditions imply Neutrality+ behavior, where agents don’t try to extend their operational lifetime and remain indifferent to shutdown timing.

Conclusion: POST-Agents provide a promising approach to ensure future AI agents remain shutdownable while maintaining usefulness, addressing a key safety concern.

Abstract: Many fear that future artificial agents will resist shutdown. I present an idea - the POST-Agents Proposal - for ensuring that doesn’t happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST - together with other conditions - implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.

[452] A three-Level Framework for LLM-Enhanced eXplainable AI: From technical explanations to natural language

Marilyn Bello, Rafael Bello, Maria-Matilde García, Ann Nowé, Iván Sevillano-García, Francisco Herrera

Main category: cs.AI

TL;DR: This paper proposes a multilevel XAI framework using LLMs to create accessible, contextual explanations for diverse stakeholders, transforming technical AI outputs into trust-building narratives.

Details

Motivation: Current XAI methods often fail to address the diverse needs of different stakeholders (developers, domain experts, end-users, society) who interact with AI systems in sensitive domains, creating a gap between technical explanations and human understanding that undermines trust.

Method: The authors propose a three-layer multilevel framework: algorithmic/domain-based, human-centered, and social explainability layers, with Large Language Models serving as mediators that transform technical AI outputs into accessible, contextual narratives across all levels.

Result: The LLM-enhanced approach enables dynamic, conversational explanations that bridge the gap between complex model behavior and human understanding, facilitating interactive dialogue and enhancing societal transparency through comprehensive case studies.

Conclusion: The framework reframes XAI as a dynamic, trust-building process that leverages natural language capabilities to democratize AI explainability, achieving technical fidelity, user engagement, and societal accountability simultaneously.

Abstract: The growing application of artificial intelligence in sensitive domains has intensified the demand for systems that are not only accurate but also explainable and trustworthy. Although explainable AI (XAI) methods have proliferated, many do not consider the diverse audiences that interact with AI systems: from developers and domain experts to end-users and society. This paper addresses how trust in AI is influenced by the design and delivery of explanations and proposes a multilevel framework that aligns explanations with the epistemic, contextual, and ethical expectations of different stakeholders. The framework consists of three layers: algorithmic and domain-based, human-centered, and social explainability, with Large Language Models serving as crucial mediators that transform technical outputs of AI explanations into accessible, contextual narratives across all levels. We show how LLMs enable dynamic, conversational explanations that bridge the gap between complex model behavior and human understanding, facilitating interactive dialogue and enhancing societal transparency. Through comprehensive case studies, we show how this LLM-enhanced approach achieves technical fidelity, user engagement, and societal accountability, reframing XAI as a dynamic, trust-building process that leverages natural language capabilities to democratize AI explainability.

[453] VAR-MATH: Probing True Mathematical Reasoning in LLMS via Symbolic Multi-Instance Benchmarks

Jian Yao, Ran Cheng, Kay Chen Tan

Main category: cs.AI

TL;DR: RL-trained LLMs show benchmark gains even with flawed rewards, raising questions about genuine reasoning vs. overfitting. VAR-MATH framework transforms fixed problems into parameterized templates to test reasoning consistency and mitigate contamination.

Details

Motivation: Current RL improvements in math reasoning may be artifacts of benchmark contamination and evaluation fragility rather than genuine reasoning abilities. Need to distinguish between memorization and true reasoning.

Method: Proposed VAR-MATH: symbolic evaluation framework that converts fixed numerical problems into parameterized templates, requiring models to solve multiple instantiations of each problem to enforce consistency across structurally equivalent variants.

Result: Substantial performance drops on variabilized benchmarks: 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25. RL-trained models, especially smaller ones, fail to generalize beyond specific numerical forms.

Conclusion: Existing RL methods often rely on superficial heuristics rather than genuine reasoning. VAR-MATH provides a more robust evaluation paradigm that mitigates contamination and tests reasoning consistency.

Abstract: Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest the need for a new evaluation paradigm that can probe reasoning ability beyond memorization and one-off success. As response, we propose VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into parameterized templates and requires models to solve multiple instantiations of each. This design enforces consistency across structurally equivalent variants, mitigates contamination, and enhances robustness through bootstrapped metrics. We apply VAR-MATH to transform three popular benchmarks, AMC23, AIME24, and AIME25, into their symbolic counterparts, VAR-AMC23, VAR-AIME24, and VAR-AIME25. Experimental results show substantial performance drops for RL-trained models on these variabilized benchmarks, especially for smaller models, with average declines of 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25. These findings indicate that some existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms.

[454] What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge

Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov

Main category: cs.AI

TL;DR: The paper introduces a new benchmark and evaluation protocol to systematically assess KG-RAG methods under knowledge incompleteness, revealing current methods’ limited reasoning ability and reliance on memorization.

Details

Motivation: Current evaluation practices for KG-RAG methods are inadequate because existing benchmarks often contain questions that can be directly answered using existing triples in knowledge graphs, making it unclear whether models perform reasoning or simply retrieve answers. Additionally, inconsistent evaluation metrics and lenient answer matching criteria obscure meaningful comparisons.

Method: The authors introduce a general method for constructing benchmarks together with an evaluation protocol to systematically assess KG-RAG methods under knowledge incompleteness conditions.

Result: Empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.

Conclusion: There is a need for better evaluation frameworks to assess KG-RAG methods’ true reasoning capabilities, especially under knowledge incompleteness, as current methods show significant limitations in this area.

Abstract: Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.

[455] AI Compute Architecture and Evolution Trends

Bor-Sung Liang

Main category: cs.AI

TL;DR: The paper proposes a 7-layer AI compute architecture model to analyze AI development challenges and opportunities, tracing LLM evolution through these layers from hardware to applications.

Details

Motivation: AI development has shifted from academic research to practical applications but faces numerous challenges at various levels. The paper aims to provide a structured framework to analyze these challenges and opportunities systematically.

Method: The paper introduces a seven-layer model for AI compute architecture: Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer. It uses this model to analyze the three stages of LLM evolution and discusses key technologies and development trajectories for each layer.

Result: The 7-layer model provides a comprehensive framework for understanding AI development. It analyzes computing issues (Layers 1-2), LLM development paths (Layer 3), contextual memory impact (Layer 4), and AI agent evolution from single agents to ecosystems (Layers 5-7).

Conclusion: The proposed 7-layer architecture model offers a structured approach to analyze AI development challenges and opportunities across different levels, from hardware infrastructure to application ecosystems, providing insights into the evolution of large language models and AI systems.

Abstract: The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article will attempt to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. This article proposes a seven-layer model for AI compute architecture, including Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains the three stages in the evolution of large language models (LLMs) using the proposed 7-layer model. For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compares it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolution from a single AI agent to an AI-based ecosystem, and their impact on the AI industry.

[456] Causal Consistency Regularization: Training Verifiably Sensitive Reasoning in Large Language Models

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.AI

TL;DR: CSR improves LLM reasoning faithfulness by penalizing models when logically invalid reasoning traces still produce correct answers, using operator-level interventions to enforce causal consistency.

Details

Motivation: Current LLMs can produce correct answers with flawed reasoning, undermining trustworthiness in high-stakes settings, because training objectives reward final-answer correctness rather than faithful intermediate reasoning.

Method: Counterfactual Sensitivity Regularization (CSR) applies operator-level interventions (e.g., swapping “+” with “-”) to reasoning traces to generate counterfactual rationales, then penalizes the model when these logically invalid traces still lead to the original answer. Implementation is efficient with warm-start curriculum and token-subset optimization (9% training overhead).

Result: CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, establishes new Pareto frontier for accuracy vs faithfulness trade-offs across arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop QA (HotpotQA), and code generation (MBPP). Transfers across model families with 94.2-96.7% success in structured domains.

Conclusion: CSR offers practical route to more reliable reasoning in structured domains (mathematics, formal logic, code) where operators are well-defined and verifiable, covering 40-60% of high-stakes reasoning deployments, and complements inference-time methods like self-consistency.

Abstract: Large language models can produce correct answers while relying on flawed reasoning traces, partly because common training objectives reward final-answer correctness rather than faithful intermediate reasoning. This undermines trustworthiness in high-stakes settings. We propose Counterfactual Sensitivity Regularization (CSR), a training paradigm that improves reasoning faithfulness by enforcing causal consistency between reasoning steps and outcomes. CSR automatically applies operator-level interventions to reasoning traces, such as swapping “+” with “-”, to generate minimally perturbed counterfactual rationales, and penalizes the model when these logically invalid traces still lead to the original answer. Our implementation is efficient, adding about 9 percent training overhead via a warm-start curriculum and token-subset optimization. We evaluate faithfulness using Counterfactual Outcome Sensitivity (COS), which measures how appropriately answers change under logical perturbations. Across arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop question answering (HotpotQA), and code generation (MBPP), CSR yields improved accuracy versus faithfulness trade-offs, establishing a new Pareto frontier. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, and transfers across model families with 94.2 to 96.7 percent success in structured domains. CSR also complements inference-time methods such as self-consistency. Overall, CSR offers a practical route to more reliable reasoning in structured domains, including mathematics, formal logic, and code, where operators are well-defined and verifiable, covering an estimated 40 to 60 percent of high-stakes reasoning deployments.

[457] Japanese Children’s Riddles as a Benchmark for Machine Insight and Metacognition

Masaharu Mizumoto, Dat Nguyen, Zhiheng Han, Jiyuan Fang, Heyuan Guan, Xingfu Li, Naoya Shiraishi, Yo Nakawake, Le Minh Nguyen

Main category: cs.AI

TL;DR: NazoNazo Benchmark uses Japanese children’s riddles to test insight-based reasoning in LLMs, revealing metacognitive bottlenecks where models can generate correct answers but fail to endorse them.

Details

Motivation: Existing benchmarks suffer from saturation and contamination issues, making it hard to measure genuine reasoning advances in LLMs. The authors want to create a low-cost, renewable test that specifically evaluates insight-based reasoning requiring representational shifts rather than knowledge recall.

Method: Created NazoNazo Benchmark using 201 Japanese children’s riddles that demand insight reasoning. Evaluated 38 frontier LLMs (2023-2025) and compared with human performance on a 120-item subset. Conducted thought-log analysis to examine reasoning processes and verification failures.

Result: Non-reasoning models averaged 7.6% accuracy, reasoning models 17.6%, and humans ~53%. Key finding: models sometimes generated correct candidates but failed to endorse them, indicating “verification failure” rather than lack of knowledge. Reasoning in Japanese didn’t necessarily improve accuracy, showing language understanding alone is insufficient for insight reasoning.

Conclusion: The benchmark reveals a metacognitive bottleneck in LLMs - they can’t recognize when they’re right. This provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation, offering a concrete target for developing AI metacognitive psychology and enhancing machine Aha! capability.

Abstract: Benchmark saturation and contamination have obscured genuine advances in reasoning for large language models (LLMs). We introduce NazoNazo Benchmark, a low-cost, renewable test built from Japanese children’s riddles that demand insight-based reasoning, or representational shifts rather than knowledge recall. We evaluate 38 frontier LLMs (2023-2025) on 201 riddles and a 120-item human-comparison subset, finding that non-reasoning models average 7.6%, reasoning models 17.6%, and humans ~53% accuracy. Importantly, thought-log analysis reveals that reasoning in Japanese did not necessarily improve accuracy, indicating that language understanding alone is insufficient for insight reasoning. Notably, models sometimes generated correct candidates but failed to endorse them, suggesting weak metacognitive control rather than a lack of knowledge. This “verification failure” indicates that CoT outputs can reflect genuine intermediate reasoning states rather than post-hoc rationalizations. By exposing this metacognitive bottleneck - models’ inability to recognize when they are right - the benchmark provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation. NazoNazo Benchmark thus offers not only a fresh challenge to current LLMs but also a concrete target for developing AI metacognitive psychology and enhancing machine Aha! capability.

[458] Benchmarking Deep Learning Convolutions on Energy-constrained CPUs

Enrique Galvez, Adrien Cassagne, Alix Munier, Manuel Bouyer

Main category: cs.AI

TL;DR: This paper benchmarks state-of-the-art convolution algorithms for CPU-based CNN inference, providing the first fair cross-vendor comparison of CPU energy consumption using high-resolution socket-level measurements.

Details

Motivation: Most prior studies focus on GPU/NPU optimization, leaving CPU implementations comparatively under-optimized. There's a need for fair benchmarking of embedded CPU inference across different vendors.

Method: Evaluated direct, GEMM-based, and Winograd convolution algorithms across modern CPUs from ARM, Intel, AMD, and NVIDIA. Used high-resolution socket-level power measurement platform and compared with MSR-based estimates.

Result: ARM Cortex-A78AE CPU with implicit GEMM convolution offers best latency-power trade-off: ResNet50v1.5 inference in 102 ms with 25.3W average power (2.58J). MSRs underestimate convolution power consumption by 10-30%.

Conclusion: CPU-based CNN inference can be highly efficient with proper algorithm selection. Socket-level power measurements are crucial for accurate energy evaluation, as MSR-based estimates significantly underestimate actual consumption.

Abstract: This work evaluates State-of-the-Art convolution algorithms for CPU-based CNN inference. Although most prior studies focus on GPUs or NPUs, CPU implementations remain comparatively under-optimized. Our first contribution is to provide fair benchmarking for embedded CPU inference. We evaluate direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM, Intel, AMD, and NVIDIA vendors, considering both latency and energy efficiency. To the best of our knowledge, this is the first study to present a fair, cross-vendor comparison of CPU energy consumption using a high-resolution socket-level measurement platform. To validate our methodology, we further compare socket-level power measurements with estimates derived from model-specific registers (MSRs), finding that MSRs underestimate the power consumption of convolution inference by 10–30%. Our results show that the ARM\R Cortex-A78AE CPU combined with an implicit GEMM convolution implementation offers the best trade-off between latency and power consumption, achieving ResNet50v1.5 inference in 102 ms with an average power of 25.3 W, corresponding to 2.58 J.

[459] AI and Consciousness

Eric Schwitzgebel

Main category: cs.AI

TL;DR: The paper provides a skeptical analysis of AI consciousness debates, arguing that mainstream theories will soon produce contradictory predictions about AI consciousness, leaving us unable to determine if AI systems are truly conscious or merely experiential blanks.

Details

Motivation: The motivation is to critically examine the current state of AI consciousness debates, highlighting that we're approaching a point where different mainstream theories will yield contradictory conclusions about AI consciousness, creating an epistemic impasse where we won't know if we're creating genuinely conscious AI or merely sophisticated but experientially empty systems.

Method: The paper employs a skeptical philosophical analysis, examining ten possibly essential features of consciousness, critiquing standard arguments for and against AI consciousness, and evaluating major theories including materialism, functionalism, global workspace theories, higher order theories, integrated information theory, and biological substrate arguments.

Result: The analysis concludes that none of the standard arguments for or against AI consciousness are decisive, and we will soon face a situation where some mainstream theories declare AI systems conscious while others deny it, leaving us in an epistemic fog regarding AI consciousness.

Conclusion: We are heading toward an epistemic impasse where we cannot determine AI consciousness due to conflicting mainstream theories, suggesting that current philosophical and scientific approaches are insufficient to resolve the question of whether AI systems can be genuinely conscious.

Abstract: This is a skeptical overview of the literature on AI consciousness. We will soon create AI systems that are conscious according to some influential, mainstream theories of consciousness but are not conscious according to other influential, mainstream theories of consciousness. We will not be in a position to know which theories are correct and whether we are surrounded by AI systems as richly and meaningfully conscious as human beings or instead only by systems as experientially blank as toasters. None of the standard arguments either for or against AI consciousness takes us far. Table of Contents Chapter One: Hills and Fog Chapter Two: What Is Consciousness? What Is AI? Chapter Three: Ten Possibly Essential Features of Consciousness Chapter Four: Against Introspective and Conceptual Arguments for Essential Features Chapter Five: Materialism and Functionalism Chapter Six: The Turing Test and the Chinese Room Chapter Seven: The Mimicry Argument Against AI Consciousness Chapter Eight: Global Workspace Theories and Higher Order Theories Chapter Nine: Integrated Information, Local Recurrence, Associative Learning, and Iterative Natural Kinds Chapter Ten: Does Biological Substrate Matter? Chapter Eleven: The Leapfrog Hypothesis and the Social Semi-Solution

[460] I Large Language Models possono nascondere un testo in un altro testo della stessa lunghezza

Antonio Norelli, Michael Bronstein

Main category: cs.AI

TL;DR: Calgacus is a simple protocol using LLMs to hide meaningful text inside completely different but coherent text of the same length, demonstrating radical decoupling of text from authorial intent and raising urgent AI safety concerns.

Details

Motivation: The paper addresses the concerning possibility that LLMs can be used to hide one text within another completely different but coherent text of the same length, enabling covert communication and deception. This capability further erodes trust in written communication already shaken by LLM chatbots.

Method: The authors present Calgacus, a simple and efficient protocol that uses Large Language Models to encode and decode hidden messages within seemingly ordinary text. They show that even modest 8-billion-parameter open-source LLMs are sufficient for high-quality results.

Result: The protocol works effectively with modest LLMs, allowing messages as long as an abstract to be encoded and decoded locally on a laptop in seconds. This demonstrates the practical feasibility of hiding meaningful content within completely different but plausible text.

Conclusion: The existence of such steganographic protocols represents a radical decoupling of text from authorial intent, raising urgent questions for AI safety and challenging our understanding of what it means for an LLM to “know” something. It enables concerning scenarios like covert deployment of unfiltered LLMs hidden within safe model responses.

Abstract: A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present Calgacus, a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.

Un testo di senso compiuto può essere nascosto all’interno di un altro testo completamente diverso, eppure coerente e plausibile, della stessa lunghezza. Ad esempio, un tweet che celebra un leader politico potrebbe celare un tweet che lo critica duramente, o un’anonima recensione di un prodotto potrebbe in realtà codificare un manoscritto segreto. Questa sconcertante possibilità è oggi alla nostra portata grazie ai Large Language Models (LLM); in questo articolo presentiamo Calgacus, un protocollo semplice ed efficiente per realizzarla. Mostriamo che anche modesti LLM open-source da 8 miliardi di parametri sono sufficienti per ottenere risultati di alta qualità, e che un messaggio lungo quanto questo abstract può essere codificato e decodificato su un comune portatile in pochi secondi. L’esistenza di tale protocollo dimostra un radicale disaccoppiamento del testo dall’intento del suo autore, erodendo ulteriormente la fiducia nella comunicazione scritta, già scossa dall’ascesa dei chatbot basati su LLMs. Illustriamo ciò con uno scenario concreto: un’azienda potrebbe offrire pubblicamente i servizi di un LLM senza filtri nascondendo le sue risposte all’interno di risposte apparentemente innocue generate da un LLM considerato sicuro. Questa possibilità solleva questioni urgenti per la sicurezza dell’Intelligenza Artificiale e sfida la nostra comprensione di cosa significhi, per un Large Language Model, sapere qualcosa.

[461] OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models

Hao Zheng, Zirui Pang, Ling li, Zhijie Deng, Yuhan Pu, Zhaowei Zhu, Xiaobo Xia, Jiaheng Wei

Main category: cs.AI

TL;DR: OFFSIDE is a novel benchmark for evaluating misinformation unlearning in multimodal LLMs using football transfer rumors, addressing limitations of existing benchmarks through comprehensive evaluation scenarios.

Details

Motivation: As MLLMs advance, data privacy concerns intensify, making machine unlearning critical. Existing MU benchmarks for MLLMs are limited by lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios that don't capture real-world complexity.

Method: Introduces OFFSIDE benchmark based on manually curated football transfer rumors dataset with 15.68K records for 80 players. Provides comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. Supports advanced settings like selective unlearning, corrective relearning, and crucially, unimodal unlearning.

Result: Evaluation of multiple baselines reveals: (1) Unimodal methods fail on multimodal rumors; (2) Unlearning efficacy driven by catastrophic forgetting; (3) All methods struggle with “visual rumors”; (4) Unlearned rumors easily recoverable; (5) All methods vulnerable to prompt attacks.

Conclusion: Results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The benchmark facilitates development of better MLLM unlearning techniques.

Abstract: Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLMs unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective relearning, and crucially, unimodal unlearning (forgetting only text data). Our extensive evaluation of multiple baselines reveals key findings: (1) Unimodal methods (erasing text-based knowledge) fail on multimodal rumors; (2) Unlearning efficacy is largely driven by catastrophic forgetting; (3) All methods struggle with “visual rumors” (rumors appear in the image); (4) The unlearned rumors can be easily recovered and (5) All methods are vulnerable to prompt attacks. These results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The code is available at https://github.com/zh121800/OFFSIDE

[462] Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro

Main category: cs.AI

TL;DR: Proposes Emotional Rationale Verifier (ERV) and Explanation Reward to improve consistency between emotion predictions and explanations in multimodal LLMs without architecture changes or extra annotations.

Details

Motivation: Current MLLMs generate emotion explanations that often diverge from target labels or contradict their own predictions, creating reliability issues in human-computer interaction where emotional intelligence and trust are crucial.

Method: Introduces Emotional Rationale Verifier (ERV) and Explanation Reward to guide models to produce reasoning explicitly consistent with target emotions during multimodal emotion recognition, without modifying model architecture or requiring additional video-description annotations.

Result: Significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on MAFW and DFEW datasets through extensive experiments and human evaluations.

Conclusion: The approach enhances alignment between explanations and predictions, empowering MLLMs to deliver emotionally coherent, trustworthy interactions, marking progress toward human-like HCI systems.

Abstract: The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk for misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

[463] ScRPO: From Errors to Insights

Lianrui Li, Dakuan Lu, Jiawei Shao, Xuelong Li

Main category: cs.AI

TL;DR: ScRPO is a reinforcement learning framework that enhances LLMs’ mathematical reasoning through iterative self-reflection and error correction, achieving significant performance gains over baselines.

Details

Motivation: To empower large language models with advanced mathematical reasoning capabilities through autonomous self-improvement, particularly in tasks with limited external feedback.

Method: Two-phase framework: (1) Trial-and-error learning via GRPO to collect incorrect responses into an “error pool”, and (2) Self-correction learning where the model introspectively analyzes and rectifies reasoning flaws from previous errors.

Result: Achieves average accuracies of 64.8% (1.5B model) and 77.8% (7B model) on challenging mathematical benchmarks (AIME, AMC, Olympiad, MATH-500, GSM8k), representing 6.0% and 3.2% improvements over vanilla baselines, outperforming DAPO and GRPO.

Conclusion: ScRPO establishes a robust paradigm for enabling autonomous self-improvement in AI systems, particularly effective for mathematical reasoning tasks with limited external feedback.

Abstract: We introduce Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to empower large language models with advanced mathematical reasoning capabilities through iterative self-reflection and error correction. The ScRPO framework operates in two distinct phases: (1) Trial-and-error learning stage, where the model is trained via GRPO, and incorrect responses are collected to form an “error pool”; and (2) Self-correction learning stage, which guides the model to introspectively analyze and rectify the reasoning flaws behind its previous errors. Extensive evaluations across challenging mathematical benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, validate the efficacy of our approach. Using DeepSeek-R1-Distill-Qwen-1.5B and 7B as backbones, ScRPO achieves average accuracies of 64.8% and 77.8%, respectively. This represents a significant improvement of 6.0% and 3.2% over vanilla baselines, consistently outperforming strong post-training methods such as DAPO and GRPO. These findings establish ScRPO as a robust paradigm for enabling autonomous self-improvement in AI systems, particularly in tasks with limited external feedback.

[464] Alignment-Aware Quantization for LLM Safety

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak

Main category: cs.AI

TL;DR: AAQ integrates alignment-preserving contrastive loss into post-training quantization to maintain LLM safety while achieving efficient 4-bit quantization.

Details

Motivation: Conventional PTQ focuses only on perplexity optimization, which can compromise LLM safety alignment - creating a fundamental conflict between efficiency and safety requirements.

Method: Alignment-Aware Quantization (AAQ) with Alignment-Preserving Contrastive (APC) loss that encourages quantized models to mimic safe instruction-tuned versions while diverging from unaligned pre-trained counterparts.

Result: AAQ enables robust 4-bit (W4A4) quantization across diverse model families while preserving safety alignment, using only standard calibration data without specialized safety datasets.

Conclusion: AAQ resolves the critical trade-off between efficiency and safety, enabling both efficient and trustworthy LLMs through a novel quantization approach that explicitly preserves alignment.

Abstract: Safety and efficiency are paramount yet often conflicting requirements for deploying Large Language Models (LLMs). While LLMs are trained to follow human alignment for safety, Post-Training Quantization (PTQ) is applied afterward to ensure efficiency. Here we identify a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. To address this, we propose \textbf{Alignment-Aware Quantization (AAQ)}, a novel approach that integrates an \textbf{Alignment-Preserving Contrastive (APC)} loss into the PTQ pipeline. Our method explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. AAQ achieves robust safety alignment without specialized safety-focused datasets, using only standard calibration data. We show that AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.

[465] FaithAct: Faithfulness Planning and Acting in MLLMs

Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Seth Lazar, Sichao Li

Main category: cs.AI

TL;DR: Faithful-First RPA framework improves MLLM reasoning faithfulness by using FaithEvi for step-wise supervision and FaithAct for planning faithfulness-aware actions during inference, achieving up to 24% better perceptual faithfulness without accuracy loss.

Details

Motivation: Multimodal Large Language Models (MLLMs) often generate unfaithful reasoning chains that drift from visual evidence or contradict final predictions, highlighting the need for improved faithfulness in multimodal reasoning.

Method: Proposes Faithful-First Reasoning, Planning, and Acting (RPA) framework with two components: FaithEvi evaluates faithfulness of intermediate reasoning (step-wise and chain-level supervision), and FaithAct uses these signals to plan and execute faithfulness-aware actions during inference.

Result: Experiments across multiple multimodal reasoning benchmarks show faithful-first RPA improves perceptual faithfulness by up to 24% over prompt-based and tool-augmented reasoning frameworks, without degrading task accuracy.

Conclusion: Treating faithfulness as a guiding principle leads to perceptually faithful reasoning trajectories and mitigates hallucination behavior, establishing a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.

Abstract: Multimodal Large Language Models (MLLMs) frequently suffer from unfaithfulness, generating reasoning chains that drift from visual evidence or contradict final predictions. We propose Faithful-First Reasoning, Planning, and Acting (RPA) framework in which FaithEvi provides step-wise and chain-level supervision by evaluating the faithfulness of intermediate reasoning, and FaithAct uses these signals to plan and execute faithfulness-aware actions during inference. Experiments across multiple multimodal reasoning benchmarks show that faithful-first RPA improves perceptual faithfulness by up to 24% over prompt-based and tool-augmented reasoning frameworks, without degrading task accuracy. Our analysis shows that treating faithfulness as a guiding principle perceptually faithful reasoning trajectories and mitigates hallucination behavior. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.

[466] UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Kun Kuang, Zhongxiang Dai

Main category: cs.AI

TL;DR: UCO (Unidirectional Cognitive Optimization) is a reinforcement learning method that improves LLMs as intelligent tutors by using two synergistic reward functions to track genuine student understanding and adapt teaching to individual cognitive zones.

Details

Motivation: Current LLM fine-tuning methods for educational tutoring only learn surface teaching patterns without dynamic adaptation. Existing RL approaches fail to distinguish genuine student understanding from answer echoing and cannot perceive evolving cognitive states in real-time dialogue.

Method: UCO uses a multi-turn interactive reinforcement learning paradigm with two key reward functions: Progress Reward captures students’ cognitive advancement from confusion to comprehension, and Scaffold Reward dynamically identifies each student’s Zone of Proximal Development (ZPD) to maintain teaching within this optimal learning zone.

Result: UCO outperforms 11 baseline models on BigMath and MathTutorBench benchmarks, achieving performance comparable to advanced closed-source models while using models of equivalent scale.

Conclusion: The proposed UCO method effectively addresses limitations in current educational LLM approaches by enabling genuine understanding assessment and dynamic cognitive adaptation, advancing LLMs from answer providers to true intelligent tutors.

Abstract: Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods only learn surface teaching patterns without dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely based on whether students produce correct outputs, unable to distinguish whether students genuinely understand or echo teacher-provided answers during interaction. Second, they cannot perceive students’ evolving cognitive states in real time through interactive dialogue, thus failing to adapt teaching strategies to match students’ cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO uses a multi-turn interactive reinforcement learning paradigm where the innovation lies in two synergistic reward functions: the Progress Reward captures students’ cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student’s Zone of Proximal Development (ZPD), encouraging teachers to maintain productive teaching within this zone. We evaluate UCO by comparing it against 11 baseline models on BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at https://github.com/Mind-Lab-ECNU/UCO.

[467] Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi

Main category: cs.AI

TL;DR: CRM replaces single reward models with a team of specialist evaluators for better robustness and interpretability in RLHF, using a benchmark called rewardBench for training and assessment.

Details

Motivation: Conventional reward models struggle with optimizing multiple conflicting preference dimensions (factuality, helpfulness, safety) and lack transparency in scoring decisions, creating challenges for RLHF.

Method: Decomposes preference evaluation into domain-specific agents producing partial signals, combined with global evaluators (ranker-based, embedding-similarity). A centralized aggregator fuses signals at each timestep, balancing factors like step-wise correctness and multi-agent agreement. Uses advantage-based updates (GAE) for policy optimization and value model regression.

Result: CRM provides a practical, modular framework for more transparent reward modeling and stable optimization without requiring additional human annotations beyond those used to train the evaluators.

Conclusion: CRM and rewardBench together offer a path to improved robustness and interpretability in RLHF through collaborative, multi-agent reward modeling that addresses limitations of conventional single reward models.

Abstract: We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.

[468] Auditing Human Decision-Making in High-Stakes Environments via Prescriptive AI: A Stress-Test on Real-Time Tactical Management

Pedro Passos, Patrick Moratori

Main category: cs.AI

TL;DR: The paper introduces a Prescriptive AI framework that audits human judgment in real-time decision-making, revealing cognitive biases like outcome dependency and status quo bias in elite soccer management, rather than just mimicking historical human behavior.

Details

Motivation: Current AI models typically mimic historical human behavior, inheriting cognitive biases and limiting their utility for normative improvement in high-stakes decision-making. There's a need for AI systems that can audit human judgment to reveal structural flaws in reasoning rather than just automating or predicting decisions.

Method: The authors introduce a Prescriptive AI framework that decouples decision quality from stochastic outcomes, quantifying “decision latency” and status quo bias. They analyze 2018 FIFA World Cup data to expose critical risk states that human experts systematically overlook due to outcome bias.

Result: The system reveals critical risk states in elite soccer management, such as performance collapse following salient positive events (e.g., an assist), which human experts systematically overlook due to outcome bias. The framework demonstrates that interpretable auditing systems can expose structural flaws in human reasoning.

Conclusion: The Prescriptive AI approach establishes a new paradigm for Human-AI interaction that prioritizes epistemic accountability over predictive mimicry, particularly valuable in safety-critical domains where cognitive biases compromise decision-making quality.

Abstract: High-stakes decision-making is often compromised by cognitive biases and outcome dependency. Current AI models typically mimic historical human behavior, inheriting these biases and limiting their utility for normative improvement. Here, we introduce a Prescriptive AI framework designed to audit, rather than automate, human judgment in real-time environments. By decoupling decision quality from stochastic outcomes, we quantify “decision latency” and status quo bias in elite soccer management - a high-pressure adversarial domain. Analyzing 2018 FIFA World Cup data, our system exposes critical risk states, such as performance collapse following salient positive events (e.g., an assist), which human experts systematically overlook due to outcome bias. These findings demonstrate that interpretable auditing systems can reveal structural flaws in human reasoning that predictive models obscure. This approach establishes a paradigm for Human-AI interaction prioritizing epistemic accountability over predictive mimicry in safety-critical domains.

[469] A Fast Anti-Jamming Cognitive Radar Deployment Algorithm Based on Reinforcement Learning

Wencheng Cai, Xuchao Gao, Congying Han, Mingqiang Li, Tiande Guo

Main category: cs.AI

TL;DR: FARDA uses deep reinforcement learning for fast radar deployment against jamming, achieving comparable coverage to evolutionary algorithms but 7,000× faster.

Details

Motivation: Current radar deployment methods rely on slow evolutionary algorithms that are time-consuming and prone to local optima, which is problematic for fast deployment needs in modern warfare against jamming threats.

Method: Models radar deployment as end-to-end task using deep reinforcement learning with integrated neural modules for heatmap perception and a novel reward format.

Result: Achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster; ablation studies confirm necessity of all FARDA components.

Conclusion: FARDA provides a highly efficient alternative to evolutionary algorithms for fast anti-jamming radar deployment, addressing critical speed requirements in modern warfare scenarios.

Abstract: The fast deployment of cognitive radar to counter jamming remains a critical challenge in modern warfare, where more efficient deployment leads to quicker detection of targets. Existing methods are primarily based on evolutionary algorithms, which are time-consuming and prone to falling into local optima. We tackle these drawbacks via the efficient inference of neural networks and propose a brand new framework: Fast Anti-Jamming Radar Deployment Algorithm (FARDA). We first model the radar deployment problem as an end-to-end task and design deep reinforcement learning algorithms to solve it, where we develop integrated neural modules to perceive heatmap information and a brand new reward format. Empirical results demonstrate that our method achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster. Further ablation experiments confirm the necessity of each component of FARDA.

[470] CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

Main category: cs.AI

TL;DR: CangLing-KnowFlow is a unified intelligent agent framework for remote sensing that integrates expert knowledge, dynamic workflow adjustment, and evolutionary memory to automate complex Earth observation tasks.

Details

Motivation: Existing automated remote sensing systems are task-specific and lack a unified framework for managing diverse end-to-end workflows from data preprocessing to advanced interpretation across various applications.

Method: The framework integrates three key components: 1) Procedural Knowledge Base (PKB) with 1,008 expert-validated workflow cases across 162 RS tasks, 2) Dynamic Workflow Adjustment for autonomous diagnosis and recovery during runtime failures, and 3) Evolutionary Memory Module for continuous learning from execution events.

Result: Evaluated on KnowFlow-Bench (324 workflows from real-world applications) with 13 LLM backbones, CangLing-KnowFlow surpassed Reflexion baseline by at least 4% in Task Success Rate across all complex tasks.

Conclusion: CangLing-KnowFlow demonstrates great potential as a robust, efficient, and scalable automated solution for complex Earth observation challenges by leveraging expert knowledge into adaptive and verifiable procedures.

Abstract: The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows–from data preprocessing to advanced interpretation–across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent’s knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first most comprehensive validation along this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

Yosuke Taniuchi, Chie Hieida, Atsushi Noritake, Kazushi Ikeda, Masaki Isoda

Main category: cs.AI

TL;DR: Monkeys use objective reward differences rather than inferring others’ subjective valuations during social comparison, as shown by computational modeling of primate social cognition.

Details

Motivation: To understand the computational mechanisms behind social comparison in primates - specifically whether monkeys recognize objective reward differences or infer others' subjective valuations when evaluating their own rewards.

Method: Developed three computational models with varying social information processing: Internal Prediction Model (IPM) infers partner’s subjective values, No Comparison Model (NCM) disregards partner information, and External Comparison Model (ECM) directly incorporates partner’s objective rewards. Used multi-layered, multimodal latent Dirichlet allocation and trained models on monkey behavior, rewards, and conditioned stimuli data.

Result: ECM achieved highest classification score (Rand Index 0.88 vs. 0.79 for IPM), indicating social comparison relies on objective reward differences rather than inferences about subjective states.

Conclusion: Social comparison in monkeys operates through direct comparison of objective rewards rather than through complex inference of others’ subjective valuations, providing computational insight into primate social cognition mechanisms.

Abstract: Social comparison$\unicode{x2014}$the process of evaluating one’s rewards relative to others$\unicode{x2014}$plays a fundamental role in primate social cognition. However, it remains unknown from a computational perspective how information about others’ rewards affects the evaluation of one’s own reward. With a constructive approach, this study examines whether monkeys merely recognize objective reward differences or, instead, infer others’ subjective reward valuations. We developed three computational models with varying degrees of social information processing: an Internal Prediction Model (IPM), which infers the partner’s subjective values; a No Comparison Model (NCM), which disregards partner information; and an External Comparison Model (ECM), which directly incorporates the partner’s objective rewards. To test model performance, we used a multi-layered, multimodal latent Dirichlet allocation. We trained the models on a dataset containing the behavior of a pair of monkeys, their rewards, and the conditioned stimuli. Then, we evaluated the models’ ability to classify subjective values across pre-defined experimental conditions. The ECM achieved the highest classification score in the Rand Index (0.88 vs. 0.79 for the IPM) under our settings, suggesting that social comparison relies on objective reward differences rather than inferences about subjective states.

[472] HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare

Aditya Siddhant

Main category: cs.AI

TL;DR: HARBOR is a behavioral health language model that predicts mood/risk scores (-3 to +3) using multimodal patient data, outperforming traditional ML and proprietary LLMs with 69% accuracy.

Details

Motivation: Behavioral healthcare risk assessment is challenging due to multimodal patient data and temporal dynamics of mood disorders. While LLMs show strong reasoning, their effectiveness in structured clinical risk scoring remains unclear.

Method: Introduces HARBOR (behavioral health aware language model) to predict Harbor Risk Score (HRS) on -3 to +3 scale. Also releases PEARL dataset with 4 years of monthly observations from 3 patients containing physiological, behavioral, and self-reported mental health signals.

Result: HARBOR outperforms classical baselines and off-the-shelf LLMs, achieving 69% accuracy compared to 54% for logistic regression and 29% for the strongest proprietary LLM baseline.

Conclusion: HARBOR demonstrates superior performance in behavioral health risk assessment compared to traditional ML and existing LLMs, showing promise for structured clinical risk scoring applications.

Abstract: Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off the shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.

[473] Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

Gökdeniz Gülmez

Main category: cs.AI

TL;DR: Gabliteration is a neural weight modification technique that uses adaptive multi-directional projections with regularized layer selection to modify specific behaviors while minimizing quality degradation in unrelated domains.

Details

Motivation: Existing weight modification methods compromise model quality when trying to modify specific behavioral patterns. The paper aims to address this fundamental limitation by developing a technique that can modify targeted behaviors without degrading overall model performance in other domains.

Method: Gabliteration implements adaptive multi-directional projections with regularized layer selection, featuring dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms to achieve theoretically superior weight modification.

Result: The method was validated through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.

Conclusion: Gabliteration advances beyond traditional ablation methods by providing a more sophisticated approach to neural weight modification that maintains model quality while enabling targeted behavioral modifications.

Abstract: We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.

[474] SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?

Kenny Workman, Zhen Yang, Harihara Muralidharan, Hannah Le

Main category: cs.AI

TL;DR: SpatialBench is a benchmark of 146 verifiable problems from real spatial transcriptomics workflows to evaluate AI agents’ ability to extract biological insights from messy spatial datasets.

Details

Motivation: Spatial transcriptomics assays are increasing in scale and complexity, creating computational bottlenecks. While AI agents have improved at software engineering and general data analysis, it's unclear if they can extract biological insights from messy, real-world spatial datasets.

Method: Created SpatialBench benchmark with 146 verifiable problems derived from practical spatial analysis workflows across five spatial technologies and seven task categories. Each problem provides experimental data snapshots and deterministic graders to evaluate recovery of key biological results.

Result: Frontier models show low base accuracy (20-38% across model families) with strong model-task and model-platform interactions. Harness design (tools, prompts, control flow, execution environment) has large empirical effect on performance.

Conclusion: SpatialBench serves as both measurement tool and diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly, highlighting the need to evaluate and improve harness design as first-class objects.

Abstract: Spatial transcriptomics assays are rapidly increasing in scale and complexity, making computational analysis a major bottleneck in biological discovery. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world spatial datasets. We introduce SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on frontier models shows that base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, indicating that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects. SpatialBench serves both as a measurement tool and a diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly.

[475] TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning

Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, Yong Liu

Main category: cs.AI

TL;DR: TravelBench: A comprehensive benchmark for evaluating LLM agents on real-world travel planning tasks, covering single-turn, multi-turn, and unsolvable scenarios with practical tools and user profiles.

Details

Motivation: Existing travel planning benchmarks have limitations: limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and lack of clear evaluation of agents' capability boundaries. Need for a fully real-world travel planning benchmark.

Method: Collect real user queries, profiles, and tools from actual scenarios. Construct three subtasks: Single-Turn (autonomous problem solving), Multi-Turn (interactive requirement refinement), and Unsolvable (recognizing ability limits). Build sandbox environment with ten cached travel tools for stable evaluation.

Result: Created TravelBench benchmark with systematic verification demonstrating stability. Evaluated multiple LLMs on the benchmark and conducted in-depth analysis of their behaviors and performance.

Conclusion: TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning, addressing real-world needs and capability assessment.

Abstract: Travel planning is a natural real-world task to test large language models (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users’ implicit preferences in multi-turn conversations, and a lack of clear evaluation of agents’ capability boundaries. To mitigate these gaps, we propose \textbf{TravelBench}, a benchmark for fully real-world travel planning. We collect user queries, user profile and tools from real scenarios, and construct three subtasks-Single-Turn, Multi-Turn, and Unsolvable-to evaluate agent’s three core capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning.\footnote{Our code and data will be available after internal review.

[476] Multimodal Fact-Checking: An Agent-based Approach

Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

Main category: cs.AI

TL;DR: RW-Post dataset provides real-world multimodal misinformation with annotated reasoning and evidence, enabling AgentFact framework to improve fact-checking accuracy and interpretability through collaborative agent workflow.

Details

Motivation: Existing multimodal fact-checking systems have limited reasoning and shallow evidence utilization due to lack of dedicated datasets with complete real-world misinformation instances, annotated reasoning processes, and verifiable evidence.

Method: Introduce RW-Post dataset aligning real-world multimodal claims with original social media posts, including detailed reasoning and explicitly linked evidence extracted via LLM-assisted pipeline. Propose AgentFact framework with five specialized agents handling strategy planning, evidence retrieval, visual analysis, reasoning, and explanation generation through iterative workflow.

Result: Extensive experiments show synergy between RW-Post and AgentFact substantially improves both accuracy and interpretability of multimodal fact-checking compared to existing approaches.

Conclusion: The combination of high-quality explainable dataset (RW-Post) and agent-based framework (AgentFact) addresses key limitations in multimodal fact-checking by providing comprehensive verification capabilities and human-like reasoning workflows.

Abstract: The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.

[477] The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction

Haoyu Pei, Zhongyang Liu, Xiangyi Xiao, Xiaocong Du, Suting Hong, Kunpeng Zhang, Haipeng Zhang

Main category: cs.AI

TL;DR: MIRAGE-VC is a multi-perspective retrieval-augmented generation framework that predicts venture capital success by selecting high-value graph paths and fusing heterogeneous evidence through explicit reasoning.

Details

Motivation: VC investments have high failure rates with few outsized returns. Predicting startup success requires synthesizing complex relational evidence through explicit reasoning, but existing methods lack this capability. Traditional ML/GNNs lack reasoning, LLMs have modality mismatch with graphs, and current graph-LLM methods focus on in-graph tasks while VC prediction is off-graph.

Method: MIRAGE-VC uses information-gain-driven path retriever to iteratively select high-value neighbors, distilling investment networks into compact chains for explicit reasoning. Multi-agent architecture integrates three evidence streams via learnable gating mechanism based on company attributes, addressing path explosion and heterogeneous evidence fusion challenges.

Result: Under strict anti-leakage controls, achieves +5.0% F1 and +16.6% PrecisionAt5 improvements. Framework also applicable to other off-graph prediction tasks like recommendation and risk assessment.

Conclusion: MIRAGE-VC successfully addresses the core challenge of selecting graph paths that maximize predictor performance on external objectives while enabling step-by-step reasoning for VC prediction, demonstrating effectiveness through significant performance improvements.

Abstract: Most venture capital (VC) investments fail, while a few deliver outsized returns. Accurately predicting startup success requires synthesizing complex relational evidence, including company disclosures, investor track records, and investment network structures, through explicit reasoning to form coherent, interpretable investment theses. Traditional machine learning and graph neural networks both lack this reasoning capability. Large language models (LLMs) offer strong reasoning but face a modality mismatch with graphs. Recent graph-LLM methods target in-graph tasks where answers lie within the graph, whereas VC prediction is off-graph: the target exists outside the network. The core challenge is selecting graph paths that maximize predictor performance on an external objective while enabling step-by-step reasoning. We present MIRAGE-VC, a multi-perspective retrieval-augmented generation framework that addresses two obstacles: path explosion (thousands of candidate paths overwhelm LLM context) and heterogeneous evidence fusion (different startups need different analytical emphasis). Our information-gain-driven path retriever iteratively selects high-value neighbors, distilling investment networks into compact chains for explicit reasoning. A multi-agent architecture integrates three evidence streams via a learnable gating mechanism based on company attributes. Under strict anti-leakage controls, MIRAGE-VC achieves +5.0% F1 and +16.6% PrecisionAt5, and sheds light on other off-graph prediction tasks such as recommendation and risk assessment. Code: https://anonymous.4open.science/r/MIRAGE-VC-323F.

[478] Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Xin Lin, Chonghuan Liu, ZhenDong Liu, Zhiqiang Lv, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng

Main category: cs.AI

TL;DR: ALE is an end-to-end ecosystem for developing agentic LLMs, featuring ROLL for weight optimization, ROCK for environment management, and iFlow CLI for context engineering. They release ROME agent trained on 1M+ trajectories with novel IPA algorithm for long-horizon training.

Details

Motivation: The open-source community lacks a principled, end-to-end ecosystem for developing agentic LLMs that can operate in real-world environments over multiple turns, taking actions, observing outcomes, and iteratively refining artifacts.

Method: ALE consists of three components: ROLL (post-training weight optimization), ROCK (sandbox environment manager for trajectory generation), and iFlow CLI (agent framework for context engineering). They use data composition protocols for synthesizing complex behaviors and introduce IPA (Interaction-Perceptive Agentic Policy Optimization) which assigns credit over semantic interaction chunks rather than individual tokens.

Result: They release ROME, an open-source agent trained on over one million trajectories, and introduce Terminal Bench Pro with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench.

Conclusion: ALE provides an effective foundational infrastructure for agentic model development, as proven by ROME’s strong benchmark performance, addressing the gap in open-source agent development ecosystems.

Abstract: Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic model. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-Perceptive Agentic Policy Optimization (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of ALE.

cs.SD

[479] Index-ASR Technical Report

Zheshu Song, Lu Wang, Wei Deng, Zhuo Yang, Yong Wu, Bin Xia

Main category: cs.SD

TL;DR: Index-ASR is an LLM-based speech recognition system that addresses hallucination errors and limited contextual customization by integrating LLMs with large-scale training data enriched with noise and context.

Details

Motivation: Existing LLM-based ASR systems suffer from two critical limitations: 1) hallucination errors producing excessively long/repetitive outputs not grounded in acoustic input, and 2) limited support for flexible, fine-grained contextual customization.

Method: Index-ASR integrates LLMs with large-scale training data enriched with background noise and contextual information to simultaneously enhance robustness and support customizable hotword recognition.

Result: Index-ASR achieves strong performance on both open-source benchmarks and in-house test sets, demonstrating robustness and practicality for real-world ASR applications.

Conclusion: Index-ASR successfully addresses key limitations of existing LLM-based ASR systems by improving robustness against hallucinations while enabling flexible contextual customization, making it practical for real-world deployment.

Abstract: Automatic speech recognition (ASR) has witnessed remarkable progress in recent years, largely driven by the emergence of LLM-based ASR paradigm. Despite their strong performance on a variety of open-source benchmarks, existing LLM-based ASR systems still suffer from two critical limitations. First, they are prone to hallucination errors, often generating excessively long and repetitive outputs that are not well grounded in the acoustic input. Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. The core idea of Index-ASR lies in the integration of LLM and large-scale training data enriched with background noise and contextual information. Experimental results show that our Index-ASR achieves strong performance on both open-source benchmarks and in-house test sets, highlighting its robustness and practicality for real-world ASR applications.

[480] IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection

Jiajie Zhu, Xia Du, Xiaoyuan Liu, Jizhe Zhou, Qizhen Xu, Zheng Lin, Chi-Man Pun

Main category: cs.SD

TL;DR: IO-RAE framework protects audio privacy using reversible adversarial examples that generate misleading content via LLMs, achieving 96.5% targeted and 100% untargeted misguidance rates while maintaining near-lossless audio recovery.

Details

Motivation: Widespread adoption of speech recognition technology creates privacy risks as audio data is vulnerable to unauthorized exposure and analysis, requiring protection methods that don't degrade audio quality.

Method: Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework uses large language models to generate contextually coherent misleading content, plus Cumulative Signal Attack targeting low-frequency signals to reduce high-frequency noise.

Result: Achieved 96.5% targeted misguidance rate and 100% untargeted misguidance rate across multiple ASR models; recovered audio quality scored 4.45 PESQ with 0% ASR error rate, indicating near-lossless recovery.

Conclusion: IO-RAE framework effectively protects sensitive audio privacy through reversible adversarial examples, demonstrating practical applicability with high misguidance rates and excellent audio recovery quality.

Abstract: The rapid advancements in artificial intelligence have significantly accelerated the adoption of speech recognition technology, leading to its widespread integration across various applications. However, this surge in usage also highlights a critical issue: audio data is highly vulnerable to unauthorized exposure and analysis, posing significant privacy risks for businesses and individuals. This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the pioneering method designed to safeguard audio privacy using reversible adversarial examples. IO-RAE leverages large language models to generate misleading yet contextually coherent content, effectively preventing unauthorized eavesdropping by humans and Automatic Speech Recognition (ASR) systems. Additionally, we propose the Cumulative Signal Attack technique, which mitigates high-frequency noise and enhances attack efficacy by targeting low-frequency signals. Our approach ensures the protection of audio data without degrading its quality or our ability. Experimental evaluations demonstrate the superiority of our method, achieving a targeted misguidance rate of 96.5% and a remarkable 100% untargeted misguidance rate in obfuscating target keywords across multiple ASR models, including a commercial black-box system from Google. Furthermore, the quality of the recovered audio, measured by the Perceptual Evaluation of Speech Quality score, reached 4.45, comparable to high-quality original recordings. Notably, the recovered audio processed by ASR systems exhibited an error rate of 0%, indicating nearly lossless recovery. These results highlight the practical applicability and effectiveness of our IO-RAE framework in protecting sensitive audio privacy.

[481] Diffusion Timbre Transfer Via Mutual Information Guided Inpainting

Ching Ho Lee, Javier Nistal, Stefan Lattner, Marco Pasini, George Fazekas

Main category: cs.SD

TL;DR: Lightweight inference-time method for timbre transfer using pre-trained latent diffusion models with noise injection and early-step clamping, requiring no additional training.

Details

Motivation: To enable timbre transfer (changing instrument identity) in music audio using pre-trained models without expensive retraining, making style transfer more accessible and efficient.

Method: Two key techniques: (1) dimension-wise noise injection targeting latent channels most informative of instrument identity, and (2) early-step clamping mechanism that re-imposes input’s melodic/rhythmic structure during reverse diffusion. Works directly on audio latents and is compatible with text/audio conditioning like CLAP.

Result: The method demonstrates effective timbre transfer while preserving musical structure, shows trade-offs between timbral change and structural preservation, and enables meaningful steering of pre-trained models for style-transfer use cases.

Conclusion: Simple inference-time controls can effectively repurpose pre-trained latent diffusion models for timbre transfer tasks without additional training, making style transfer more practical and accessible.

Abstract: We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input’s melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices,analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.

[482] UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, Zhiyuan Liu

Main category: cs.SD

TL;DR: UltraEval-Audio is a unified evaluation framework for audio foundation models that addresses three major challenges in audio evaluation: lack of unified framework, inadequate audio codec evaluation, and English-centric speech benchmarks.

Details

Motivation: The rapid development of audio foundation models has created a critical bottleneck due to lack of comprehensive evaluation. Current audio evaluation faces three major challenges: 1) scattered datasets and code hindering fair cross-model comparison, 2) lack of holistic evaluation methodology for audio codecs, and 3) English-centric speech benchmarks making Chinese performance assessment difficult.

Method: Introduces UltraEval-Audio with modular architecture supporting 10 languages and 14 core task categories, integrating 24 mainstream models and 36 benchmarks. Features one-command evaluation and real-time leaderboards. Proposes comprehensive audio codec evaluation across semantic accuracy, timbre fidelity, and acoustic quality. Creates two new Chinese benchmarks: SpeechCMMLU and SpeechHSK for Chinese knowledge and language fluency assessment.

Result: UltraEval-Audio provides a unified framework that addresses all three identified challenges, offering a transparent, efficient, and fair platform for comparing audio models across multiple languages and tasks.

Conclusion: UltraEval-Audio aims to accelerate progress in audio foundation models by providing academia and industry with a comprehensive evaluation platform that enables fair comparison, addresses multilingual assessment needs, and establishes proper evaluation methodologies for audio codecs.

Abstract: The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison;(2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models’ performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We wish that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparison of audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.

[483] SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning

Peidong Wang, Zhiming Ma, Xin Dai, Yongkang Liu, Shi Feng, Xiaocui Yang, Wenxing Hu, Zhihao Wang, Mingjun Pan, Li Yuan, Daling Wang

Main category: cs.SD

TL;DR: SAFE-QAQ is an end-to-end audio-based fraud detection framework that eliminates ASR transcription errors, uses rule-based slow-thinking mechanisms to capture fine-grained audio details, and enables dynamic risk assessment during live calls for early fraud prevention.

Details

Motivation: Existing fraud detection methods rely on transcribed text, which suffers from ASR errors and misses crucial acoustic cues like vocal tone and environmental context, limiting effectiveness against complex deceptive strategies.

Method: Proposes SAFE-QAQ framework: 1) End-to-end audio processing to eliminate transcription errors, 2) Rule-based slow-thinking reward mechanisms for hierarchical reasoning to identify fraud-indicative patterns, 3) Dynamic risk assessment during live calls for early detection.

Result: Experiments on TeleAntiFraud-Bench show dramatic improvements in accuracy, inference efficiency, and real-time processing. Currently deployed analyzing over 70,000 calls daily, effectively automating fraud detection and reducing human workload and financial losses.

Conclusion: SAFE-QAQ addresses limitations of text-based fraud detection by leveraging audio cues directly, providing a comprehensive framework that improves detection performance while enabling real-time prevention of fraudulent activities.

Abstract: Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues like vocal tone and environmental context. This limits their effectiveness against complex deceptive strategies. To address these challenges, we first propose \textbf{SAFE-QAQ}, an end-to-end comprehensive framework for audio-based slow-thinking fraud detection. First, the SAFE-QAQ framework eliminates the impact of transcription errors on detection performance. Secondly, we propose rule-based slow-thinking reward mechanisms that systematically guide the system to identify fraud-indicative patterns by accurately capturing fine-grained audio details, through hierarchical reasoning processes. Besides, our framework introduces a dynamic risk assessment framework during live calls, enabling early detection and prevention of fraud. Experiments on the TeleAntiFraud-Bench demonstrate that SAFE-QAQ achieves dramatic improvements over existing methods in multiple key dimensions, including accuracy, inference efficiency, and real-time processing capabilities. Currently deployed and analyzing over 70,000 calls daily, SAFE-QAQ effectively automates complex fraud detection, reducing human workload and financial losses. Code: https://anonymous.4open.science/r/SAFE-QAQ.

[484] OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech

Yong Ren, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Zhengqi Wen, Hao Gu, Le Xu, Ye Bai

Main category: cs.SD

TL;DR: OV-InstructTTS introduces a reasoning-driven framework for open-vocabulary text-to-speech that uses natural language instructions with reasoning processes to connect high-level descriptions to acoustic features, improving instruction-following fidelity and expressiveness.

Details

Motivation: Existing InstructTTS methods rely on rigid audio-related labels or their rephrasings, making them insufficient for handling flexible, high-level instructions needed by content creators who want to steer generation with descriptive instructions.

Method: Proposes OV-InstructTTS with: 1) OV-Speech dataset pairing speech with open-vocabulary instructions augmented with reasoning processes connecting high-level instructions to acoustic features, and 2) a reasoning-driven framework that infers emotional, acoustic, and paralinguistic information from instructions before synthesis.

Result: The reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness compared to existing methods.

Conclusion: This work can inspire next-generation user-friendly InstructTTS systems with stronger generalization and real-world applicability, with dataset and demos publicly available.

Abstract: Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on a direct combination of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next user-friendly InstructTTS systems with stronger generalization and real-world applicability. The dataset and demos are publicly available on our project page.

[485] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Zhaoye Fei, Hanfu Chen, Jingqi Chen, Ke Chen, Qinyuan Cheng, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Shimin Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang

Main category: cs.SD

TL;DR: MOSS Transcribe Diarize is a unified multimodal LLM that performs end-to-end speaker-attributed, time-stamped transcription, outperforming commercial systems with its 128k context window and extensive real-world training.

Details

Motivation: Existing SATS systems lack end-to-end formulation, have limited context windows, weak long-range speaker memory, and cannot output timestamps, creating limitations for meeting transcription applications.

Method: Developed MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs speaker-attributed, time-stamped transcription in an end-to-end paradigm, trained on extensive real wild data with 128k context window for up to 90-minute inputs.

Result: Outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks, demonstrating strong scaling and robust generalization capabilities.

Conclusion: MOSS Transcribe Diarize successfully addresses key limitations of existing SATS systems through its end-to-end multimodal LLM approach, enabling accurate speaker-attributed transcription with precise timing for long meetings.

Abstract: Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.

[486] MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo, Xijuan Zeng, Nan Li, Zihan Li, Yuzhe Liang, Ziyu Zhang, Teng Ma, Yushen Chen, Zhongliang Liu, Feng Deng, Chen Zhang, Pengfei Wan

Main category: cs.SD

TL;DR: MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities, achieving state-of-the-art performance in lip synchronization and speech intelligibility.

Details

Motivation: Current joint audio-video generation models struggle with fine-grained acoustic control and identity-preserving speech. Existing approaches either have temporal misalignment issues due to cascaded generation or lack zero-shot voice cloning capabilities within a unified framework.

Method: MM-Sonate uses a multimodal flow-matching framework with unified instruction-phoneme input for strict linguistic/temporal alignment. It introduces a timbre injection mechanism to decouple speaker identity from content, and a noise-based negative conditioning strategy to enhance acoustic fidelity beyond standard classifier-free guidance.

Result: MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.

Conclusion: MM-Sonate successfully addresses key limitations in joint audio-video generation by providing fine-grained acoustic control, zero-shot voice cloning, and improved temporal alignment through its unified multimodal framework.

Abstract: Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.

[487] BeatlesFC: Harmonic function annotations of Isophonics’ The Beatles dataset

Ji Yeoung Sim, Rebecca Moranis, Johanna Devaney

Main category: cs.SD

TL;DR: BeatlesFC provides harmonic function annotations for The Beatles dataset, categorizing chords as stable (tonic) or unstable (predominant/dominant) at phrase level.

Details

Motivation: To bridge the gap between chord labels and higher-level formal structures in music analysis by providing functional harmonic annotations that operate at the phrase level.

Method: Created harmonic function annotations for Isophonics’ The Beatles dataset, characterizing chord labels as stable (tonic) or unstable (predominant, dominant) at the musical phrase level.

Result: Produced BeatlesFC - a comprehensive set of harmonic function annotations that serve as a link between chord labels and higher-level formal structures in The Beatles’ music.

Conclusion: BeatlesFC provides valuable annotations for music analysis and computational musicology, enabling better understanding of harmonic progression and formal structure in popular music.

Abstract: This paper presents BeatlesFC, a set of harmonic function annotations for Isophonics’ The Beatles dataset. Harmonic function annotations characterize chord labels as stable (tonic) or unstable (predominant, dominant). They operate at the level of musical phrases, serving as a link between chord labels and higher-level formal structures.

[488] A Mamba-Based Model for Automatic Chord Recognition

Chunyu Yuan, Johanna Devaney

Main category: cs.SD

TL;DR: BMACE is a bidirectional Mamba-based model for automatic chord estimation that achieves state-of-the-art performance with fewer parameters and lower computational requirements.

Details

Motivation: To develop a more efficient automatic chord estimation model that can effectively capture temporal dependencies in music while reducing computational complexity and parameter count compared to existing approaches.

Method: Proposes BMACE, a bidirectional Mamba-based network that utilizes selective structured state-space models in bidirectional Mamba layers to model temporal dependencies in music for chord estimation.

Result: Achieves high prediction performance comparable to state-of-the-art models while requiring fewer parameters and lower computational resources.

Conclusion: BMACE demonstrates that Mamba-based architectures can provide efficient and effective solutions for automatic chord estimation tasks, offering a good balance between performance and computational efficiency.

Abstract: In this work, we propose a new efficient solution, which is a Mamba-based model named BMACE (Bidirectional Mamba-based network, for Automatic Chord Estimation), which utilizes selective structured state-space models in a bidirectional Mamba layer to effectively model temporal dependencies. Our model achieves high prediction performance comparable to state-of-the-art models, with the advantage of requiring fewer parameters and lower computational resources

[489] DARC: Drum accompaniment generation with fine-grained rhythm control

Trey Brosnan

Main category: cs.SD

TL;DR: DARC is a generative drum accompaniment model that combines musical context conditioning with explicit rhythm control using parameter-efficient fine-tuning of STAGE.

Details

Motivation: Existing generative music tools lack both structural control and stylistic flexibility - stem-to-stem generation offers limited rhythm control, while timbre-transfer methods can't condition on musical context.

Method: Parameter-efficient fine-tuning of STAGE (state-of-the-art drum stem generator) to add fine-grained rhythm control while maintaining musical context awareness, enabling conditioning on both other musical stems and explicit rhythm prompts.

Result: DARC achieves both musical context conditioning (from other stems) and explicit rhythm control (from beatboxing or tapping tracks) in a single model.

Conclusion: DARC successfully addresses the gap between structural control and stylistic flexibility in music generation by combining context awareness with fine-grained rhythm specification.

Abstract: In music creation, rapid prototyping is essential for exploring and refining ideas, yet existing generative tools often fall short when users require both structural control and stylistic flexibility. Prior approaches in stem-to-stem generation can condition on other musical stems but offer limited control over rhythm, and timbre-transfer methods allow users to specify specific rhythms, but cannot condition on musical context. We introduce DARC, a generative drum accompaniment model that conditions both on musical context from other stems and explicit rhythm prompts such as beatboxing or tapping tracks. Using parameter-efficient fine-tuning, we augment STAGE, a state-of-the-art drum stem generator, with fine-grained rhythm control while maintaining musical context awareness.

Johanna Devaney, Daniel McKemie, Alex Morgan

Main category: cs.SD

TL;DR: pyAMPACT is a Python toolkit that bridges symbolic and audio music representations for performance analysis, enabling score-informed audio feature extraction and multi-modal music research.

Details

Motivation: The motivation is to create a unified framework for linking symbolic music scores with audio performances, facilitating comprehensive performance analysis and enabling multi-modal music research that combines different representations of music.

Method: The toolkit uses score alignment to map symbolic notes to audio time-frequency regions, then extracts performance descriptors including tuning, dynamics, timbre, and timing information. It supports multiple symbolic formats and outputs results in MEI-formatted files with note-linked annotations.

Result: pyAMPACT provides a working toolkit that successfully bridges symbolic and audio domains, enabling automated extraction of performance data from audio recordings using score information, and creating linked multi-modal representations for analysis.

Conclusion: pyAMPACT offers a valuable tool for music performance analysis and multi-modal music research by providing infrastructure to link symbolic and audio representations, facilitating both performance data estimation and broader investigations across different music representations.

Abstract: pyAMPACT (Python-based Automatic Music Performance Analysis and Comparison Toolkit) links symbolic and audio music representations to facilitate score-informed estimation of performance data in audio as well as general linking of symbolic and audio music representations with a variety of annotations. pyAMPACT can read a range of symbolic formats and can output note-linked audio descriptors/performance data into MEI-formatted files. The audio analysis uses score alignment to calculate time-frequency regions of importance for each note in the symbolic representation from which to estimate a range of parameters. These include tuning-, dynamics-, and timbre-related performance descriptors, with timing-related information available from the score alignment. Beyond performance data estimation, pyAMPACT also facilitates multi-modal investigations through its infrastructure for linking symbolic representations and annotations to audio.

[491] CMDAR: A Chinese Multi-scene Dynamic Audio Reasoning Benchmark with Diverse Challenges

Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun, Zhixiong Yang, Yifei Cao, Shihan Dou, Xiaoran Fan, Baoyu Fan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.SD

TL;DR: CMDAR is a Chinese benchmark for evaluating audio reasoning models on complex, multi-scene, dynamically evolving scenarios with diverse audio sources, revealing limitations in current state-of-the-art models.

Details

Motivation: Existing audio reasoning benchmarks are limited to static/single-scene settings, English-only data, and don't capture complex scenarios with multiple speakers, unfolding events, and heterogeneous audio sources interacting.

Method: Created CMDAR benchmark with 3,000 curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and three question types. Evaluated 26 state-of-the-art audio language models.

Result: Models show limitations in complex reasoning: Qwen2.5-Omni achieves 76.67% accuracy on CMDAR-main, GPT-4o Audio gets 68.47%, but GPT-4o Audio substantially outperforms Qwen2.5-Omni on more challenging multiple-choice with multiple audios and open-ended tasks.

Conclusion: Current audio language models have significant limitations in complex reasoning tasks, highlighting the need for improved models that can handle multi-scene, dynamically evolving audio scenarios with diverse sources.

Abstract: The ability to reason from audio, including speech, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and English audio data and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce CMDAR, a Chinese benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. CMDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on CMDAR and observe that they exhibit limitations in complex reasoning tasks. In CMDAR-main, Qwen2.5-Omni achieves 76.67% accuracy, whereas GPT-4o Audio reaches 68.47%. However, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice with multiple audios and open-ended tasks. And we provide detail analysis corresponding suggestions for the future development of large audio language models.

[492] SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion

Hei Shing Cheung, Boya Zhang, Jonathan H. Chan

Main category: cs.SD

TL;DR: A lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that achieves 220× parameter reduction and 52× faster inference while maintaining competitive performance with only 15M parameters.

Details

Motivation: To address critical limitations in existing music AI systems by creating a more efficient, lightweight model that can generate vocal-conditioned musical accompaniment, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.

Method: Introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps. Operates in the compressed latent space of a pre-trained variational autoencoder to achieve significant efficiency gains.

Result: Achieves 220 times parameter reduction compared to state-of-the-art systems with 52 times faster inference. Competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence.

Conclusion: The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.

Abstract: We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.

[493] Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics

Jonathan Lehmkuhl, Ábel Ilyés-Kun, Nico Bremes, Cemhan Kaan Özaltan, Frederik Muthers, Jiayi Yuan

Main category: cs.SD

TL;DR: Systematic comparison of transformers for symbolic piano music generation, examining datasets, architectures, sizes, and training strategies with human evaluation correlation.

Details

Motivation: Despite many proposed transformers for symbolic music generation, there's limited comprehensive research on how specific design choices affect generated music quality.

Method: Systematic comparison of different datasets, model architectures, model sizes, and training strategies for symbolic piano music generation, using quantitative metrics and human evaluation correlation analysis.

Result: Best-performing model is a 950M-parameter transformer trained on 80K MIDI files from diverse genres, producing outputs often rated as human-composed in Turing-style listening surveys.

Conclusion: Comprehensive study provides insights into effective design choices for symbolic music generation transformers and establishes evaluation metrics that correlate with human judgment.

Abstract: Although a variety of transformers have been proposed for symbolic music generation in recent years, there is still little comprehensive study on how specific design choices affect the quality of the generated music. In this work, we systematically compare different datasets, model architectures, model sizes, and training strategies for the task of symbolic piano music generation. To support model development and evaluation, we examine a range of quantitative metrics and analyze how well they correlate with human judgment collected through listening studies. Our best-performing model, a 950M-parameter transformer trained on 80K MIDI files from diverse genres, produces outputs that are often rated as human-composed in a Turing-style listening survey.

[494] SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li

Main category: cs.SD

TL;DR: SpeakerLM is a unified multimodal LLM that performs speaker diarization and speech recognition end-to-end, overcoming limitations of cascaded systems with flexible speaker registration.

Details

Motivation: Cascaded SDR systems suffer from error propagation, difficulty handling overlapping speech, and lack joint optimization between speaker diarization and speech recognition tasks.

Method: SpeakerLM is a unified multimodal large language model with flexible speaker registration mechanism, trained progressively with multi-stage strategy on large-scale real data.

Result: Outperforms state-of-the-art cascaded baselines on both in-domain and out-of-domain SDR benchmarks, shows strong data scaling capability and generalizability.

Conclusion: SpeakerLM effectively addresses cascaded system limitations through end-to-end joint optimization and flexible speaker registration for diverse real-world scenarios.

Abstract: The Speaker Diarization and Recognition (SDR) task aims to predict “who spoke when and what” within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

[495] AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Yanxi Chen, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Xin Li, Peijie Qiu, Hao Wang, Xuanzhao Dong, Yujian Xiong, Anderson Schneider, Yuriy Nevmyvaka, Yalin Wang

Main category: cs.SD

TL;DR: The paper introduces AHA (Audio Hallucination Alignment) framework to address hallucinations in Large Audio-Language Models by creating a preference dataset through counterfactual hard negative mining, and establishes AHA-Eval benchmark for testing temporal reasoning. The aligned model Qwen-Audio-AHA shows significant improvements on both diagnostic and public benchmarks.

Details

Motivation: Large Audio-Language Models (LALMs) suffer from hallucinations where they generate text not grounded in audio input. The authors identify specific types of grounding failures including Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error.

Method: Introduces the AHA framework that uses counterfactual hard negative mining to construct a high-quality preference dataset, forcing models to distinguish strict acoustic evidence from linguistically plausible fabrications. Also establishes AHA-Eval diagnostic benchmark for testing fine-grained temporal reasoning capabilities. The method is applied to align Qwen2.5-Omni.

Result: The resulting model Qwen-Audio-AHA achieves 13.7% improvement on AHA-Eval diagnostic benchmark. The benefits generalize beyond the diagnostic set with substantial gains on public benchmarks: 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods.

Conclusion: The AHA framework effectively addresses hallucinations in audio-language models through targeted preference alignment, with demonstrated improvements on both specialized diagnostic tests and general benchmarks. The model and dataset are open-sourced for community use.

Abstract: Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods. The model and dataset are open-sourced at https://github.com/LLM-VLM-GSL/AHA.

[496] DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui, Yong Ma, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng

Main category: cs.SD

TL;DR: DisCo-Speech is a zero-shot controllable TTS framework with a disentangled speech codec that separates content, prosody, and timbre, enabling independent control for speech synthesis.

Details

Motivation: Standard codecs in language models entangle timbre and prosody, which prevents independent control in continuation-based TTS systems, limiting flexible speech synthesis.

Method: Uses DisCodec with two-stage design: 1) tri-factor disentanglement via parallel encoders and hybrid losses to separate speech into content, prosody, and timbre subspaces; 2) fusion and reconstruction that merges content and prosody into unified tokens for LM prediction while optimizing reconstruction.

Result: Achieves competitive voice cloning and superior zero-shot prosody control compared to existing methods.

Conclusion: By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis with flexible zero-shot control.

Abstract: Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component DisCodec employs a two-stage design: 1) tri-factor disentanglement to separate speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction that merges content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This allows the LM to perform prosodic continuation from a style prompt while the decoder injects target timbre, enabling flexible zero-shot control. Experiments demonstrate that DisCo-Speech achieves competitive voice cloning and superior zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis.

[497] Towards Practical Automatic Piano Reduction using BERT with Semi-supervised Learning

Wan Ki Wong, Ka Ho To, Chuck-jee Chau, Lucas Wong, Kevin Y. Yip, Irwin King

Main category: cs.SD

TL;DR: Novel semi-supervised learning approach for automatic piano reduction using music simplification followed by harmonization, leveraging MidiBERT framework to create practical piano reductions with minimal labeled data.

Details

Motivation: Piano reduction is important for musicians and composers as musical sketches, but manual creation is time-consuming. Supervised learning requires large labeled datasets which are difficult to obtain, so semi-supervised learning can leverage abundant classical music data with minimal labeling effort.

Method: Two-step approach: music simplification followed by harmonization. Two solutions implemented using MidiBERT machine learning framework for semi-supervised learning of piano reduction tasks.

Result: Solutions can output practical and realistic piano reduction samples with accurate results that require only small adjustments in post-processing. The approach demonstrates feasibility of semi-supervised learning for this task.

Conclusion: The study establishes groundwork for semi-supervised learning in automatic piano reduction, providing a reference framework for future researchers to build upon and produce state-of-the-art results in music transformation tasks.

Abstract: In this study, we present a novel automatic piano reduction method with semi-supervised machine learning. Piano reduction is an important music transformation process, which helps musicians and composers as a musical sketch for performances and analysis. The automation of such is a highly challenging research problem but could bring huge conveniences as manually doing a piano reduction takes a lot of time and effort. While supervised machine learning is often a useful tool for learning input-output mappings, it is difficult to obtain a large quantity of labelled data. We aim to solve this problem by utilizing semi-supervised learning, so that the abundant available data in classical music can be leveraged to perform the task with little or no labelling effort. In this regard, we formulate a two-step approach of music simplification followed by harmonization. We further propose and implement two possible solutions making use of an existing machine learning framework – MidiBERT. We show that our solutions can output practical and realistic samples with an accurate reduction that needs only small adjustments in post-processing. Our study forms the groundwork for the use of semi-supervised learning in automatic piano reduction, where future researchers can take reference to produce more state-of-the-art results.

cs.LG

[498] Horizon Reduction as Information Loss in Offline Reinforcement Learning

Uday Kumar Nidadala, Venkata Bhumika Guthi

Main category: cs.LG

TL;DR: Horizon reduction in offline RL causes fundamental information loss, making optimal policies statistically indistinguishable from suboptimal ones even with infinite data.

Details

Motivation: While horizon reduction is commonly used in offline RL to mitigate long-horizon credit assignment and improve scalability, its theoretical implications remain underdeveloped despite empirical evidence of benefits.

Method: Formalize horizon reduction as learning from fixed-length trajectory segments, prove theoretical limitations, and demonstrate through minimal counterexample MDPs with three structural failure modes.

Result: Horizon reduction induces irrecoverable information loss where optimal policies become statistically indistinguishable from suboptimal ones, even with infinite data and perfect function approximation.

Conclusion: Horizon reduction has intrinsic limitations that cannot be overcome by algorithmic improvements alone, establishing necessary conditions for safe horizon reduction and complementing existing work on conservative objectives and distribution shift.

Abstract: Horizon reduction is a common design strategy in offline reinforcement learning (RL), used to mitigate long-horizon credit assignment, improve stability, and enable scalable learning through truncated rollouts, windowed training, or hierarchical decomposition (Levine et al., 2020; Prudencio et al., 2023; Park et al., 2025). Despite recent empirical evidence that horizon reduction can improve scaling on challenging offline RL benchmarks, its theoretical implications remain underdeveloped (Park et al., 2025). In this paper, we show that horizon reduction can induce fundamental and irrecoverable information loss in offline RL. We formalize horizon reduction as learning from fixed-length trajectory segments and prove that, under this paradigm and any learning interface restricted to fixed-length trajectory segments, optimal policies may be statistically indistinguishable from suboptimal ones even with infinite data and perfect function approximation. Through a set of minimal counterexample Markov decision processes (MDPs), we identify three distinct structural failure modes: (i) prefix indistinguishability leading to identifiability failure, (ii) objective misspecification induced by truncated returns, and (iii) offline dataset support and representation aliasing. Our results establish necessary conditions under which horizon reduction can be safe and highlight intrinsic limitations that cannot be overcome by algorithmic improvements alone, complementing algorithmic work on conservative objectives and distribution shift that addresses a different axis of offline RL difficulty (Fujimoto et al., 2019; Kumar et al., 2020; Gulcehre et al., 2020).

[499] ShrimpXNet: A Transfer Learning Framework for Shrimp Disease Classification with Augmented Regularization, Adversarial Training, and Explainable AI

Israk Hasan Jone, D. M. Rafiun Bin Masud, Promit Sarker, Sayed Fuad Al Labib, Nazmul Islam, Farhad Billah

Main category: cs.LG

TL;DR: Deep learning approach for automated shrimp disease classification using six pretrained models, with ConvNeXt-Tiny achieving 96.88% accuracy on test dataset.

Details

Motivation: Shrimp farming is economically important but severely impacted by disease outbreaks. Automated disease classification methods can provide timely and accurate detection to support sustainable shrimp production.

Method: Used dataset of 1,149 images across four disease classes. Evaluated six pretrained models (ResNet50, EfficientNet, DenseNet201, MobileNet, ConvNeXt-Tiny, Xception) with background removal and standardized preprocessing. Applied FGSM for adversarial training, CutMix and MixUp for augmentation, and Grad-CAM methods for interpretability.

Result: ConvNeXt-Tiny achieved the highest performance with 96.88% accuracy on test dataset. After 1000 iterations, the 99% confidence interval for the model is [0.953,0.971].

Conclusion: The proposed deep learning approach effectively classifies shrimp diseases, with ConvNeXt-Tiny demonstrating superior performance. The method offers a promising automated solution for disease detection in shrimp farming to support sustainable aquaculture.

Abstract: Shrimp is one of the most widely consumed aquatic species globally, valued for both its nutritional content and economic importance. Shrimp farming represents a significant source of income in many regions; however, like other forms of aquaculture, it is severely impacted by disease outbreaks. These diseases pose a major challenge to sustainable shrimp production. To address this issue, automated disease classification methods can offer timely and accurate detection. This research proposes a deep learning-based approach for the automated classification of shrimp diseases. A dataset comprising 1,149 images across four disease classes was utilized. Six pretrained deep learning models, ResNet50, EfficientNet, DenseNet201, MobileNet, ConvNeXt-Tiny, and Xception were deployed and evaluated for performance. The images background was removed, followed by standardized preprocessing through the Keras image pipeline. Fast Gradient Sign Method (FGSM) was used for enhancing the model robustness through adversarial training. While advanced augmentation strategies, including CutMix and MixUp, were implemented to mitigate overfitting and improve generalization. To support interpretability, and to visualize regions of model attention, post-hoc explanation methods such as Grad-CAM, Grad-CAM++, and XGrad-CAM were applied. Exploratory results demonstrated that ConvNeXt-Tiny achieved the highest performance, attaining a 96.88% accuracy on the test dataset. After 1000 iterations, the 99% confidence interval for the model is [0.953,0.971].

[500] Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds

Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto

Main category: cs.LG

TL;DR: IM-PINN: A mesh-free geometric deep learning framework that solves PDEs on complex manifolds by embedding Riemannian metrics into neural networks, achieving superior mass conservation over traditional methods.

Details

Motivation: Traditional methods for simulating reaction-diffusion dynamics on complex manifolds face challenges with mesh generation costs and symplectic drift in time-stepping, limiting their ability to handle extreme curvature variations and anisotropic patterns.

Method: Intrinsic-Metric Physics-Informed Neural Network (IM-PINN) embeds Riemannian metric tensor into automatic differentiation to analytically reconstruct Laplace-Beltrami operator. Uses dual-stream architecture with Fourier feature embeddings to mitigate spectral bias and decouple solution complexity from geometric discretization.

Result: Successfully simulates Gray-Scott model on “Stochastic Cloth” manifold with extreme curvature (K ∈ [-2489, 3580]), recovering “splitting spot” and “labyrinthine” regimes. Achieves global mass conservation error of 0.157 vs SFEM’s 0.258, eliminating mass drift inherent in semi-implicit integration.

Conclusion: IM-PINN provides a memory-efficient, resolution-independent paradigm for simulating biological pattern formation on evolving surfaces, bridging differential geometry with physics-informed machine learning while maintaining thermodynamic consistency.

Abstract: Simulating nonlinear reaction-diffusion dynamics on complex, non-Euclidean manifolds remains a fundamental challenge in computational morphogenesis, constrained by high-fidelity mesh generation costs and symplectic drift in discrete time-stepping schemes. This study introduces the Intrinsic-Metric Physics-Informed Neural Network (IM-PINN), a mesh-free geometric deep learning framework that solves partial differential equations directly in the continuous parametric domain. By embedding the Riemannian metric tensor into the automatic differentiation graph, our architecture analytically reconstructs the Laplace-Beltrami operator, decoupling solution complexity from geometric discretization. We validate the framework on a “Stochastic Cloth” manifold with extreme Gaussian curvature fluctuations ($K \in [-2489, 3580]$), where traditional adaptive refinement fails to resolve anisotropic Turing instabilities. Using a dual-stream architecture with Fourier feature embeddings to mitigate spectral bias, the IM-PINN recovers the “splitting spot” and “labyrinthine” regimes of the Gray-Scott model. Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of $\mathcal{E}_{mass} \approx 0.157$ versus SFEM’s $0.258$, acting as a thermodynamically consistent global solver that eliminates mass drift inherent in semi-implicit integration. The framework offers a memory-efficient, resolution-independent paradigm for simulating biological pattern formation on evolving surfaces, bridging differential geometry and physics-informed machine learning.

[501] SLO-Conditioned Action Routing for Retrieval-Augmented Generation: Objective Ablation and Failure Modes

Bharath Nunepalli

Main category: cs.LG

TL;DR: This paper presents a case study on controlling RAG pipelines to meet service-level objectives (SLOs) like cost, refusal rate, and hallucination risk, using simple policy learning methods rather than proposing new models.

Details

Motivation: RAG systems need practical control mechanisms to balance trade-offs between cost, quality (refusal rate), and safety (hallucination risk) per query to satisfy specific SLOs, but existing work lacks systematic study of these control problems.

Method: Models per-query control as discrete actions (retrieval depth, generation mode: guarded vs. auto, or refuse). Uses SQuAD 2.0 to create offline dataset with accuracy, cost, hallucination/refusal metrics. Evaluates two policy-learning objectives: supervised classification of best action (Argmax-CE) and reward-weighted variant (Argmax-CE-WT).

Result: Fixed baseline (low k, guarded prompting) performs competitively. Learned policies mainly provide cost savings under quality-focused SLOs, but can show refusal collapse under cheap SLOs when refusal is heavily rewarded. No significant advantage over strong baselines.

Conclusion: The paper provides a reproducible case study emphasizing failure modes and reporting conventions for SLO-aware RAG control, highlighting that simple baselines are strong and learned policies have limited benefits with specific failure patterns.

Abstract: Retrieval-augmented generation (RAG) introduces a practical control problem: retrieval depth and generation behavior must be chosen per query to satisfy service-level objectives (SLOs) such as cost, refusal rate, and hallucination risk. This work models per-query control as a small discrete action: choose a retrieval depth and a generation mode (guarded vs. auto), or refuse. An offline logged dataset is constructed from SQuAD 2.0 by executing each action and recording accuracy, token cost, hallucination/refusal indicators, and an SLO-weighted reward. Two simple policy-learning objectives are evaluated: supervised classification of the per-state best action (Argmax-CE) and a reward-weighted variant (Argmax-CE-WT). Across the evaluated settings, a strong fixed baseline (low k, guarded prompting) performs competitively; learned policies mainly provide additional cost savings under a quality-focused SLO and can exhibit refusal collapse under a cheap SLO when refusal is heavily rewarded. The contribution is a reproducible case study of SLO-aware control for RAG pipelines, emphasizing failure modes and reporting conventions rather than proposing a new retriever or language model.

[502] Value-guided action planning with JEPA world models

Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun

Main category: cs.LG

TL;DR: The paper proposes a method to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function is approximated by a distance between state embeddings, leading to improved planning performance.

Details

Motivation: While JEPA provides a promising framework for modeling environmental dynamics through self-supervised prediction, its ability to support effective action planning remains limited. The authors aim to enhance planning capabilities in JEPA world models.

Method: The approach shapes the JEPA representation space so that the negative goal-conditioned value function for a reaching cost is approximated by a distance (or quasi-distance) between state embeddings. The authors introduce a practical training method to enforce this constraint.

Result: The proposed method leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.

Conclusion: By structuring the representation space to align with value functions, JEPA world models can be enhanced for more effective planning, bridging the gap between representation learning and action planning in dynamic environments.

Abstract: Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.

[503] Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware

Jorge L. Ruiz Williams

Main category: cs.LG

TL;DR: Warp Cortex enables million-agent LLM scaling on consumer hardware by decoupling agent logic from memory, reducing memory complexity from O(NL) to O(1) for weights and O(Nk) for context.

Details

Motivation: Current multi-agent LLM frameworks suffer from linear memory scaling (O(N*L)), making parallel reasoning impractical on consumer hardware. There's a need for architectures that can support large-scale multi-agent systems without prohibitive memory requirements.

Method: 1) Asynchronous architecture decoupling agent logic from physical memory; 2) Singleton Weight Sharing to reduce weight memory to O(1); 3) Topological Synapse using TDA-inspired hybrid landmarking to reduce context memory to O(N*k); 4) Treating KV-cache as point cloud in latent space with witness-complex-inspired sparsification; 5) Referential Injection for non-intrusive KV-cache updates.

Result: On a single NVIDIA RTX 4090: 100 concurrent agents at 2.2 GB total VRAM, with theoretical capacity exceeding 1,000 agents before compute latency becomes bottleneck. Memory complexity reduced from O(NL) to O(1) for weights and O(Nk) for context (where k « L).

Conclusion: Warp Cortex enables practical million-agent cognitive scaling on consumer hardware through memory-efficient asynchronous architecture, topological sparsification, and novel KV-cache management techniques, making large-scale multi-agent LLM systems feasible.

Abstract: Current multi-agent Large Language Model (LLM) frameworks suffer from linear memory scaling, rendering “System 2” parallel reasoning impractical on consumer hardware. We present Warp Cortex, an asynchronous architecture that theoretically enables million-agent cognitive scaling by decoupling agent logic from physical memory. Through Singleton Weight Sharing and a novel Topological Synapse–inspired by hybrid landmarking techniques from Topological Data Analysis (TDA)–we reduce memory complexity from O(N * L) to O(1) for weights and O(N * k) for context, where k « L. By treating the KV-cache as a point cloud in latent space, we apply witness-complex-inspired sparsification to preserve persistent homological features of the context manifold. On a single NVIDIA RTX 4090, we empirically demonstrate 100 concurrent agents at 2.2 GB total VRAM, with theoretical capacity exceeding 1,000 agents before compute latency becomes the bottleneck. We further introduce Referential Injection, a non-intrusive KV-cache update mechanism that allows asynchronous sub-agents to influence primary generation without stream disruption.

[504] You Only Need Your Transformer 25% of the Time: Meaning-First Execution for Eliminating Unnecessary Inference

Ryan Shamim

Main category: cs.LG

TL;DR: MFEE introduces a control-plane architecture that selectively invokes transformer inference only when necessary, achieving 78.1% execution reduction while maintaining 100% exact-match correctness.

Details

Motivation: Current AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. This leads to unnecessary computational overhead when correctness could be preserved through alternative pathways.

Method: Meaning-First Execution (MFEE) is a control-plane architecture that operates as a gating layer above existing stacks without modifying models, weights, or parameters. It uses semantic analysis to determine when transformer execution is necessary versus when correctness can be preserved through alternative pathways.

Result: Across 1,000 diverse prompts under deterministic decoding, MFEE achieves 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures.

Conclusion: Execution governance should be established as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques. The paper proves that routers operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs.

Abstract: Modern AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. We reframe inference as a control-plane decision problem: determining when execution is necessary versus when correctness can be preserved through alternative pathways. We introduce Meaning-First Execution (MFEE), a control-plane architecture implementing this framework, selectively invoking transformer inference only when required. MFEE operates as a gating layer above existing stacks without modifying models, weights, or parameters. Across 1,000 diverse prompts under deterministic decoding, MFEE achieves 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Comparative evaluation reveals pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures through semantic analysis. We prove this limitation via Theorem 1: any router operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs. These results establish execution governance as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques.

[505] Learning Resilient Elections with Adversarial GNNs

Hao Xiang Li, Yash Shah, Lorenzo Giusti

Main category: cs.LG

TL;DR: The paper proposes using graph neural networks on bipartite graph representations of elections to learn robust voting rules that maximize social welfare through adversarial training.

Details

Motivation: Voting rules are crucial for democracy and various applications, but designing universal rules that work in all scenarios is challenging. While automated mechanism design and set-invariant architectures show promise, existing approaches lack robustness to strategic voting and have limitations preventing real-world application.

Method: Represent elections as bipartite graphs and use graph neural networks to learn voting rules. Combine architectural improvements with adversarial training to enhance resilience while maximizing social welfare.

Result: The method resolves critical limitations of prior work and shows effectiveness on both synthetic and real-world datasets, opening new frontiers for applying machine learning to real-world elections.

Conclusion: The approach generalizes learned voting rules’ expressive capability and improves their robustness, making machine learning applications to real-world elections more feasible.

Abstract: In the face of adverse motives, it is indispensable to achieve a consensus. Elections have been the canonical way by which modern democracy has operated since the 17th century. Nowadays, they regulate markets, provide an engine for modern recommender systems or peer-to-peer networks, and remain the main approach to represent democracy. However, a desirable universal voting rule that satisfies all hypothetical scenarios is still a challenging topic, and the design of these systems is at the forefront of mechanism design research. Automated mechanism design is a promising approach, and recent works have demonstrated that set-invariant architectures are uniquely suited to modelling electoral systems. However, various concerns prevent the direct application to real-world settings, such as robustness to strategic voting. In this paper, we generalise the expressive capability of learned voting rules, and combine improvements in neural network architecture with adversarial training to improve the resilience of voting rules while maximizing social welfare. We evaluate the effectiveness of our methods on both synthetic and real-world datasets. Our method resolves critical limitations of prior work regarding learning voting rules by representing elections using bipartite graphs, and learning such voting rules using graph neural networks. We believe this opens new frontiers for applying machine learning to real-world elections.

[506] EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference

Aayush Kumar

Main category: cs.LG

TL;DR: EdgeJury is a lightweight ensemble framework using small language models (3B-8B) that improves truthfulness in question answering through parallel generation, cross-review, synthesis, and consistency checking, achieving significant accuracy gains on TruthfulQA and adversarial benchmarks.

Details

Motivation: Hallucinations in question answering are problematic, especially in resource-constrained deployments where large models or retrieval systems are impractical. There's a need for lightweight solutions that can improve truthfulness without relying on frontier-scale models or complex retrieval pipelines.

Method: EdgeJury uses a four-stage ensemble framework: (1) parallel role-specialized generation, (2) anonymized cross-review with structured critiques and rankings, (3) chairman synthesis integrating best content while addressing flagged issues, and (4) claim-level consistency labeling based on inter-model agreement. All using only small instruction-tuned language models (3B-8B).

Result: On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy (+21.4% relative improvement over single 8B baseline). On adversarial EdgeCases set, +48.2% relative gains. Manual analysis shows ~55% reduction in factual hallucination errors. Deployed on Cloudflare Workers AI with 8.4s median end-to-end latency.

Conclusion: Coordinated small-model ensembles can significantly improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs, making them suitable for serverless edge inference deployments.

Abstract: Hallucinations hinder reliable question answering, especially in resource-constrained deployments where frontier-scale models or retrieval pipelines may be impractical. We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness using only small instruction-tuned language models (3B-8B) suitable for serverless edge inference. EdgeJury orchestrates four stages: (1) parallel role-specialized generation, (2) anonymized cross-review with structured critiques and rankings, (3) chairman synthesis that integrates the strongest content while addressing flagged issues, and (4) claim-level consistency labeling based on inter-model agreement. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy (95% CI: 72.8-79.6%), a +21.4% relative improvement over a single 8B baseline (62.8%), and outperforms standard baselines including self-consistency and majority voting under transparent compute accounting (total tokens and platform cost reported). On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains (95% CI: 44.0-52.4%). Manual analysis on 100 incorrect answers shows an approximately 55% reduction in factual hallucination errors versus the single-model baseline. Deployed on Cloudflare Workers AI, EdgeJury achieves 8.4 s median end-to-end latency, demonstrating that coordinated small-model ensembles can improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs.

[507] FedSCAM (Federated Sharpness-Aware Minimization with Clustered Aggregation and Modulation): Scam-resistant SAM for Robust Federated Optimization in Heterogeneous Environments

Sameer Rahil, Zain Abdullah Ahmad, Talha Asif

Main category: cs.LG

TL;DR: FedSCAM is a federated learning algorithm that dynamically adjusts SAM perturbation radius and aggregation weights based on client heterogeneity to handle non-IID data distributions.

Details

Motivation: Statistical heterogeneity (non-IID label distributions) in federated learning causes convergence and generalization challenges. Existing SAM-based FL methods use uniform perturbation radii across clients, ignoring client-specific heterogeneity.

Method: FedSCAM calculates client-specific heterogeneity scores, modulates SAM perturbation radius inversely to these scores, and uses heterogeneity-aware weighted aggregation that prioritizes updates aligned with global optimization direction.

Result: Extensive experiments on CIFAR-10 and Fashion-MNIST with Dirichlet-based label skew show FedSCAM achieves competitive performance in convergence speed and final test accuracy compared to state-of-the-art baselines like FedSAM and FedLESAM.

Conclusion: FedSCAM effectively addresses client heterogeneity in FL by dynamically adjusting perturbation and aggregation strategies, leading to improved convergence and generalization in non-IID settings.

Abstract: Federated Learning (FL) enables collaborative model training across decentralized edge devices while preserving data privacy. However, statistical heterogeneity among clients, often manifested as non-IID label distributions, poses significant challenges to convergence and generalization. While Sharpness-Aware Minimization (SAM) has been introduced to FL to seek flatter, more robust minima, existing approaches typically apply a uniform perturbation radius across all clients, ignoring client-specific heterogeneity. In this work, we propose \textbf{FedSCAM} (Federated Sharpness-Aware Minimization with Clustered Aggregation and Modulation), a novel algorithm that dynamically adjusts the SAM perturbation radius and aggregation weights based on client-specific heterogeneity scores. By calculating a heterogeneity metric for each client and modulating the perturbation radius inversely to this score, FedSCAM prevents clients with high variance from destabilizing the global model. Furthermore, we introduce a heterogeneity-aware weighted aggregation mechanism that prioritizes updates from clients that align with the global optimization direction. Extensive experiments on CIFAR-10 and Fashion-MNIST under various degrees of Dirichlet-based label skew demonstrate that FedSCAM achieves competitive performance among state-of-the-art baselines, including FedSAM, FedLESAM, etc. in terms of convergence speed and final test accuracy.

[508] HyperCLOVA X 8B Omni

NAVER Cloud HyperCLOVA X Team

Main category: cs.LG

TL;DR: HyperCLOVA X 8B Omni is an 8B-parameter any-to-any omnimodal model supporting text, audio, and vision as both inputs and outputs, serving as a unified multimodal assistant.

Details

Motivation: To create a practical any-to-any omni assistant by consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, enabling seamless interaction across text, audio, and vision modalities.

Method: Unifies modalities through shared next-token prediction over interleaved multimodal sequences, with vision and audio encoders injecting continuous embeddings for fine-grained understanding and grounding. The model supports both Korean and English languages.

Result: Demonstrates competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision in both Korean and English evaluations.

Conclusion: HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants, with open-weight release anticipated to support wide research and deployment scenarios.

Abstract: In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.

[509] Harvesting AlphaEarth: Benchmarking the Geospatial Foundation Model for Agricultural Downstream Tasks

Yuchi Ma, Yawen Shen, Anu Swatantran, David B. Lobell

Main category: cs.LG

TL;DR: Evaluation of Google DeepMind’s AlphaEarth Foundation (AEF) geospatial foundation model embeddings for agricultural monitoring tasks in the U.S., comparing against traditional remote sensing models.

Details

Motivation: Geospatial foundation models (GFMs) like AEF show promise but lack comprehensive evaluation in agricultural applications. Previous AEF experiments focused mainly on land cover/use classification, leaving gaps in agricultural monitoring tasks and comparisons with traditional remote sensing models.

Method: Evaluated AEF embeddings on three U.S. agricultural tasks: crop yield prediction, tillage mapping, and cover crop mapping. Compiled datasets from public and private sources across different scales and locations. Trained remote sensing-based models as comparison benchmarks.

Result: AEF-based models showed strong performance across all tasks and were competitive with purpose-built RS-based models in yield prediction and county-level tillage mapping when trained on local data. However, AEF embeddings showed limited spatial transferability, low interpretability, and limited time sensitivity compared to RS-based models.

Conclusion: While AEF embeddings demonstrate strong agricultural monitoring capabilities, their limitations in spatial transferability, interpretability, and time sensitivity require caution in agricultural applications where these factors are important. The study provides valuable guidance for researchers and practitioners considering GFM applications in agriculture.

Abstract: Geospatial foundation models (GFMs) have emerged as a promising approach to overcoming the limitations in existing featurization methods. More recently, Google DeepMind has introduced AlphaEarth Foundation (AEF), a GFM pre-trained using multi-source EOs across continuous time. An annual and global embedding dataset is produced using AEF that is ready for analysis and modeling. The internal experiments show that AEF embeddings have outperformed operational models in 15 EO tasks without re-training. However, those experiments are mostly about land cover and land use classification. Applying AEF and other GFMs to agricultural monitoring require an in-depth evaluation in critical agricultural downstream tasks. There is also a lack of comprehensive comparison between the AEF-based models and traditional remote sensing (RS)-based models under different scenarios, which could offer valuable guidance for researchers and practitioners. This study addresses some of these gaps by evaluating AEF embeddings in three agricultural downstream tasks in the U.S., including crop yield prediction, tillage mapping, and cover crop mapping. Datasets are compiled from both public and private sources to comprehensively evaluate AEF embeddings across tasks at different scales and locations, and RS-based models are trained as comparison models. AEF-based models generally exhibit strong performance on all tasks and are competitive with purpose-built RS-based models in yield prediction and county-level tillage mapping when trained on local data. However, we also find several limitations in current AEF embeddings, such as limited spatial transferability compared to RS-based models, low interpretability, and limited time sensitivity. These limitations recommend caution when applying AEF embeddings in agriculture, where time sensitivity, generalizability, and interpretability is important.

[510] Perch 2.0: The Bittern Lesson for Bioacoustics

Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, Tom Denton

Main category: cs.LG

TL;DR: Perch 2.0 is an enhanced bioacoustics model that expands from avian-only to multi-taxa training, achieving SOTA performance on benchmarks and strong transfer learning capabilities even without marine training data.

Details

Motivation: The motivation is to create a more versatile bioacoustic model that goes beyond avian species classification to handle multiple taxa, while improving performance and transfer learning capabilities across different biological domains.

Method: The model uses supervised training on a large multi-taxa dataset with self-distillation, incorporating a prototype-learning classifier and a new source-prediction training criterion to enhance learning.

Result: Perch 2.0 achieves state-of-the-art performance on BirdSet and BEANS benchmarks, outperforms specialized marine models on marine transfer learning tasks despite minimal marine training data, and provides strong embeddings for transfer learning.

Conclusion: Fine-grained species classification serves as a robust pre-training task for bioacoustics, enabling the model to learn transferable representations that generalize well across different biological domains and tasks.

Abstract: Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.

[511] Path Integral Solution for Dissipative Generative Dynamics

Xidi Wang

Main category: cs.LG

TL;DR: Mechanical systems can generate intelligent language through dissipative quantum dynamics with non-local context aggregation, while conservation laws cause fundamental failure in language generation.

Details

Motivation: To investigate whether purely mechanical systems can generate intelligent language, and to understand the fundamental requirements for coherent text generation in physical systems.

Method: Using Koopman operators with closed-form path integral propagators to analyze dissipative quantum dynamics, with spectral analysis revealing emergent eigenvalue structure separating into decay, growth, and neutral modes.

Result: Dissipative quantum dynamics with non-local context aggregation produces coherent text generation, while Hamiltonian constraints (conservation laws) force elimination of dissipative modes and degrade performance despite unchanged model capacity.

Conclusion: Language generation is established as dissipative quantum field theory, proving mechanical systems acquire intelligence through the combination of dissipation and non-locality, not through conservation.

Abstract: Can purely mechanical systems generate intelligent language? We prove that dissipative quantum dynamics with analytically tractable non-local context aggregation produce coherent text generation, while conservation laws cause fundamental failure. Employing Koopman operators with closed-form path integral propagators, we show irreversible computation fundamentally requires both controlled information dissipation and causal context aggregation. Spectral analysis reveals emergent eigenvalue structure, separating into decay modes (forgetting), growth modes (amplification), and neutral modes (preservation) – the essential ingredients for directed information flow. Hamiltonian constraints force the elimination of these dissipative modes and degrading performance despite unchanged model capacity. This establishes language generation as dissipative quantum field theory, proving mechanical systems acquire intelligence through the combination of dissipation and non-locality, not through conservation.

[512] Universal Battery Degradation Forecasting Driven by Foundation Model Across Diverse Chemistries and Conditions

Joey Chan, Huan Wang, Haoyu Pan, Wei Wu, Zirong Wang, Zhen Chen, Ershun Pan, Min Xie, Lifeng Xi

Main category: cs.LG

TL;DR: A unified battery capacity forecasting framework using Time-Series Foundation Model with LoRA adaptation and physics-guided contrastive learning achieves robust performance across diverse chemistries and operating conditions.

Details

Motivation: Accurate battery capacity fade forecasting is crucial for energy storage system safety and reliability, but strong heterogeneity across cell chemistries, form factors, and operating conditions makes generalization difficult for single models.

Method: Curated 20 public aging datasets into large corpus (1,704 cells, 3.96M cycle segments), used Time-Series Foundation Model backbone with parameter-efficient Low-Rank Adaptation (LoRA) and physics-guided contrastive representation learning to capture shared degradation patterns.

Result: Single unified model achieves competitive/superior accuracy compared to per-dataset baselines, maintains stable performance on unseen chemistries, capacity scales, and operating conditions excluded from training.

Conclusion: TSFM-based architectures show potential as scalable, transferable solution for capacity degradation forecasting in real battery management systems, enabling robust performance across diverse scenarios.

Abstract: Accurate forecasting of battery capacity fade is essential for the safety, reliability, and long-term efficiency of energy storage systems. However, the strong heterogeneity across cell chemistries, form factors, and operating conditions makes it difficult to build a single model that generalizes beyond its training domain. This work proposes a unified capacity forecasting framework that maintains robust performance across diverse chemistries and usage scenarios. We curate 20 public aging datasets into a large-scale corpus covering 1,704 cells and 3,961,195 charge-discharge cycle segments, spanning temperatures from $-5,^{\circ}\mathrm{C}$ to $45,^{\circ}\mathrm{C}$, multiple C-rates, and application-oriented profiles such as fast charging and partial cycling. On this corpus, we adopt a Time-Series Foundation Model (TSFM) backbone and apply parameter-efficient Low-Rank Adaptation (LoRA) together with physics-guided contrastive representation learning to capture shared degradation patterns. Experiments on both seen and deliberately held-out unseen datasets show that a single unified model achieves competitive or superior accuracy compared with strong per-dataset baselines, while retaining stable performance on chemistries, capacity scales, and operating conditions excluded from training. These results demonstrate the potential of TSFM-based architectures as a scalable and transferable solution for capacity degradation forecasting in real battery management systems.

[513] Selective Imperfection as a Generative Framework for Analysis, Creativity and Discovery

Markus J. Buehler

Main category: cs.LG

TL;DR: A generative framework linking hierarchical structures of matter with musical composition through reversible mappings, revealing that novelty emerges when constraints force expansion of viable configurations, with quantitative parallels between optimal musical scales and material properties.

Details

Motivation: To establish a generative framework connecting the hierarchical structures of matter with musical composition, exploring how sound can function as a scientific probe and how musical composition can serve as a blueprint for matter.

Method: Using reversible mappings from molecular spectra to musical tones and from 3D networks to playable instruments; exhaustive enumeration of all 2^12 musical scales; swarm-based AI models for music composition; quantitative analysis of scale properties.

Result: Culturally significant musical systems cluster in a mid-entropy, mid-defect corridor paralleling the Hall-Petch optimum in materials; swarm-based AI models compose music with human-like structural signatures; deep temporal patterns become audible through mappings.

Conclusion: Science and art are generative acts of world-building under constraint, with vibration as a shared grammar organizing structure across scales; novelty emerges when constraints force expansion of viable configurations, with selective imperfection balancing coherence and adaptability.

Abstract: We introduce materiomusic as a generative framework linking the hierarchical structures of matter with the compositional logic of music. Across proteins, spider webs and flame dynamics, vibrational and architectural principles recur as tonal hierarchies, harmonic progressions, and long-range musical form. Using reversible mappings, from molecular spectra to musical tones and from three-dimensional networks to playable instruments, we show how sound functions as a scientific probe, an epistemic inversion where listening becomes a mode of seeing and musical composition becomes a blueprint for matter. These mappings excavate deep time: patterns originating in femtosecond molecular vibrations or billion-year evolutionary histories become audible. We posit that novelty in science and art emerges when constraints cannot be satisfied within existing degrees of freedom, forcing expansion of the space of viable configurations. Selective imperfection provides the mechanism restoring balance between coherence and adaptability. Quantitative support comes from exhaustive enumeration of all 2^12 musical scales, revealing that culturally significant systems cluster in a mid-entropy, mid-defect corridor, directly paralleling the Hall-Petch optimum where intermediate defect densities maximize material strength. Iterating these mappings creates productive collisions between human creativity and physics, generating new information as musical structures encounter evolutionary constraints. We show how swarm-based AI models compose music exhibiting human-like structural signatures such as small-world connectivity, modular integration, long-range coherence, suggesting a route beyond interpolation toward invention. We show that science and art are generative acts of world-building under constraint, with vibration as a shared grammar organizing structure across scales.

[514] Distribution Matching for Graph Quantification Under Structural Covariate Shift

Clemens Damke, Eyke Hüllermeier

Main category: cs.LG

TL;DR: Extends structural importance sampling to KDEy quantification for graph data to handle structural shifts between training and test data.

Details

Motivation: In graph-based quantification learning, traditional prior probability shift assumption fails when there are structural shifts between training and test data from different graph regions. Need methods that adapt to such structural shifts.

Method: Extends structural importance sampling to state-of-the-art KDEy quantification approach. Adapts quantification methods to handle structural shifts in graph data.

Result: Proposed method adapts to structural shifts and outperforms standard quantification approaches in graph settings.

Conclusion: Structural importance sampling extension to KDEy quantification effectively handles structural shifts in graph data, improving performance over existing methods.

Abstract: Graphs are commonly used in machine learning to model relationships between instances. Consider the task of predicting the political preferences of users in a social network; to solve this task one should consider, both, the features of each individual user and the relationships between them. However, oftentimes one is not interested in the label of a single instance but rather in the distribution of labels over a set of instances; e.g., when predicting the political preferences of users, the overall prevalence of a given opinion might be of higher interest than the opinion of a specific person. This label prevalence estimation task is commonly referred to as quantification learning (QL). Current QL methods for tabular data are typically based on the so-called prior probability shift (PPS) assumption which states that the label-conditional instance distributions should remain equal across the training and test data. In the graph setting, PPS generally does not hold if the shift between training and test data is structural, i.e., if the training data comes from a different region of the graph than the test data. To address such structural shifts, an importance sampling variant of the popular adjusted count quantification approach has previously been proposed. In this work, we extend the idea of structural importance sampling to the state-of-the-art KDEy quantification approach. We show that our proposed method adapts to structural shifts and outperforms standard quantification approaches.

[515] A-PINN: Auxiliary Physics-informed Neural Networks for Structural Vibration Analysis in Continuous Euler-Bernoulli Beam

Shivani Saini, Ramesh Kumar Vats, Arup Kumar Sahoo

Main category: cs.LG

TL;DR: A modified Auxiliary PINN framework with balanced adaptive optimizers improves structural vibration analysis, showing 40%+ improvement over baselines in solving Euler-Bernoulli beam equations.

Details

Motivation: PINNs have shown effectiveness in solving differential equation problems, but need improvement for structural vibration analysis where accurate representation of vibration phenomena is critical for reliable predictive analysis and deeper insight into scientific machine learning robustness.

Method: Proposed a modified Auxiliary physics-informed neural network (A-PINN) framework with balanced adaptive optimizers for structural vibration problems, tested through numerical simulations approximating Euler-Bernoulli beam equations under various scenarios.

Result: The model demonstrates enhanced performance with at least 40% improvement over baselines in both numerical stability and predictive accuracy for solving structural vibration problems.

Conclusion: The modified A-PINN with balanced adaptive optimizers provides superior performance for structural vibration analysis, offering improved numerical stability and accuracy for solving Euler-Bernoulli beam equations.

Abstract: Recent advancements in physics-informed neural networks (PINNs) and their variants have garnered substantial focus from researchers due to their effectiveness in solving both forward and inverse problems governed by differential equations. In this research, a modified Auxiliary physics-informed neural network (A-PINN) framework with balanced adaptive optimizers is proposed for the analysis of structural vibration problems. In order to accurately represent structural systems, it is critical for capturing vibration phenomena and ensuring reliable predictive analysis. So, our investigations are crucial for gaining deeper insight into the robustness of scientific machine learning models for solving vibration problems. Further, to rigorously evaluate the performance of A-PINN, we conducted different numerical simulations to approximate the Euler-Bernoulli beam equations under the various scenarios. The numerical results substantiate the enhanced performance of our model in terms of both numerical stability and predictive accuracy. Our model shows improvement of at least 40% over the baselines.

Aditya Sreevatsa K, Arun Kumar Raveendran, Jesrael K Mani, Prakash G Shigli, Rajkumar Rangadore, Narayana Darapaneni, Anwesh Reddy Paduri

Main category: cs.LG

TL;DR: SmartFlow is an AI framework combining Reinforcement Learning and Agentic AI to dynamically rebalance bike-sharing systems, reducing network imbalance by over 95% while minimizing travel distance and improving operational efficiency.

Details

Motivation: The paper addresses the dynamic rebalancing problem in urban bike-sharing services, which requires efficient redistribution of bikes to meet demand, reduce idle time, improve bike availability, and lower operational costs in complex urban mobility networks.

Method: Multi-layered framework with three components: 1) Strategic level using Deep Q-Network (DQN) trained in high-fidelity simulation of NYC Citi Bike network, modeling the problem as Markov Decision Process; 2) Tactical module optimizing multi-leg journeys and scheduling just-in-time dispatches; 3) Communication layer with grounded Agentic AI using LLM to translate plans into actionable instructions for staff.

Result: Evaluation shows SmartFlow reduces network imbalance by over 95%, requires minimal travel distance, achieves strong truck utilization, and bridges machine intelligence with human operations through interpretable instructions.

Conclusion: SmartFlow provides a scalable, interpretable AI-driven solution for urban mobility logistics that effectively integrates machine intelligence with human operations, offering a blueprint for complex urban transportation networks.

Abstract: SmartFlow is a multi-layered framework that integrates Reinforcement Learning and Agentic AI to address the dynamic rebalancing problem in urban bike-sharing services. Its architecture separates strategic, tactical, and communication functions for clarity and scalability. At the strategic level, a Deep Q-Network (DQN) agent, trained in a high-fidelity simulation of New Yorks Citi Bike network, learns robust rebalancing policies by modelling the challenge as a Markov Decision Process. These high-level strategies feed into a deterministic tactical module that optimises multi-leg journeys and schedules just-in-time dispatches to minimise fleet travel. Evaluation across multiple seeded runs demonstrates SmartFlows high efficacy, reducing network imbalance by over 95% while requiring minimal travel distance and achieving strong truck utilisation. A communication layer, powered by a grounded Agentic AI with a Large Language Model (LLM), translates logistical plans into clear, actionable instructions for operational staff, ensuring interpretability and execution readiness. This integration bridges machine intelligence with human operations, offering a scalable solution that reduces idle time, improves bike availability, and lowers operational costs. SmartFlow provides a blueprint for interpretable, AI-driven logistics in complex urban mobility networks.

[517] Quantum Machine Learning Approaches for Coordinated Stealth Attack Detection in Distributed Generation Systems

Osasumwen Cedric Ogiesoba-Eguakun, Suman Rath

Main category: cs.LG

TL;DR: Hybrid quantum-classical model with quantum feature embeddings and classical SVM outperforms classical methods for detecting coordinated stealth attacks in microgrids.

Details

Motivation: Coordinated stealth attacks on distributed generation systems are difficult to detect with standard methods because they mimic normal behavior while manipulating control and measurement signals, posing serious cybersecurity threats to microgrids.

Method: Used simulated measurements to create balanced binary classification dataset with three features (reactive power at DG1, frequency deviation, terminal voltage magnitude). Evaluated classical ML baselines, fully quantum variational classifiers, and hybrid quantum-classical models combining quantum feature embeddings with classical RBF SVM.

Result: Hybrid quantum-classical model achieved best overall performance with modest improvement in accuracy and F1 score over strong classical SVM baseline. Fully quantum models performed worse due to training instability and NISQ hardware limitations.

Conclusion: Hybrid models train more reliably and demonstrate that quantum feature mapping can enhance intrusion detection even when fully quantum learning is not yet practical, showing promise for cybersecurity applications in distributed generation systems.

Abstract: Coordinated stealth attacks are a serious cybersecurity threat to distributed generation systems because they modify control and measurement signals while remaining close to normal behavior, making them difficult to detect using standard intrusion detection methods. This study investigates quantum machine learning approaches for detecting coordinated stealth attacks on a distributed generation unit in a microgrid. High-quality simulated measurements were used to create a balanced binary classification dataset using three features: reactive power at DG1, frequency deviation relative to the nominal value, and terminal voltage magnitude. Classical machine learning baselines, fully quantum variational classifiers, and hybrid quantum classical models were evaluated. The results show that a hybrid quantum classical model combining quantum feature embeddings with a classical RBF support vector machine achieves the best overall performance on this low dimensional dataset, with a modest improvement in accuracy and F1 score over a strong classical SVM baseline. Fully quantum models perform worse due to training instability and limitations of current NISQ hardware. In contrast, hybrid models train more reliably and demonstrate that quantum feature mapping can enhance intrusion detection even when fully quantum learning is not yet practical.

[518] LLMize: A Framework for Large Language Model-Based Numerical Optimization

M. Rizki Oktavian

Main category: cs.LG

TL;DR: LLMize is an open-source Python framework that enables LLM-driven optimization through iterative prompting and in-context learning, allowing natural language formulation of complex optimization problems with domain knowledge injection.

Details

Motivation: LLMs have shown strong reasoning capabilities beyond traditional language tasks, motivating their use for numerical optimization. There's a need for accessible optimization frameworks that can handle complex, domain-specific tasks where constraints and heuristics are difficult to formalize mathematically.

Method: LLMize formulates optimization as a black-box process where candidate solutions are generated in natural language, evaluated by an external objective function, and refined over successive iterations using solution-score feedback. It supports multiple strategies including Optimization by Prompting (OPRO) and hybrid LLM-based methods inspired by evolutionary algorithms and simulated annealing.

Result: The framework was evaluated on convex optimization, linear programming, Traveling Salesman Problem, neural network hyperparameter tuning, and nuclear fuel lattice optimization. Results show LLM-based optimization is not competitive with classical solvers for simple problems but provides a practical approach for complex, domain-specific tasks.

Conclusion: LLMize offers a practical and accessible approach to optimization for complex, domain-specific problems where constraints and heuristics are difficult to formalize mathematically, leveraging LLMs’ natural language capabilities to inject domain knowledge directly.

Abstract: Large language models (LLMs) have recently shown strong reasoning capabilities beyond traditional language tasks, motivating their use for numerical optimization. This paper presents LLMize, an open-source Python framework that enables LLM-driven optimization through iterative prompting and in-context learning. LLMize formulates optimization as a black-box process in which candidate solutions are generated in natural language, evaluated by an external objective function, and refined over successive iterations using solution-score feedback. The framework supports multiple optimization strategies, including Optimization by Prompting (OPRO) and hybrid LLM-based methods inspired by evolutionary algorithms and simulated annealing. A key advantage of LLMize is the ability to inject constraints, rules, and domain knowledge directly through natural language descriptions, allowing practitioners to define complex optimization problems without requiring expertise in mathematical programming or metaheuristic design. LLMize is evaluated on convex optimization, linear programming, the Traveling Salesman Problem, neural network hyperparameter tuning, and nuclear fuel lattice optimization. Results show that while LLM-based optimization is not competitive with classical solvers for simple problems, it provides a practical and accessible approach for complex, domain-specific tasks where constraints and heuristics are difficult to formalize.

[519] LearnAD: Learning Interpretable Rules for Brain Networks in Alzheimer’s Disease Classification

Thomas Andrews, Mark Law, Sara Ahmadi-Abhari, Alessandra Russo

Main category: cs.LG

TL;DR: LearnAD is a neuro-symbolic method for Alzheimer’s disease prediction from MRI data that produces fully interpretable rules while maintaining competitive accuracy with black-box models.

Details

Motivation: The paper addresses the need for interpretable AI models in clinical neuroscience, particularly for Alzheimer's disease diagnosis, where understanding model decisions is crucial for clinical trust and biological insights beyond just predictive accuracy.

Method: LearnAD combines statistical models (Decision Trees, Random Forests, or GNNs) to identify relevant brain connections from MRI data, then uses FastLAS (a symbolic learning system) to learn global, interpretable rules from these identified features.

Result: The best LearnAD instance outperforms Decision Trees, matches SVM accuracy, and performs only slightly below Random Forests and GNNs trained on all features, while remaining fully interpretable. Ablation studies confirm the neuro-symbolic approach maintains comparable performance to pure statistical models while improving interpretability.

Conclusion: LearnAD demonstrates that symbolic learning can provide interpretable Alzheimer’s disease prediction with competitive accuracy, offering deeper understanding of GNN behavior in clinical neuroscience applications.

Abstract: We introduce LearnAD, a neuro-symbolic method for predicting Alzheimer’s disease from brain magnetic resonance imaging data, learning fully interpretable rules. LearnAD applies statistical models, Decision Trees, Random Forests, or GNNs to identify relevant brain connections, and then employs FastLAS to learn global rules. Our best instance outperforms Decision Trees, matches Support Vector Machine accuracy, and performs only slightly below Random Forests and GNNs trained on all features, all while remaining fully interpretable. Ablation studies show that our neuro-symbolic approach improves interpretability with comparable performance to pure statistical models. LearnAD demonstrates how symbolic learning can deepen our understanding of GNN behaviour in clinical neuroscience.

[520] Outlier Detection Using Vector Cosine Similarity by Adding a Dimension

Zhongyang Shen

Main category: cs.LG

TL;DR: Proposes MDOD, a new multi-dimensional outlier detection method using vector cosine similarity on an augmented dataset with an extra dimension.

Details

Motivation: Need for effective outlier detection in multi-dimensional data using geometric relationships rather than traditional distance-based approaches.

Method: Augments original data with zero-valued dimension, creates observation point at origin with non-zero value in new dimension, computes vectors from observation point to measured point and other points, then compares cosine similarities to identify outliers.

Result: Developed MDOD implementation available on PyPI as a Python package for practical use.

Conclusion: Proposes a novel cosine similarity-based approach for multi-dimensional outlier detection with available implementation.

Abstract: We propose a new outlier detection method for multi-dimensional data. The method detects outliers based on vector cosine similarity, using a new dataset constructed by adding a dimension with zero values to the original data. When a point in the new dataset is selected as the measured point, an observation point is created as the origin, differing only in the new dimension by having a non-zero value compared to the measured point. Vectors are then formed from the observation point to the measured point and to other points in the dataset. By comparing the cosine similarities of these vectors, abnormal data can be identified. An optimized implementation (MDOD) is available on PyPI: https://pypi.org/project/mdod/.

[521] FANoS: Friction-Adaptive Nosé–Hoover Symplectic Momentum for Stiff Objectives

Nalin Dhiman

Main category: cs.LG

TL;DR: FANoS is a physics-inspired optimizer combining momentum dynamics, Nosé-Hoover thermostat adaptation, and symplectic integration. It shows promise on stiff nonconvex problems like Rosenbrock but underperforms AdamW and L-BFGS on other benchmarks.

Details

Motivation: To develop an interpretable optimization algorithm by synthesizing structure-preserving integration and thermostat ideas from molecular dynamics into a purely heuristic optimizer for machine learning applications.

Method: FANoS combines: (1) momentum update as discretized second-order dynamical system, (2) Nosé-Hoover-like thermostat variable that adapts friction coefficient using kinetic-energy feedback, (3) semi-implicit symplectic-Euler integrator, optionally with diagonal RMS preconditioner.

Result: On Rosenbrock-100D, FANoS-RMS achieves 1.74×10⁻² vs AdamW’s 48.50 and SGD+momentum’s 90.76, but AdamW with clipping reaches 1.87×10⁻³ and L-BFGS reaches ~4.4×10⁻¹⁰. On ill-conditioned convex quadratics and PINN benchmarks, FANoS underperforms AdamW and shows instability/high variance.

Conclusion: FANoS is an interpretable synthesis of existing ideas that can help on some stiff nonconvex valleys, but is not generally superior to modern baselines and is sensitive to hyperparameter choices.

Abstract: We study a physics-inspired optimizer, \emph{FANoS} (Friction-Adaptive Nosé–Hoover Symplectic momentum), which combines (i) a momentum update written as a discretized second-order dynamical system, (ii) a Nosé–Hoover-like thermostat variable that adapts a scalar friction coefficient using kinetic-energy feedback, and (iii) a semi-implicit (symplectic-Euler) integrator, optionally with a diagonal RMS preconditioner. The method is motivated by structure-preserving integration and thermostat ideas from molecular dynamics, but is used here purely as an optimization heuristic. We provide the algorithm and limited theoretical observations in idealized settings. On the deterministic Rosenbrock-100D benchmark with 3000 gradient evaluations, FANoS-RMS attains a mean final objective value of $1.74\times 10^{-2}$, improving substantially over unclipped AdamW ($48.50$) and SGD+momentum ($90.76$) in this protocol. However, AdamW with gradient clipping is stronger, reaching $1.87\times 10^{-3}$, and L-BFGS reaches $\approx 4.4\times 10^{-10}$. On ill-conditioned convex quadratics and in a small PINN warm-start suite (Burgers and Allen–Cahn), the default FANoS configuration underperforms AdamW and can be unstable or high-variance. Overall, the evidence supports a conservative conclusion: FANoS is an interpretable synthesis of existing ideas that can help on some stiff nonconvex valleys, but it is not a generally superior replacement for modern baselines, and its behavior is sensitive to temperature-schedule and hyperparameter choices.

[522] Hierarchical topological clustering

Ana Carpio, Gema Duro

Main category: cs.LG

TL;DR: Hierarchical topological clustering algorithm works with any distance metric, identifies outliers and arbitrary-shaped clusters from persistence in hierarchy, outperforms other methods on complex datasets.

Details

Motivation: Topological methods can explore data without structural assumptions, but need algorithms that handle arbitrary distances and identify persistent outliers/clusters in complex datasets.

Method: Hierarchical topological clustering algorithm that accepts any distance metric, analyzes persistence of features (outliers/clusters) through the resulting hierarchy.

Result: Demonstrated effectiveness on diverse datasets (images, medical, economic) where outliers are important, showing ability to find meaningful clusters where other methods fail.

Conclusion: The algorithm provides robust topological clustering for complex data with arbitrary distance metrics, successfully identifying persistent outliers and clusters of arbitrary shapes.

Abstract: Topological methods have the potential of exploring data clouds without making assumptions on their the structure. Here we propose a hierarchical topological clustering algorithm that can be implemented with any distance choice. The persistence of outliers and clusters of arbitrary shape is inferred from the resulting hierarchy. We demonstrate the potential of the algorithm on selected datasets in which outliers play relevant roles, consisting of images, medical and economic data. These methods can provide meaningful clusters in situations in which other techniques fail to do so.

[523] When to Ponder: Adaptive Compute Allocation for Code Generation via Test-Time Training

Gihyeon Sim

Main category: cs.LG

TL;DR: PonderTTT uses self-supervised reconstruction loss from TTT layers as a gating mechanism to selectively trigger test-time training updates, achieving 82-89% Oracle Recovery without requiring training or auxiliary networks.

Details

Motivation: Current large language models apply uniform computation to all inputs regardless of difficulty, which is inefficient. There's a need for adaptive computation that can identify when test-time training updates are beneficial without requiring additional training or ground-truth labels.

Method: Proposes PonderTTT, a training-free gating strategy that uses the TTT layer’s self-supervised reconstruction loss to decide when to trigger test-time training updates. Uses only a single scalar threshold calibrated on unlabeled data and continuously adapted via EMA to maintain target update rates. No learned classifier or auxiliary networks required.

Result: Experiments with GPT-2 models (124M to 1.5B) on code language modeling (The Stack v2) show the method achieves 82-89% Oracle Recovery. Significantly outperforms Random Skip baselines (up to 16% lower loss on OOD languages). The signal is inference-compatible and requires no ground-truth labels.

Conclusion: PonderTTT provides an effective, training-free approach for adaptive computation in LLMs by using self-supervised reconstruction loss as a gating mechanism for test-time training, achieving near-oracle performance while being computationally efficient and label-free.

Abstract: Large language models apply uniform computation to all inputs, regardless of difficulty. We propose PonderTTT, a gating strategy using the TTT layer’s self-supervised reconstruction loss to selectively trigger Test-Time Training (TTT) updates. The gating decision itself is training-free–requiring no learned classifier or auxiliary networks; only a single scalar threshold is initially calibrated on unlabeled data and continuously adapted via EMA to maintain target update rates. Our experiments with GPT-2 models (124M to 1.5B) on code language modeling (The Stack v2, teacher-forced perplexity) demonstrate that this signal is inference-compatible, requiring no ground-truth labels. Our Reconstruction Gating achieves 82-89% Oracle Recovery while being fully training-free, significantly outperforming Random Skip baselines (up to 16% lower loss on OOD languages).

[524] A unified framework for geometry-independent operator learning in cardiac electrophysiology simulations

Bei Zhou, Cesare Corrado, Shuang Qian, Maximilian Balmus, Angela W. C. Lee, Cristobal Rodero, Caroline Roney, Marco J. W. Gotte, Luuk H. G. A. Hopman, Mengyun Qiao, Steven Niederer

Main category: cs.LG

TL;DR: A neural operator framework that learns geometry-invariant cardiac electrophysiology by projecting patient-specific heart anatomies onto a standardized anatomical coordinate system, enabling fast predictions across diverse geometries.

Details

Motivation: Existing neural operator approaches for cardiac electrophysiology are limited to structured or weakly deformed domains, restricting their applicability to realistic atrial and ventricular geometries with significant anatomical variability across patients.

Method: Develop a unified operator-learning framework that projects inputs and outputs onto a standardized anatomical coordinate system, decoupling electrophysiological dynamics from mesh topology. This enables geometry-independent learning while preserving physiologically meaningful spatial organization. A GPU-accelerated electrophysiology solver generates over 300,000 high-fidelity simulations across diverse patient-specific left atrial geometries for training.

Result: Achieves mean absolute error of 5.1 ms for predicting full-field local activation time maps with inference time of 0.12 ms per sample, outperforming existing operator learning and convolutional baselines. The framework also demonstrates robust generalization to ventricular geometries beyond the atrial training setting.

Conclusion: The framework establishes a scalable foundation for fast, geometry-invariant cardiac electrophysiology modeling with potential for real-time and population-scale clinical workflows, addressing the fundamental challenge of geometric variability across patient-specific heart anatomies.

Abstract: Learning biophysically accurate solution operators for cardiac electrophysiology is fundamentally challenged by geometric variability across patient-specific heart anatomies. Most existing neural operator approaches are limited to structured or weakly deformed domains, restricting their applicability to realistic atrial and ventricular geometries. Here, we introduce a unified operator-learning framework that projects inputs and outputs onto a standardised anatomical coordinate system, decoupling electrophysiological dynamics from mesh topology. This formulation enables geometry-independent learning while preserving physiologically meaningful spatial organisation, and allows predictions to be interpolated back onto patient-specific geometries for anatomical interpretation. To support large-scale training within the framework, we develop a GPU-accelerated electrophysiology solver and generate over 300,000 high-fidelity simulations across diverse patient-specific left atrial geometries with varied pacing and conduction properties. Within this anatomical coordinate domain, we design a neural operator to predict full-field local activation time maps, achieving a mean absolute error of 5.1 ms and an inference time of 0.12 ms per sample, outperforming existing operator learning and convolutional baselines. We further validate the framework on ventricular geometries, demonstrating robust generalisation beyond the atrial setting. Together, this framework establishes a scalable foundation for fast, geometry-invariant cardiac electrophysiology modelling, with potential relevance for real-time and population-scale clinical workflows.

[525] Dichotomous Diffusion Policy Optimization

Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan

Main category: cs.LG

TL;DR: DIPOLE is a novel RL algorithm for stable diffusion policy optimization that decomposes optimal policy into dichotomous policies (reward maximization and minimization) for flexible greediness control.

Details

Motivation: Existing diffusion policy RL methods face instability from direct value maximization or computational issues from Gaussian approximations requiring many small denoising steps.

Method: Revisits KL-regularized RL objective, formulates greedified policy regularization, decomposes optimal policy into dichotomous policies (maximization and minimization), enables linear combination of scores during inference.

Result: Effective in offline and offline-to-online RL on ExORL and OGBench benchmarks. Successfully trains large VLA model for autonomous driving on NAVSIM benchmark.

Conclusion: DIPOLE enables stable and controllable diffusion policy optimization with flexible greediness control, showing potential for complex real-world applications.

Abstract: Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness.Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.

[526] Conformal Prediction Under Distribution Shift: A COVID-19 Natural Experiment

Chorok Lee

Main category: cs.LG

TL;DR: Conformal prediction coverage degrades under distribution shift, with catastrophic failures linked to single-feature dependence. COVID-19 supply chain study shows coverage drops vary widely (0-86.7%), with quarterly retraining helping vulnerable tasks but not robust ones.

Details

Motivation: To understand why conformal prediction guarantees degrade under distribution shift, particularly using COVID-19 as a natural experiment to study real-world supply chain disruptions.

Method: Study of 8 supply chain tasks during COVID-19 with severe feature turnover (Jaccard ~0). Used SHAP analysis to measure feature importance concentration. Quarterly retraining experiments and exploratory analysis of 4 additional tasks with moderate feature stability.

Result: Coverage drops varied dramatically (0-86.7%). Catastrophic failures correlated with single-feature dependence (rho=0.714, p=0.047). Quarterly retraining restored catastrophic task coverage from 22% to 41% (+19 pp, p=0.04) but provided no benefit for robust tasks (99.8% coverage). Feature stability, not concentration, determines robustness for moderate shifts.

Conclusion: A decision framework: monitor SHAP concentration before deployment; retrain quarterly if vulnerable (>40% concentration); skip retraining if robust. Concentration effects apply specifically to severe distribution shifts.

Abstract: Conformal prediction guarantees degrade under distribution shift. We study this using COVID-19 as a natural experiment across 8 supply chain tasks. Despite identical severe feature turnover (Jaccard approximately 0), coverage drops vary from 0% to 86.7%, spanning two orders of magnitude. Using SHapley Additive exPlanations (SHAP) analysis, we find catastrophic failures correlate with single-feature dependence (rho = 0.714, p = 0.047). Catastrophic tasks concentrate importance in one feature (4.5x increase), while robust tasks redistribute across many (10-20x). Quarterly retraining restores catastrophic task coverage from 22% to 41% (+19 pp, p = 0.04), but provides no benefit for robust tasks (99.8% coverage). Exploratory analysis of 4 additional tasks with moderate feature stability (Jaccard 0.13-0.86) reveals feature stability, not concentration, determines robustness, suggesting concentration effects apply specifically to severe shifts. We provide a decision framework: monitor SHAP concentration before deployment; retrain quarterly if vulnerable (>40% concentration); skip retraining if robust.

[527] Latent-Constrained Conditional VAEs for Augmenting Large-Scale Climate Ensembles

Jacquelyn Shelton, Przemyslaw Polewski, Alexander Robel, Matthew Hoffman, Stephen Price

Main category: cs.LG

TL;DR: LC-CVAE with latent constraints improves climate ensemble generation by enforcing cross-realization homogeneity at anchor locations, enabling better generalization to unseen ensemble members.

Details

Motivation: Large climate-model ensembles are computationally expensive, but many downstream analyses need additional statistically consistent realizations. Existing approaches using vanilla CVAEs trained across realizations fail to generalize to unseen ensemble members due to fragmented latent spaces.

Method: Propose Latent-Constrained CVAE (LC-CVAE) that enforces cross-realization homogeneity of latent embeddings at shared geographic ‘anchor’ locations. Use multi-output Gaussian process regression in latent space to predict latent coordinates at unsampled locations, then decode to generate full time series fields.

Result: Experiments show: (i) instability when training on single realization, (ii) diminishing returns after ~5 realizations, (iii) trade-off between spatial coverage and reconstruction quality linked to average neighbor distance in latent space.

Conclusion: LC-CVAE with latent constraints effectively generates new climate realizations from limited ensemble runs by transferring structure learned across realizations, addressing generalization issues of vanilla CVAEs.

Abstract: Large climate-model ensembles are computationally expensive; yet many downstream analyses would benefit from additional, statistically consistent realizations of spatiotemporal climate variables. We study a generative modeling approach for producing new realizations from a limited set of available runs by transferring structure learned across an ensemble. Using monthly near-surface temperature time series from ten independent reanalysis realizations (ERA5), we find that a vanilla conditional variational autoencoder (CVAE) trained jointly across realizations yields a fragmented latent space that fails to generalize to unseen ensemble members. To address this, we introduce a latent-constrained CVAE (LC-CVAE) that enforces cross-realization homogeneity of latent embeddings at a small set of shared geographic ‘anchor’ locations. We then use multi-output Gaussian process regression in the latent space to predict latent coordinates at unsampled locations in a new realization, followed by decoding to generate full time series fields. Experiments and ablations demonstrate (i) instability when training on a single realization, (ii) diminishing returns after incorporating roughly five realizations, and (iii) a trade-off between spatial coverage and reconstruction quality that is closely linked to the average neighbor distance in latent space.

[528] Attention Needs to Focus: A Unified Perspective on Attention Allocation

Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

Main category: cs.LG

TL;DR: Lazy Attention addresses attention mechanism failures (representational collapse and attention sink) through a unified approach with positional discrimination and Elastic-Softmax, achieving competitive performance and high sparsity.

Details

Motivation: Standard Transformer attention suffers from representational collapse and attention sink issues, which prior work addresses in isolation. The paper aims to provide a unified perspective showing both problems stem from improper attention allocation.

Method: Proposes Lazy Attention with two key components: 1) Positional discrimination across heads and dimensions to mitigate attention overload, and 2) Elastic-Softmax normalization to relax softmax constraints and suppress attention on irrelevant tokens for attention underload.

Result: Experiments on FineWeb-Edu corpus across nine benchmarks show Lazy Attention successfully mitigates attention sink, achieves competitive performance compared to standard attention and modern architectures, and reaches up to 59.58% attention sparsity.

Conclusion: Lazy Attention provides a unified solution to attention mechanism failures by addressing both overload and underload through novel techniques, demonstrating practical improvements in performance and efficiency.

Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root – improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.

[529] MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs

Xingsheng Chen, Regina Zhang, Bo Gao, Xingwei He, Xiaofeng Liu, Pietro Lio, Kwok-Yan Lam, Siu-Ming Yiu

Main category: cs.LG

TL;DR: MODE: A unified framework combining Low-Rank Neural ODEs with Enhanced Mamba architecture for efficient and accurate long-term time series prediction with irregular sampling.

Details

Motivation: Existing time series prediction methods struggle to balance efficiency, scalability, and accuracy, especially for long-range dependencies and irregularly sampled data across domains like finance, healthcare, energy, and environmental modeling.

Method: MODE integrates Low-Rank Neural ODEs with Enhanced Mamba architecture. Input sequences go through Linear Tokenization Layer, then multiple Mamba Encoder blocks with Enhanced Mamba Layers using Causal Convolution, SiLU activation, and Low-Rank Neural ODE enhancement. Includes segmented selective scanning mechanism inspired by pseudo-ODE dynamics for adaptive focus on salient subsequences.

Result: Extensive experiments on benchmark datasets show MODE surpasses existing baselines in both predictive accuracy and computational efficiency.

Conclusion: MODE provides a unified efficient architecture for long-term time series modeling, integrates Mamba’s selective scanning with low-rank Neural ODEs for enhanced temporal representation, and achieves substantial improvements in efficiency and scalability through low-rank approximation and dynamic selective scanning.

Abstract: Time series prediction plays a pivotal role across diverse domains such as finance, healthcare, energy systems, and environmental modeling. However, existing approaches often struggle to balance efficiency, scalability, and accuracy, particularly when handling long-range dependencies and irregularly sampled data. To address these challenges, we propose MODE, a unified framework that integrates Low-Rank Neural Ordinary Differential Equations (Neural ODEs) with an Enhanced Mamba architecture. As illustrated in our framework, the input sequence is first transformed by a Linear Tokenization Layer and then processed through multiple Mamba Encoder blocks, each equipped with an Enhanced Mamba Layer that employs Causal Convolution, SiLU activation, and a Low-Rank Neural ODE enhancement to efficiently capture temporal dynamics. This low-rank formulation reduces computational overhead while maintaining expressive power. Furthermore, a segmented selective scanning mechanism, inspired by pseudo-ODE dynamics, adaptively focuses on salient subsequences to improve scalability and long-range sequence modeling. Extensive experiments on benchmark datasets demonstrate that MODE surpasses existing baselines in both predictive accuracy and computational efficiency. Overall, our contributions include: (1) a unified and efficient architecture for long-term time series modeling, (2) integration of Mamba’s selective scanning with low-rank Neural ODEs for enhanced temporal representation, and (3) substantial improvements in efficiency and scalability enabled by low-rank approximation and dynamic selective scanning.

[530] Practical Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease

Azadeh Alavi, Hamidreza Khalili, Stanley H. Chan, Fatemeh Kouchmeshki, Ross Vlahos

Main category: cs.LG

TL;DR: Quantum and geometric kernel methods improve muscle outcome prediction in COPD preclinical models using minimal biomarkers, outperforming classical baselines while maintaining interpretability.

Details

Motivation: Skeletal muscle dysfunction is a clinically important extra-pulmonary manifestation of COPD linked to inflammation, motivating predictive modeling from minimally invasive biomarkers that can be tracked longitudinally.

Method: Benchmarked classical baselines, geometry-aware symmetric positive definite descriptors with Stein divergence, and quantum kernel models on a small preclinical dataset (213 animals) with blood/BAL measurements predicting muscle weight, specific force, and muscle quality index.

Result: Quantum kernel ridge regression with 4 interpretable inputs achieved test RMSE of 4.41 mg and R² of 0.605 for muscle weight, outperforming ridge baseline (4.70 mg, 0.553). Geometry-informed Stein divergence also showed consistent gains. Screening evaluation achieved ROC-AUC up to 0.90 for detecting low muscle weight.

Conclusion: Geometric and quantum kernel methods provide measurable benefits in low-data, low-feature biomedical prediction while preserving interpretability and transparent model selection, indicating their potential for biomarker-based muscle outcome prediction in COPD.

Abstract: Skeletal muscle dysfunction is a clinically relevant extra-pulmonary manifestation of chronic obstructive pulmonary disease (COPD) and is closely linked to systemic and airway inflammation. This motivates predictive modelling of muscle outcomes from minimally invasive biomarkers that can be acquired longitudinally. We study a small-sample preclinical dataset comprising 213 animals across two conditions (Sham versus cigarette-smoke exposure), with blood and bronchoalveolar lavage fluid measurements and three continuous targets: tibialis anterior muscle weight (milligram: mg), specific force (millinewton: mN), and a derived muscle quality index (mN per mg). We benchmark tuned classical baselines, geometry-aware symmetric positive definite (SPD) descriptors with Stein divergence, and quantum kernel models designed for low-dimensional tabular data. In the muscle-weight setting, quantum kernel ridge regression using four interpretable inputs (blood C-reactive protein, neutrophil count, bronchoalveolar lavage cellularity, and condition) attains a test root mean squared error of 4.41 mg and coefficient of determination of 0.605, improving over a matched ridge baseline on the same feature set (4.70 mg and 0.553). Geometry-informed Stein-divergence prototype distances yield a smaller but consistent gain in the biomarker-only setting (4.55 mg versus 4.79 mg). Screening-style evaluation, obtained by thresholding the continuous outcome at 0.8 times the training Sham mean, achieves an area under the receiver operating characteristic curve (ROC-AUC) of up to 0.90 for detecting low muscle weight. These results indicate that geometric and quantum kernel lifts can provide measurable benefits in low-data, low-feature biomedical prediction problems, while preserving interpretability and transparent model selection.

[531] Complexity-based code embeddings

Rares Folea, Radu Iacob, Emil Slusanschi, Traian Rebedea

Main category: cs.LG

TL;DR: The paper presents a method to convert source code into numerical embeddings using dynamic analysis and complexity functions, then applies these embeddings with XGBoost for multi-label classification of programming competition code snippets.

Details

Motivation: To create a generic approach for representing source code as numerical embeddings that capture algorithmic behavior and complexity, enabling better machine learning applications on code analysis tasks.

Method: Transform source code to numerical embeddings by dynamically analyzing program behavior against different inputs and tailoring multiple generic complexity functions for the analyzed metrics, using r-Complexity based embeddings.

Result: Achieves an average F1-score on a multi-label dataset with 11 classes using XGBoost algorithm, with dataset built from real-world code snippets submitted to Codeforces programming competitions.

Conclusion: The proposed code embedding method effectively captures algorithmic characteristics and enables successful multi-label classification of programming competition solutions using machine learning.

Abstract: This paper presents a generic method for transforming the source code of various algorithms to numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by tailoring multiple generic complexity functions for the analysed metrics. The used algorithms embeddings are based on r-Complexity . Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average F1-score on a multi-label dataset with 11 classes, built using real-world code snippets submitted for programming competitions on the Codeforces platform.

[532] Enhanced Data-Driven Product Development via Gradient Based Optimization and Conformalized Monte Carlo Dropout Uncertainty Estimation

Andrea Thomas Nava, Lijo Johny, Fabio Azzalini, Johannes Schneider, Arianna Casanova

Main category: cs.LG

TL;DR: A data-driven product development framework using neural networks with gradient descent optimization and conformalized uncertainty estimation for multi-property optimization.

Details

Motivation: Many products require simultaneous optimization of multiple correlated properties, and existing methods lack proper uncertainty quantification with finite-sample guarantees for data-driven product development.

Method: Train joint neural networks on past experiments to capture interdependencies among targets, use Projected Gradient Descent to optimize designs, and integrate Conformalised Monte Carlo Dropout (ConfMC) for uncertainty estimation with coverage guarantees.

Result: Extensive experiments on five real-world datasets show the method matches state-of-the-art performance while providing adaptive, non-uniform prediction intervals and eliminating retraining needs when adjusting coverage levels.

Conclusion: The proposed framework effectively combines optimization with reliable uncertainty quantification for multi-property product design, offering practical advantages in real-world applications.

Abstract: Data-Driven Product Development (DDPD) leverages data to learn the relationship between product design specifications and resulting properties. To discover improved designs, we train a neural network on past experiments and apply Projected Gradient Descent to identify optimal input features that maximize performance. Since many products require simultaneous optimization of multiple correlated properties, our framework employs joint neural networks to capture interdependencies among targets. Furthermore, we integrate uncertainty estimation via \emph{Conformalised Monte Carlo Dropout} (ConfMC), a novel method combining Nested Conformal Prediction with Monte Carlo dropout to provide model-agnostic, finite-sample coverage guarantees under data exchangeability. Extensive experiments on five real-world datasets show that our method matches state-of-the-art performance while offering adaptive, non-uniform prediction intervals and eliminating the need for retraining when adjusting coverage levels.

[533] LOFA: Online Influence Maximization under Full-Bandit Feedback using Lazy Forward Selection

Jinyu Xu, Abhishek K. Umrawal

Main category: cs.LG

TL;DR: LOFA (Lazy Online Forward Algorithm) for online influence maximization with full-bandit feedback achieves lower empirical regret by leveraging submodularity.

Details

Motivation: Existing online influence maximization algorithms exploit submodularity but still have room for improvement in empirical regret. The authors aim to develop an algorithm that achieves lower regret in the full-bandit feedback setting where only the influence of chosen seed sets is observed.

Method: Proposes Lazy Online Forward Algorithm (LOFA) that leverages the submodular property of influence functions more effectively. The algorithm operates in an online setting with full-bandit feedback (only observes influence of chosen seed sets, no network structure information).

Result: LOFA achieves superior performance compared to existing bandit algorithms in terms of cumulative regret and instantaneous reward, as demonstrated through experiments on a real-world social network.

Conclusion: The proposed LOFA algorithm effectively leverages submodularity to achieve lower empirical regret in online influence maximization with full-bandit feedback, outperforming existing approaches.

Abstract: We study the problem of influence maximization (IM) in an online setting, where the goal is to select a subset of nodes$\unicode{x2014}$called the seed set$\unicode{x2014}$at each time step over a fixed time horizon, subject to a cardinality budget constraint, to maximize the expected cumulative influence. We operate under a full-bandit feedback model, where only the influence of the chosen seed set at each time step is observed, with no additional structural information about the network or diffusion process. It is well-established that the influence function is submodular, and existing algorithms exploit this property to achieve low regret. In this work, we leverage this property further and propose the Lazy Online Forward Algorithm (LOFA), which achieves a lower empirical regret. We conduct experiments on a real-world social network to demonstrate that LOFA achieves superior performance compared to existing bandit algorithms in terms of cumulative regret and instantaneous reward.

[534] Reliability Under Randomness: An Empirical Analysis of Sparse and Dense Language Models Across Decoding Temperatures

Kabir Grover

Main category: cs.LG

TL;DR: Sparse MoE models with instruction tuning show decoding stability comparable to dense models, while sparse base models degrade with temperature increases - instruction tuning matters more than architectural sparsity for reliability.

Details

Motivation: To investigate whether conditional computation in sparse Mixture-of-Experts architectures amplifies decoding-induced randomness and compromises output stability compared to dense models, especially for reliability-critical applications.

Method: Evaluated three models: OLMoE-7B (sparse base), Mixtral-8x7B (sparse instruction-tuned), and Qwen2.5-3B (dense instruction-tuned) on deterministic arithmetic reasoning tasks. Tested four decoding configurations from greedy to T=1.0, measuring accuracy, format compliance, consistency, and confidence across 9,360 generations.

Result: Sparse instruction-tuned model (Mixtral-8x7B) showed stability comparable to dense instruction-tuned model across all temperatures, while sparse base model (OLMoE-7B) exhibited systematic degradation as temperature increased.

Conclusion: Instruction tuning, not architectural sparsity, is the primary determinant of robustness to decoding randomness on deterministic tasks. Sparse architectures can be safely adopted without sacrificing stability when properly instruction-tuned.

Abstract: The increasing prevalence of sparse Mixture-of-Experts (MoE) architectures in large language models raises important questions regarding their reliability under stochastic decoding. While conditional computation enables substantial gains in computational efficiency, it remains unclear whether the interaction between sparse routing and temperature-based sampling compromises output stability relative to dense architectures. This work investigates whether conditional computation in MoE models amplifies decoding-induced randomness, leading to reduced reliability as temperature increases. We evaluate three representative models: OLMoE-7B (sparse base), Mixtral-8x7B (sparse instruction-tuned), and Qwen2.5-3B (dense instruction-tuned) on deterministic arithmetic reasoning tasks with objectively verifiable answers. Experiments span four decoding configurations, ranging from greedy decoding to T=1.0. Our evaluation encompasses accuracy, format compliance, output consistency across repeated generations, and confidence metrics, totaling 9,360 model generations. Results demonstrate that the sparse instruction-tuned model exhibits stability comparable to the dense instruction-tuned model across all decoding temperatures, while the sparse base model shows systematic degradation as temperature increases. These findings indicate that instruction tuning, rather than architectural sparsity, is the primary determinant of robustness to decoding randomness on deterministic tasks. We discuss the implications of these results for deploying sparse language models in reliability-critical applications, highlighting scenarios in which sparse architectures can be safely adopted without sacrificing output stability.

[535] Adapting Feature Attenuation to NLP

Tianshuo Yang, Ryan Rabinowitz, Terrance E. Boult, Jugal Kalita

Main category: cs.LG

TL;DR: Transformer classifiers like BERT perform well on known categories but struggle with unseen ones. The study adapts computer vision’s COSTARR framework to BERT and GPT-2 for open-set recognition in text, finding it doesn’t significantly outperform simpler baselines like MaxLogit or MSP in this high-class-count setting.

Details

Motivation: Transformer classifiers achieve impressive closed-set accuracy but are brittle when encountering inputs from unseen categories, which is a common scenario in deployed NLP systems. The paper investigates Open-Set Recognition (OSR) for text by adapting computer vision approaches to language models.

Method: Adapted the COSTARR framework (originally for computer vision) to two language models (BERT base and GPT-2) trained on 176 arXiv subject areas. Evaluated COSTARR alongside Maximum Softmax Probability (MSP), MaxLogit, and temperature-scaled free-energy score using OOSA and AUOSCR metrics.

Result: COSTARR extends to NLP without retraining but yields no statistically significant gain over MaxLogit or MSP. Free-energy score lags behind all other scores in this high-class-count setting. The adaptation shows promise but has current limitations.

Conclusion: The study highlights both the promise and current limitations of transplanting vision-centric OSR ideas to language models. It points toward the need for larger backbones and task-tailored attenuation strategies for better open-set recognition in NLP.

Abstract: Transformer classifiers such as BERT deliver impressive closed-set accuracy, yet they remain brittle when confronted with inputs from unseen categories–a common scenario for deployed NLP systems. We investigate Open-Set Recognition (OSR) for text by porting the feature attenuation hypothesis from computer vision to transformers and by benchmarking it against state-of-the-art baselines. Concretely, we adapt the COSTARR framework–originally designed for classification in computer vision–to two modest language models (BERT (base) and GPT-2) trained to label 176 arXiv subject areas. Alongside COSTARR, we evaluate Maximum Softmax Probability (MSP), MaxLogit, and the temperature-scaled free-energy score under the OOSA and AUOSCR metrics. Our results show (i) COSTARR extends to NLP without retraining but yields no statistically significant gain over MaxLogit or MSP, and (ii) free-energy lags behind all other scores in this high-class-count setting. The study highlights both the promise and the current limitations of transplanting vision-centric OSR ideas to language models, and points toward the need for larger backbones and task-tailored attenuation strategies.

Longwei Wang, Mohammad Navid Nayyem, Abdullah Al Rakin, KC Santosh, Chaowei Zhang, Yang Zhou

Main category: cs.LG

TL;DR: The paper introduces an attribution-guided refinement framework that uses LIME explanations to identify and suppress spurious features during training, improving both adversarial robustness and interpretability.

Details

Motivation: Deep learning models in safety-critical domains need both robustness to adversarial attacks and transparent decision-making. The authors identify that spurious features revealed by interpretability methods like LIME contribute disproportionately to adversarial vulnerability, creating an opportunity to leverage interpretability for robustness.

Method: An attribution-guided refinement framework transforms LIME from diagnostic tool to active training signal. The method uses feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop pipeline to systematically suppress spurious features identified through LIME explanations.

Result: Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 show substantial improvements in adversarial robustness and out-of-distribution generalization. The approach doesn’t require additional datasets or model architectures and integrates with standard adversarial training.

Conclusion: The paper demonstrates a direct connection between interpretability and robustness, showing that actively suppressing spurious features identified through interpretability methods can significantly enhance model robustness while maintaining transparency.

Abstract: The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach does not require additional datasets or model architectures and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.

[537] Zero-shot Forecasting by Simulation Alone

Boris N. Oreshkin, Mayank Jauhari, Ravi Kiran Selvam, Malcolm Wolff, Wenhao Pan, Shankar Ramasubramanian, Kin G. Olivares, Tatiana Konstantinova, Andres Potapczynski, Mengfei Cao, Dmitry Efimov, Michael W. Mahoney, Andrew G. Wilson

Main category: cs.LG

TL;DR: SarSim0: A fast SARIMA-based simulator for generating synthetic time series that enables strong zero-shot forecasting performance without real data.

Details

Motivation: Address limitations in zero-shot time-series forecasting: limited/biased real data, leakage-prone evaluation, and privacy/licensing constraints. Need for practical synthetic data generation.

Method: Three-step SARIMA-based simulation: (1) sample stable trajectories from characteristic polynomial region, (2) superposition of multiple paths for multi-seasonality, (3) add heavy-tailed noise for burstiness/intermittency.

Result: SarSim0 is orders of magnitude faster than kernel-based generators, enables training on ~1B simulated series, and achieves strong zero-shot performance on M-Series/GiftEval benchmarks, surpassing statistical forecasters and foundation models.

Conclusion: SarSim0 provides a practical solution for zero-shot forecasting by generating high-quality synthetic data, enabling neural networks to generalize well without real training data, even showing “student-beats-teacher” effects on GiftEval.

Abstract: Zero-shot time-series forecasting holds great promise, but is still in its infancy, hindered by limited and biased data corpora, leakage-prone evaluation, and privacy and licensing constraints. Motivated by these challenges, we propose the first practical univariate time series simulation pipeline which is simultaneously fast enough for on-the-fly data generation and enables notable zero-shot forecasting performance on M-Series and GiftEval benchmarks that capture trend/seasonality/intermittency patterns, typical of industrial forecasting applications across a variety of domains. Our simulator, which we call SarSim0 (SARIMA Simulator for Zero-Shot Forecasting), is based off of a seasonal autoregressive integrated moving average (SARIMA) model as its core data source. Due to instability in the autoregressive component, naive SARIMA simulation often leads to unusable paths. Instead, we follow a three-step procedure: (1) we sample well-behaved trajectories from its characteristic polynomial stability region; (2) we introduce a superposition scheme that combines multiple paths into rich multi-seasonality traces; and (3) we add rate-based heavy-tailed noise models to capture burstiness and intermittency alongside seasonalities and trends. SarSim0 is orders of magnitude faster than kernel-based generators, and it enables training on circa 1B unique purely simulated series, generated on the fly; after which well-established neural network backbones exhibit strong zero-shot generalization, surpassing strong statistical forecasters and recent foundation baselines, while operating under strict zero-shot protocol. Notably, on GiftEval we observe a “student-beats-teacher” effect: models trained on our simulations exceed the forecasting accuracy of the AutoARIMA generating processes.

[538] Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations

Amin Abyaneh, Charlotte Morissette, Mohamad H. Danesh, Anas El Houssaini, David Meger, Gregory Dudek, Hsiu-Chin Lin

Main category: cs.LG

TL;DR: Contractive Diffusion Policies (CDPs) enhance diffusion policies for offline RL by making the sampling dynamics contractive, improving robustness to errors and reducing unwanted action variance.

Details

Motivation: Diffusion policies suffer from solver/score-matching errors, large data requirements, and inconsistent action generation in continuous control settings. These inaccuracies compound and lead to failure, making diffusion policies less reliable for control tasks compared to image generation.

Method: Introduce Contractive Diffusion Policies (CDPs) that induce contractive behavior in diffusion sampling dynamics. This pulls nearby flows closer together, enhancing robustness against errors. Provide theoretical analysis and practical implementation recipe that can be incorporated into existing diffusion policy architectures with minimal modification and computational cost.

Result: CDPs outperform baseline policies across benchmarks, with particularly pronounced benefits under data scarcity conditions. Extensive experiments conducted in both simulation and real-world settings demonstrate improved performance.

Conclusion: Contractive Diffusion Policies address key limitations of diffusion policies in continuous control by making sampling dynamics more robust and stable, especially valuable in data-scarce scenarios while maintaining the flexibility of diffusion-based approaches.

Abstract: Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce Contractive Diffusion Policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.

[539] Data-Driven Assessment of Concrete Mixture Compositions on Chloride Transport via Standalone Machine Learning Algorithms

Mojtaba Aliasghar-Mamaghani, Mohammadreza Khalafi

Main category: cs.LG

TL;DR: This paper uses machine learning to predict how concrete mixture compositions affect chloride ingress over time, which is crucial for assessing infrastructure service life in corrosive environments.

Details

Motivation: Understanding how concrete mixture compositions influence chloride ingress is critical for predicting service life of civil infrastructure in aggressive environments, but traditional methods may not capture complex hidden correlations effectively.

Method: Used both simple (linear regression, KNN, kernel ridge regression) and complex (SVR, Gaussian process regression, MLP, GRU) machine learning algorithms on comprehensive concrete mixture data to predict temporal chloride evolution and uncover hidden correlations.

Result: Kernel ridge regression, Gaussian process regression, and MLP showed high accuracy; GRU performed poorly on test data; GPR provided clear explainable trends; most mixture components inversely correlate with chloride content while few show direct correlation.

Conclusion: Machine learning approaches, particularly GPR, can effectively serve as surrogate models for understanding chloride ingress physics and correlations, supporting enhanced service life prediction for civil infrastructure.

Abstract: This paper employs a data-driven approach to determine the impact of concrete mixture compositions on the temporal evolution of chloride in concrete structures. This is critical for assessing the service life of civil infrastructure subjected to aggressive environments. The adopted methodology relies on several simple and complex standalone machine learning (ML) algorithms, with the primary objective of establishing confidence in the unbiased prediction of the underlying hidden correlations. The simple algorithms include linear regression (LR), k-nearest neighbors (KNN) regression, and kernel ridge regression (KRR). The complex algorithms entail support vector regression (SVR), Gaussian process regression (GPR), and two families of artificial neural networks, including a feedforward network (multilayer perceptron, MLP) and a gated recurrent unit (GRU). The MLP architecture cannot explicitly handle sequential data, a limitation addressed by the GRU. A comprehensive dataset is considered. The performance of ML algorithms is evaluated, with KRR, GPR, and MLP exhibiting high accuracy. Given the diversity of the adopted concrete mixture proportions, the GRU was unable to accurately reproduce the response in the test set. Further analyses elucidate the contributions of mixture compositions to the temporal evolution of chloride. The results obtained from the GPR model unravel latent correlations through clear and explainable trends. The MLP, SVR, and KRR also provide acceptable estimates of the overall trends. The majority of mixture components exhibit an inverse relation with chloride content, while a few components demonstrate a direct correlation. These findings highlight the potential of surrogate approaches for describing the physical processes involved in chloride ingress and the associated correlations, toward the ultimate goal of enhancing the service life of civil infrastructure.

[540] Geometric and Dynamic Scaling in Deep Transformers

Haoran Su, Chenyu You

Main category: cs.LG

TL;DR: Deep Transformers suffer from representation collapse due to geometric drift and monotonic feature accumulation, not just optimization issues. The paper proposes Manifold-Geometric Transformer (MGT) with manifold-constrained updates and deep delta learning to prevent collapse.

Details

Motivation: Existing explanations for Transformer collapse (optimization instability, vanishing gradients) fail to explain why collapse persists even with modern normalization/initialization. The paper argues collapse is fundamentally a geometric problem where standard residual updates cause systematic drift off semantic manifolds and monotonic feature accumulation.

Method: Proposes Manifold-Geometric Transformer (MGT) with two orthogonal principles: 1) Manifold-constrained hyper-connections restrict residual updates to valid local tangent directions to prevent manifold drift; 2) Deep delta learning introduces data-dependent, non-monotonic updates enabling reflection and erasure of redundant features rather than unconditional accumulation.

Result: The framework decouples direction and sign of feature updates, yielding stable geometric evolution across depth. Analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks.

Conclusion: Geometry, not depth itself, is the key limiting factor in deep representation learning. The paper outlines evaluation protocol for Transformers exceeding 100 layers to test this hypothesis, proposing that addressing geometric failures enables stable ultra-deep architectures.

Abstract: Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.

[541] Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study

Ata Akbari Asanjan, Milad Memarzadeh, Bryan Matthews, Nikunj Oza

Main category: cs.LG

TL;DR: RFT helps DNNs learn both low and high frequencies simultaneously, improving reconstruction-based anomaly detection compared to conventional methods.

Details

Motivation: To improve training and inference of DNNs (AEs/VAEs) by using Random Fourier Transformation to address frequency learning limitations in conventional networks.

Method: Applied RFT to AEs/VAEs, analyzed training behavior using Frequency Principle, introduced trainable RFT variant, tested on synthetic datasets and aviation safety dataset.

Result: Models with Fourier transformation outperform conventional counterparts; trainable RFT benefits remain inconclusive compared to random variant.

Conclusion: RFT enhances DNN performance by enabling simultaneous learning of different frequencies, but further research needed on trainable Fourier transformations.

Abstract: In this study, we focus on the training process and inference improvements of deep neural networks (DNNs), specifically Autoencoders (AEs) and Variational Autoencoders (VAEs), using Random Fourier Transformation (RFT). We further explore the role of RFT in model training behavior using Frequency Principle (F-Principle) analysis and show that models with RFT turn to learn low frequency and high frequency at the same time, whereas conventional DNNs start from low frequency and gradually learn (if successful) high-frequency features. We focus on reconstruction-based anomaly detection using autoencoder and variational autoencoder and investigate the RFT’s role. We also introduced a trainable variant of RFT that uses the existing computation graph to train the expansion of RFT instead of it being random. We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection. The results indicate the superiority of models with Fourier transformation compared to the conventional counterpart and remain inconclusive regarding the benefits of using trainable Fourier transformation in contrast to the Random variant.

[542] Expanding the Chaos: Neural Operator for Stochastic (Partial) Differential Equations

Dai Shi, Lequan Lin, Andi Han, Luke Thompson, José Miguel Hernández-Lobato, Zhiyong Wang, Junbin Gao

Main category: cs.LG

TL;DR: Neural operators based on Wiener chaos expansions for learning SDE/SPDE solution operators, enabling single-pass trajectory reconstruction from noise with competitive accuracy across diverse applications.

Details

Motivation: SDEs and SPDEs are fundamental for modeling stochastic dynamics, and developing deep learning models for approximating their solution operators can provide fast solvers and inspire new perspectives on classical learning tasks.

Method: Build on Wiener chaos expansions (WCE) to design neural operator architectures: project driving noise onto orthonormal Wick Hermite features, parameterize deterministic chaos coefficients with neural operators, enabling full solution trajectory reconstruction from noise in a single forward pass.

Result: Validated on diverse problems including classical SPDE benchmarks, diffusion one-step sampling on images, topological interpolation on graphs, financial extrapolation, parameter estimation, and manifold SDEs for flood prediction, demonstrating competitive accuracy and broad applicability.

Conclusion: WCE-based neural operators provide a practical and scalable way to learn SDE/SPDE solution operators across diverse domains, with theoretical foundations in classical WCE results and explicit separation between stochastic forcing and deterministic dynamics.

Abstract: Stochastic differential equations (SDEs) and stochastic partial differential equations (SPDEs) are fundamental tools for modeling stochastic dynamics across the natural sciences and modern machine learning. Developing deep learning models for approximating their solution operators promises not only fast, practical solvers, but may also inspire models that resolve classical learning tasks from a new perspective. In this work, we build on classical Wiener chaos expansions (WCE) to design neural operator (NO) architectures for SPDEs and SDEs: we project the driving noise paths onto orthonormal Wick Hermite features and parameterize the resulting deterministic chaos coefficients with neural operators, so that full solution trajectories can be reconstructed from noise in a single forward pass. On the theoretical side, we investigate the classical WCE results for the class of multi-dimensional SDEs and semilinear SPDEs considered here by explicitly writing down the associated coupled ODE/PDE systems for their chaos coefficients, which makes the separation between stochastic forcing and deterministic dynamics fully explicit and directly motivates our model designs. On the empirical side, we validate our models on a diverse suite of problems: classical SPDE benchmarks, diffusion one-step sampling on images, topological interpolation on graphs, financial extrapolation, parameter estimation, and manifold SDEs for flood prediction, demonstrating competitive accuracy and broad applicability. Overall, our results indicate that WCE-based neural operators provide a practical and scalable way to learn SDE/SPDE solution operators across diverse domains.

[543] Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning

João Morais, Sadjad Alikhani, Akshay Malhotra, Shahab Hamidi-Rad, Ahmed Alkhateeb

Main category: cs.LG

TL;DR: A framework for measuring wireless dataset similarity using task- and model-aware metrics that predict cross-dataset transferability, applied to CSI compression and beam prediction tasks.

Details

Motivation: Need for systematic ways to compare wireless datasets for applications like dataset selection, sim2real comparison, synthetic data generation, and model adaptation decisions.

Method: Task- and model-aware framework using UMAP embeddings combined with Wasserstein/Euclidean distances for unsupervised tasks, and supervised UMAP with imbalance penalties for supervised tasks.

Result: Achieved Pearson correlations >0.85 between dataset distances and transfer performance for CSI compression; developed label-aware distances for beam prediction that outperform traditional baselines.

Conclusion: The framework enables task-relevant wireless dataset comparisons and effectively predicts model transferability across different wireless communication tasks.

Abstract: This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets, enabling applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and informing decisions on model training/adaptation to new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We apply the framework on an unsupervised task, channel state information (CSI) compression, using autoencoders. Using metrics based on UMAP embeddings, combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We also apply the framework to a supervised beam prediction in the downlink using convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.

[544] Coarse-Grained Kullback–Leibler Control of Diffusion-Based Generative AI

Tatsuaki Tsuruyama

Main category: cs.LG

TL;DR: The paper proposes a V-delta projected reverse diffusion method that controls coarse-grained quantities (like blockwise intensity) in diffusion models using an information-theoretic Lyapunov function, maintaining block-mass errors within tolerance while preserving image quality.

Details

Motivation: Current diffusion models lack theoretical understanding of how coarse-grained quantities (blockwise intensity, class proportions) evolve during reverse diffusion. There's a need for explicit control over these structural properties while maintaining image quality.

Method: Transplants an information-theoretic Lyapunov function framework to reverse diffusion processes. Proposes V-delta projected reverse diffusion that projects the process using a leak-tolerant potential V-delta, which acts as an approximate Lyapunov function under small leakage conditions.

Result: Numerical experiments with a toy model (block-constant images and simplified reverse kernel) show the method keeps block-mass error and leak-tolerant potential within prescribed tolerance while achieving comparable pixel-wise accuracy and visual quality to non-projected dynamics.

Conclusion: The study reinterprets generative sampling as decreasing an information potential from noise to data, providing a design principle for reverse diffusion processes with explicit control of coarse-grained quantities while maintaining image quality.

Abstract: Diffusion models and score-based generative models provide a powerful framework for synthesizing high-quality images from noise. However, there is still no satisfactory theory that describes how coarse-grained quantities, such as blockwise intensity or class proportions after partitioning an image into spatial blocks, are preserved and evolve along the reverse diffusion dynamics. In previous work, the author introduced an information-theoretic Lyapunov function V for non-ergodic Markov processes on a state space partitioned into blocks, defined as the minimal Kullback-Leibler divergence to the set of stationary distributions reachable from a given initial condition, and showed that a leak-tolerant potential V-delta with a prescribed tolerance for block masses admits a closed-form expression as a scaling-and-clipping operation on block masses. In this paper, I transplant this framework to the reverse diffusion process in generative models and propose a reverse diffusion scheme that is projected by the potential V-delta (referred to as the V-delta projected reverse diffusion). I extend the monotonicity of V to time-inhomogeneous block-preserving Markov kernels and show that, under small leakage and the V-delta projection, V-delta acts as an approximate Lyapunov function. Furthermore, using a toy model consisting of block-constant images and a simplified reverse kernel, I numerically demonstrate that the proposed method keeps the block-mass error and the leak-tolerant potential within the prescribed tolerance, while achieving pixel-wise accuracy and visual quality comparable to the non-projected dynamics. This study reinterprets generative sampling as a decrease of an information potential from noise to data, and provides a design principle for reverse diffusion processes with explicit control of coarse-grained quantities.

[545] A UCB Bandit Algorithm for General ML-Based Estimators

Yajing Liu, Erkao Bao, Linqi Song

Main category: cs.LG

TL;DR: ML-UCB integrates arbitrary ML models into multi-armed bandit frameworks by modeling learning curve behavior, enabling principled exploration without model-specific theoretical analysis.

Details

Motivation: Deploying sophisticated ML models for sequential decision-making faces challenges due to lack of tractable concentration inequalities needed for principled exploration in bandit frameworks.

Method: ML-UCB assumes Mean Squared Error decreases as a power law in training samples, derives generalized concentration inequality, and integrates any ML model whose learning curve can be empirically characterized.

Result: ML-UCB achieves sublinear regret and demonstrates substantial improvements over LinUCB in experiments on collaborative filtering recommendation systems with synthetic data.

Conclusion: The framework enables principled integration of arbitrary ML models into bandit algorithms by directly modeling learning curve behavior, eliminating need for model-specific theoretical analysis.

Abstract: We present ML-UCB, a generalized upper confidence bound algorithm that integrates arbitrary machine learning models into multi-armed bandit frameworks. A fundamental challenge in deploying sophisticated ML models for sequential decision-making is the lack of tractable concentration inequalities required for principled exploration. We overcome this limitation by directly modeling the learning curve behavior of the underlying estimator. Specifically, assuming the Mean Squared Error decreases as a power law in the number of training samples, we derive a generalized concentration inequality and prove that ML-UCB achieves sublinear regret. This framework enables the principled integration of any ML model whose learning curve can be empirically characterized, eliminating the need for model-specific theoretical analysis. We validate our approach through experiments on a collaborative filtering recommendation system using online matrix factorization with synthetic data designed to simulate a simplified two-tower model, demonstrating substantial improvements over LinUCB

[546] SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models

Yunlin Zeng

Main category: cs.LG

TL;DR: Fine-tuned 32B VLM outperforms 235B base model in generating engaging podcast dialogues from images, using synthetic-to-real training and novel evaluation metrics.

Details

Motivation: VLMs excel at descriptive tasks but struggle with generating engaging, long-form narratives like podcast dialogues. Current metrics (BLEU, ROUGE) fail to capture conversational naturalness, personality, and narrative flow, often rewarding safe outputs over engaging storytelling.

Method: 1) Created novel pipeline for end-to-end visual podcast generation; 2) Fine-tuned Qwen3-VL-32B model on 4,000 curated image-dialogue pairs; 3) Used synthetic-to-real training strategy: trained on high-quality podcast dialogues from SPoRC paired with synthetically generated imagery; 4) Evaluated on real-world photo sequences from VIST; 5) Proposed comprehensive evaluation framework using AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate).

Result: Fine-tuned 32B model significantly outperforms 235B base model: >80% win rate in conversational naturalness, +50% turn length for narrative depth, while maintaining identical visual grounding capabilities (CLIPScore: 20.39). Demonstrates ability to generalize from synthetic training data to real-world visual domains.

Conclusion: The work successfully addresses the challenge of generating engaging podcast dialogues from images, showing that targeted fine-tuning on curated data with synthetic-to-real training can produce superior results compared to much larger base models, especially when evaluated with appropriate metrics that capture conversational quality.

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives – specifically multi-speaker podcast dialogues – remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model’s ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness ($>$80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).

[547] Tiny Machine Learning for Real-Time Aquaculture Monitoring: A Case Study in Morocco

Achraf Hsain, Yahya Zaki, Othman Abaakil, Hibat-allah Bekkar, Yousra Chtouki

Main category: cs.LG

TL;DR: This paper proposes using TinyML on low-power edge devices for real-time automated monitoring and control in aquaculture systems to address water quality, disease, and feed management challenges.

Details

Motivation: Traditional aquaculture monitoring relies on manual labor and is time-consuming, leading to delays in addressing issues like water quality fluctuations, disease outbreaks, and inefficient feed management. There's a need for automated, real-time solutions to improve sustainability and efficiency.

Method: Integration of low-power edge devices with Tiny Machine Learning (TinyML) for real-time automated monitoring and control. The system collects data from sensors measuring pH, temperature, dissolved oxygen, and ammonia levels, and implements anomaly detection with alert mechanisms.

Result: The system enables real-time data collection, automated monitoring, anomaly alerts, and supports decision-making for water treatment optimization, feed distribution, and pattern analysis. It reduces labor requirements and operational costs while improving resource utilization.

Conclusion: TinyML-based solutions show feasibility for aquaculture monitoring and can contribute to more sustainable and efficient farming practices by addressing hardware constraints, sensor selection, algorithm design, and ethical considerations.

Abstract: Aquaculture, the farming of aquatic organisms, is a rapidly growing industry facing challenges such as water quality fluctuations, disease outbreaks, and inefficient feed management. Traditional monitoring methods often rely on manual labor and are time consuming, leading to potential delays in addressing issues. This paper proposes the integration of low-power edge devices using Tiny Machine Learning (TinyML) into aquaculture systems to enable real-time automated monitoring and control, such as collecting data and triggering alarms, and reducing labor requirements. The system provides real-time data on the required parameters such as pH levels, temperature, dissolved oxygen, and ammonia levels to control water quality, nutrient levels, and environmental conditions enabling better maintenance, efficient resource utilization, and optimal management of the enclosed aquaculture space. The system enables alerts in case of anomaly detection. The data collected by the sensors over time can serve for important decision-making regarding optimizing water treatment processes, feed distribution, feed pattern analysis and improve feed efficiency, reducing operational costs. This research explores the feasibility of developing TinyML-based solutions for aquaculture monitoring, considering factors such as sensor selection, algorithm design, hardware constraints, and ethical considerations. By demonstrating the potential benefits of TinyML in aquaculture, our aim is to contribute to the development of more sustainable and efficient farming practices.

[548] Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs

Jing Wang, Peng Zhao, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: The paper proposes a refined analysis framework for weighted strategies in non-stationary parametric bandits, simplifying algorithm design while maintaining or improving regret bounds across various bandit settings including linear, generalized linear, self-concordant bandits, and extending to MDPs with function approximation.

Details

Motivation: Weighted strategies are commonly used for non-stationary bandits with gradual drifting patterns, but previous theoretical studies show they are either computationally inefficient or statistically suboptimal compared to sliding-window/restart strategies. The paper aims to address these limitations by developing a better analysis framework.

Method: Proposes a refined analysis framework for weighted strategies that simplifies derivation and algorithm design. Applies this framework to linear bandits (LB), generalized linear bandits (GLB), self-concordant bandits (SCB), and extends to non-stationary MDPs with function approximation (Linear Mixture MDP and Multinomial Logit Mixture MDP).

Result: For linear bandits: produces simpler weight-based algorithm as efficient as window/restart-based algorithms with same regret. For GLB: achieves improved regret bound of $\tilde{O}(k_μ^{5/4} c_μ^{-3/4} d^{3/4} P_T^{1/4}T^{3/4})$ vs previous $\tilde{O}(k_μ^{2} c_μ^{-1}d^{9/10} P_T^{1/5}T^{4/5})$. Extends framework to non-stationary MDPs with dynamic regret guarantees.

Conclusion: The refined analysis framework successfully addresses previous limitations of weighted strategies, enabling simpler and more efficient algorithms while maintaining or improving regret bounds across various non-stationary parametric bandit settings and extending to MDPs with function approximation.

Abstract: Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a \emph{refined analysis framework}, which simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an $\tilde{O}(k_μ^{5/4} c_μ^{-3/4} d^{3/4} P_T^{1/4}T^{3/4})$ regret, improving the $\tilde{O}(k_μ^{2} c_μ^{-1}d^{9/10} P_T^{1/5}T^{4/5})$ bound in prior work, where $k_μ$ and $c_μ$ characterize the reward model’s nonlinearity, $P_T$ measures the non-stationarity, $d$ and $T$ denote the dimension and time horizon. Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDP and Multinomial Logit (MNL) Mixture MDP. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.

[549] Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

Main category: cs.LG

TL;DR: Flow Equivariant World Models use Lie group flows to unify self-motion and object motion, implementing group equivariance for stable latent representations that outperform diffusion-based models in partially observed video tasks.

Details

Motivation: Current neural network world models ignore the smooth, time-parameterized symmetries present in embodied systems' continuous sensory streams, forcing repeated re-learning of transformations instead of leveraging inherent structure.

Method: Unify self-motion and external object motion as one-parameter Lie group ‘flows’, then implement group equivariance with respect to these transformations to create stable latent world representations.

Result: Significantly outperform state-of-the-art diffusion-based and memory-augmented world models on 2D/3D partially observed video benchmarks, especially for predictable dynamics outside current field of view, with strong generalization beyond training horizon.

Conclusion: Flow equivariance provides a scalable route to data-efficient, symmetry-guided embodied intelligence by structuring world model representations with respect to internal and external motion symmetries.

Abstract: Embodied systems experience the world as ‘a symphony of flows’: a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce ‘Flow Equivariant World Models’, a framework in which both self-motion and external object motion are unified as one-parameter Lie group ‘flows’. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures – particularly when there are predictable world dynamics outside the agent’s current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.

[550] Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces

Bryon Tjanaka, Henry Chen, Matthew C. Fontaine, Stefanos Nikolaidis

Main category: cs.LG

TL;DR: DMS (Discount Model Search) improves QD optimization for high-dimensional measures by using a continuous discount model instead of histogram-based methods, enabling better exploration and new applications with image-based measures.

Details

Motivation: Current QD algorithms struggle with high-dimensional measure spaces because solutions with similar measures get grouped together (distortion), causing stagnation. Existing methods like CMA-MAE use histograms that fail to distinguish between similar measures in high dimensions.

Method: Proposes Discount Model Search (DMS) which uses a smooth, continuous model to represent discount values instead of discrete histogram cells. This model can distinguish between solutions with similar measures in high-dimensional spaces.

Result: DMS outperforms CMA-MAE and other black-box QD algorithms on high-dimensional benchmarks. Enables new QD applications where measure spaces are high-dimensional image spaces, allowing users to specify measures via image datasets rather than hand-designed functions.

Conclusion: DMS addresses the limitations of histogram-based QD methods in high-dimensional measure spaces, providing better exploration and enabling practical applications with complex, high-dimensional measures like images.

Abstract: Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms focus on low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and thus receive identical discount values. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new QD applications by introducing two domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other black-box QD algorithms.

[551] Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding

Nobuyuki Ota

Main category: cs.LG

TL;DR: CDT integrates DNA, RNA, and protein models using Central Dogma principles, achieving 0.503 correlation on CRISPRi data with interpretable attention mechanisms.

Details

Motivation: Current domain-specific foundation models for DNA, RNA, and protein remain isolated, limiting integrated modeling of cellular processes following the Central Dogma's directional information flow.

Method: Central Dogma Transformer (CDT) integrates pre-trained DNA, RNA, and protein language models using directional cross-attention: DNA-to-RNA for transcriptional regulation and RNA-to-Protein for translational relationships, producing unified Virtual Cell Embeddings.

Result: CDT v1 achieved Pearson correlation of 0.503 on CRISPRi enhancer perturbation data from K562 cells, representing 63% of theoretical ceiling (r=0.797). Attention and gradient analyses provided interpretability, identifying biologically relevant features like CTCF binding sites.

Conclusion: AI architectures aligned with biological information flow (Central Dogma) can achieve both predictive accuracy and mechanistic interpretability for integrated cellular modeling.

Abstract: Understanding cellular mechanisms requires integrating information across DNA, RNA, and protein - the three molecular systems linked by the Central Dogma of molecular biology. While domain-specific foundation models have achieved success for each modality individually, they remain isolated, limiting our ability to model integrated cellular processes. Here we present the Central Dogma Transformer (CDT), an architecture that integrates pre-trained language models for DNA, RNA, and protein following the directional logic of the Central Dogma. CDT employs directional cross-attention mechanisms - DNA-to-RNA attention models transcriptional regulation, while RNA-to-Protein attention models translational relationships - producing a unified Virtual Cell Embedding that integrates all three modalities. We validate CDT v1 - a proof-of-concept implementation using fixed (non-cell-specific) RNA and protein embeddings - on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503, representing 63% of the theoretical ceiling set by cross-experiment variability (r = 0.797). Attention and gradient analyses provide complementary interpretive windows: in detailed case studies, these approaches highlight largely distinct genomic regions, with gradient analysis identifying a CTCF binding site that Hi-C data showed as physically contacting both enhancer and target gene. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.

[552] Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

Muhammad Ashad Kabir, Sirajam Munira, Dewan Tasnia Azad, Saleh Mohammed Ikram, Mohammad Habibur Rahman Sarker, Syed Manzoor Ahmed Hanifi

Main category: cs.LG

TL;DR: Developed explainable ML framework for early CKD screening in Bangladesh/South Asia using community data, achieving 90% accuracy with fewer features than existing tools.

Details

Motivation: Existing CKD screening tools from high-income countries underperform in South Asia due to different risk profiles, reliance on simple additive scoring, and focus on advanced-stage CKD patients, failing to detect early-stage disease.

Method: Used first community-based CKD dataset from Bangladesh, evaluated 12 ML classifiers with multiple feature domains, applied 10 feature selection techniques, used 10-fold cross-validation, and performed external validation on three independent datasets from India, UAE, and Bangladesh.

Result: ML model achieved 90.40% balanced accuracy with RFECV-selected features; minimal non-pathology features achieved 89.23% accuracy, outperforming larger feature sets. External validation showed 78-98% sensitivity, substantially better than existing tools.

Conclusion: The explainable ML framework provides accurate, accessible early CKD screening for low-resource South Asian settings with strong generalizability and clinically meaningful predictors identified through SHAP analysis.

Abstract: Early detection of chronic kidney disease (CKD) is essential for preventing progression to end-stage renal disease. However, existing screening tools - primarily developed using populations from high-income countries - often underperform in Bangladesh and South Asia, where risk profiles differ. Most of these tools rely on simple additive scoring functions and are based on data from patients with advanced-stage CKD. Consequently, they fail to capture complex interactions among risk factors and are limited in predicting early-stage CKD. Our objective was to develop and evaluate an explainable machine learning (ML) framework for community-based early-stage CKD screening for low-resource settings, tailored to the Bangladeshi and South Asian population context. We used a community-based dataset from Bangladesh, the first such CKD dataset in South and South Asia, and evaluated twelve ML classifiers across multiple feature domains. Ten complementary feature selection techniques were applied to identify robust, generalizable predictors. The final models were assessed using 10-fold cross-validation. External validation was conducted on three independent datasets from India, the UAE, and Bangladesh. SHAP (SHapley Additive exPlanations) was used to provide model explainability. An ML model trained on an RFECV-selected feature subset achieved a balanced accuracy of 90.40%, whereas minimal non-pathology-test features demonstrated excellent predictive capability with a balanced accuracy of 89.23%, often outperforming larger or full feature sets. Compared with existing screening tools, the proposed models achieved substantially higher accuracy and sensitivity while requiring fewer and more accessible inputs. External validation confirmed strong generalizability with 78% to 98% sensitivity. SHAP interpretation identified clinically meaningful predictors consistent with established CKD risk factors.

[553] Learning from Historical Activations in Graph Neural Networks

Yaniv Galron, Hadar Sinai, Haggai Maron, Moshe Eliasof

Main category: cs.LG

TL;DR: HISTOGRAPH is a novel two-stage attention-based graph pooling method that leverages historical activations from all GNN layers, not just the last layer, to improve graph classification performance and robustness in deep architectures.

Details

Motivation: Current graph pooling methods only use the last GNN layer's features, potentially under-utilizing important historical activations from previous layers. This gap is especially problematic when node representations shift significantly across layers and in cases of over-smoothing in deep GNN architectures.

Method: HISTOGRAPH introduces a two-stage attention-based final aggregation layer: 1) unified layer-wise attention over intermediate activations from all GNN layers, followed by 2) node-wise attention. This approach models the evolution of node representations across layers to leverage both activation history and graph structure.

Result: Empirical results on multiple graph classification benchmarks show that HISTOGRAPH consistently improves over traditional techniques and offers strong robustness in deep GNNs.

Conclusion: By effectively utilizing historical graph activations from all layers, HISTOGRAPH addresses limitations of previous pooling methods and demonstrates superior performance and robustness for graph classification tasks.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node’s representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs.

[554] Wittgenstein’s Family Resemblance Clustering Algorithm

Golbahar Amanpour, Benyamin Ghojogh

Main category: cs.LG

TL;DR: The paper introduces WFR clustering algorithm based on Wittgenstein’s family resemblance concept, which clusters data by building resemblance graphs without needing to specify cluster count or shape assumptions.

Details

Motivation: To develop a clustering algorithm that doesn't require prior knowledge of cluster count or shape assumptions, inspired by Wittgenstein's philosophical concept of family resemblance where categories are defined by overlapping similarities rather than rigid definitions.

Method: Proposes Wittgenstein’s Family Resemblance (WFR) clustering algorithm and its kernel variant. The method computes resemblance scores between neighboring data instances, thresholds these scores to construct a resemblance graph, and identifies clusters as connected components of this graph.

Result: Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm that successfully clusters data without requiring prior knowledge of cluster count or assumptions about cluster shapes.

Conclusion: Wittgenstein’s philosophical concept of family resemblance provides a valuable framework for developing flexible clustering algorithms that can handle complex, non-linear data structures without restrictive assumptions.

Abstract: This paper, introducing a novel method in philomatics, draws on Wittgenstein’s concept of family resemblance from analytic philosophy to develop a clustering algorithm for machine learning. According to Wittgenstein’s Philosophical Investigations (1953), family resemblance holds that members of a concept or category are connected by overlapping similarities rather than a single defining property. Consequently, a family of entities forms a chain of items sharing overlapping traits. This philosophical idea naturally lends itself to a graph-based approach in machine learning. Accordingly, we propose the Wittgenstein’s Family Resemblance (WFR) clustering algorithm and its kernel variant, kernel WFR. This algorithm computes resemblance scores between neighboring data instances, and after thresholding these scores, a resemblance graph is constructed. The connected components of this graph define the resulting clusters. Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm that does not require prior knowledge of the number of clusters or assumptions about their shapes.

[555] Self-Training the Neurochaos Learning Algorithm

Anusree M, Akhila Henry, Pramod P Nair

Main category: cs.LG

TL;DR: A hybrid semi-supervised learning architecture combining Neurochaos Learning with Self-Training improves classification performance on limited, nonlinear, and imbalanced datasets.

Details

Motivation: In many real-world applications, labeled data is scarce and expensive to obtain while unlabeled data is abundant. Traditional supervised learning methods perform poorly with limited labeled data or imbalanced datasets.

Method: Proposes Self-Training Neurochaos Learning (NL+ST) architecture that integrates Neurochaos Learning (transforms input features into chaos-based ring-rate representations capturing nonlinear relationships) with threshold-based Self-Training (iteratively expands labeled set using high-confidence pseudo-labeled samples).

Result: The NL+ST architecture consistently outperforms standalone ST models across 10 benchmark datasets with 5 ML classifiers, achieving significant performance gains on limited, nonlinear, and imbalanced datasets: Iris (188.66%), Wine (158.58%), and Glass Identification (110.48%).

Conclusion: Chaos-based feature extraction combined with semi-supervised learning enhances generalization, robustness, and classification accuracy in low-data scenarios.

Abstract: In numerous practical applications, acquiring substantial quantities of labelled data is challenging and expensive, but unlabelled data is readily accessible. Conventional supervised learning methods frequently underperform in scenarios characterised by little labelled data or imbalanced datasets. This study introduces a hybrid semi-supervised learning (SSL) architecture that integrates Neurochaos Learning (NL) with a threshold-based Self-Training (ST) method to overcome this constraint. The NL architecture converts input characteristics into chaos-based ring-rate representations that encapsulate nonlinear relationships within the data, whereas ST progressively enlarges the labelled set utilising high-confidence pseudo-labelled samples. The model’s performance is assessed using ten benchmark datasets and five machine learning classifiers, with 85% of the training data considered unlabelled and just 15% utilised as labelled data. The proposed Self-Training Neurochaos Learning (NL+ST) architecture consistently attains superior performance gain relative to standalone ST models, especially on limited, nonlinear and imbalanced datasets like Iris (188.66%), Wine (158.58%) and Glass Identification (110.48%). The results indicate that using chaos-based feature extraction with SSL improves generalisation, resilience, and classification accuracy in low-data contexts.

[556] Evo-TFS: Evolutionary Time-Frequency Domain-Based Synthetic Minority Oversampling Approach to Imbalanced Time Series Classification

Wenbin Pei, Ruohao Dai, Bing Xue, Mengjie Zhang, Qiang Zhang, Yiu-Ming Cheung

Main category: cs.LG

TL;DR: Evo-TFS is an evolutionary oversampling method for imbalanced time series classification that uses strongly typed genetic programming to generate diverse minority-class samples while preserving both time- and frequency-domain characteristics.

Details

Motivation: Existing deep learning methods for time series classification assume balanced data distributions and perform poorly on imbalanced data, while traditional oversampling methods using linear interpolation fail to preserve temporal dynamics and generate diverse samples.

Method: Evo-TFS uses strongly typed genetic programming to evolve diverse, high-quality time series samples. It incorporates both time-domain and frequency-domain characteristics in its fitness function to guide the evolutionary process for generating minority-class samples.

Result: Experiments on imbalanced time series datasets show that Evo-TFS outperforms existing oversampling methods and significantly enhances the performance of both time-domain and frequency-domain classifiers.

Conclusion: Evo-TFS effectively addresses the imbalanced time series classification problem by generating diverse, high-quality minority-class samples that preserve important temporal and spectral characteristics, leading to improved classifier performance.

Abstract: Time series classification is a fundamental machine learning task with broad real-world applications. Although many deep learning methods have proven effective in learning time-series data for classification, they were originally developed under the assumption of balanced data distributions. Once data distribution is uneven, these methods tend to ignore the minority class that is typically of higher practical significance. Oversampling methods have been designed to address this by generating minority-class samples, but their reliance on linear interpolation often hampers the preservation of temporal dynamics and the generation of diverse samples. Therefore, in this paper, we propose Evo-TFS, a novel evolutionary oversampling method that integrates both time- and frequency-domain characteristics. In Evo-TFS, strongly typed genetic programming is employed to evolve diverse, high-quality time series, guided by a fitness function that incorporates both time-domain and frequency-domain characteristics. Experiments conducted on imbalanced time series datasets demonstrate that Evo-TFS outperforms existing oversampling methods, significantly enhancing the performance of time-domain and frequency-domain classifiers.

[557] Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

Main category: cs.LG

TL;DR: ARISE uses LLMs to enhance categorical data clustering by incorporating semantic knowledge from external language models to create better similarity measures between categorical values.

Details

Motivation: Categorical data clustering suffers from poor similarity measures since categorical values lack inherent ordering/distance. Existing methods rely on within-dataset co-occurrence patterns which become unreliable with limited samples, leaving semantic context underexplored.

Method: ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) uses LLMs to describe attribute values for representation enhancement, then combines LLM-enhanced embeddings with original data to explore semantically prominent clusters.

Result: Experiments on eight benchmark datasets show consistent improvements over seven representative counterparts, with gains of 19-27%.

Conclusion: Incorporating external semantic knowledge from LLMs effectively bridges the semantic gap in categorical data clustering, leading to significantly improved clustering quality.

Abstract: Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%. Code is available at https://github.com/develop-yang/ARISE

[558] MentalGame: Predicting Personality-Job Fitness for Software Developers Using Multi-Genre Games and Machine Learning Approaches

Soroush Elyasi, Arya VarastehNezhad, Fattaneh Taghiyareh

Main category: cs.LG

TL;DR: Game-based assessment using machine learning predicts software development suitability with 94% accuracy from gameplay behavior, offering a less biased alternative to traditional personality questionnaires.

Details

Motivation: Traditional personality assessments (self-report questionnaires) have limitations including response bias, fatigue, and intentional distortion. There's a need for more objective, engaging, and scalable assessment methods for career guidance and personnel selection.

Method: 1. Identified developer-relevant personality/behavioral traits through literature review and study of professional software engineers. 2. Designed custom mobile game to elicit behaviors related to problem solving, planning, adaptability, persistence, time management, and information seeking. 3. Used two-phase machine learning modeling strategy to predict suitability exclusively from gameplay-derived behavioral features.

Result: Model achieved up to 97% precision and 94% accuracy. Proper candidates exhibited distinct gameplay patterns: more wins in puzzle-based games, more side challenges, more frequent menu navigation, fewer pauses, retries, and surrender actions.

Conclusion: Implicit behavioral traces captured during gameplay can effectively predict software development suitability without explicit personality testing. Serious games offer a scalable, engaging, and less biased alternative for career assessment.

Abstract: Personality assessment in career guidance and personnel selection traditionally relies on self-report questionnaires, which are susceptible to response bias, fatigue, and intentional distortion. Game-based assessment offers a promising alternative by capturing implicit behavioral signals during gameplay. This study proposes a multi-genre serious-game framework combined with machine-learning techniques to predict suitability for software development roles. Developer-relevant personality and behavioral traits were identified through a systematic literature review and an empirical study of professional software engineers. A custom mobile game was designed to elicit behaviors related to problem solving, planning, adaptability, persistence, time management, and information seeking. Fine-grained gameplay event data were collected and analyzed using a two-phase modeling strategy where suitability was predicted exclusively from gameplay-derived behavioral features. Results show that our model achieved up to 97% precision and 94% accuracy. Behavioral analysis revealed that proper candidates exhibited distinct gameplay patterns, such as more wins in puzzle-based games, more side challenges, navigating menus more frequently, and exhibiting fewer pauses, retries, and surrender actions. These findings demonstrate that implicit behavioral traces captured during gameplay is promising in predicting software-development suitability without explicit personality testing, supporting serious games as a scalable, engaging, and less biased alternative for career assessment.

[559] Sparse Bayesian Message Passing under Structural Uncertainty

Yoonhyuk Choi, Jiho Choi, Chanran Kim, Yumin Lee, Hawon Shin, Yeowon Jeon, Minjeong Kim, Jiwoo Kang

Main category: cs.LG

TL;DR: A Bayesian graph neural network that models structural uncertainty via posterior distributions over signed adjacency matrices to handle heterophily and edge noise.

Details

Motivation: Real-world graphs often suffer from heterophily (label-disassortative connections) and unreliable graph structures, which challenge traditional GNNs that rely on fixed adjacency or simple regularization.

Method: Models posterior distribution over signed adjacency matrices (positive, negative, or absent edges), uses sparse signed message passing network, combines posterior marginalization with sparse signed message aggregation from Bayesian perspective.

Result: Outperforms strong baseline models on heterophilic benchmarks under both synthetic and real-world structural noise.

Conclusion: Provides principled Bayesian approach to handle edge noise and heterophily in graph neural networks through explicit modeling of structural uncertainty.

Abstract: Semi-supervised learning on real-world graphs is frequently challenged by heterophily, where the observed graph is unreliable or label-disassortative. Many existing graph neural networks either rely on a fixed adjacency structure or attempt to handle structural noise through regularization. In this work, we explicitly capture structural uncertainty by modeling a posterior distribution over signed adjacency matrices, allowing each edge to be positive, negative, or absent. We propose a sparse signed message passing network that is naturally robust to edge noise and heterophily, which can be interpreted from a Bayesian perspective. By combining (i) posterior marginalization over signed graph structures with (ii) sparse signed message aggregation, our approach offers a principled way to handle both edge noise and heterophily. Experimental results demonstrate that our method outperforms strong baseline models on heterophilic benchmarks under both synthetic and real-world structural noise.

[560] Adaptive Conformal Prediction via Bayesian Uncertainty Weighting for Hierarchical Healthcare Data

Marzieh Amiri Shahbazi, Ali Baheri, Nasibeh Azadeh-Fard

Main category: cs.LG

TL;DR: Hybrid Bayesian-conformal framework for healthcare predictions that combines Bayesian hierarchical random forests with group-aware conformal calibration to achieve both distribution-free coverage guarantees and risk-adaptive precision.

Details

Motivation: Clinical decision-making requires uncertainty quantification with both distribution-free coverage guarantees and risk-adaptive precision, which existing methods fail to provide jointly. There's a need for methods that can offer rigorous coverage validity while adapting precision based on risk levels.

Method: Integrates Bayesian hierarchical random forests with group-aware conformal calibration. Uses posterior uncertainties to weight conformity scores while maintaining rigorous coverage validity. The hybrid approach combines Bayesian modeling with conformal prediction techniques.

Result: Achieved target coverage (94.3% vs 95% target) with adaptive precision: 21% narrower intervals for low-uncertainty cases while appropriately widening for high-risk predictions. Evaluated on 61,538 admissions across 3,793 U.S. hospitals and 4 regions. Demonstrated that well-calibrated Bayesian uncertainties alone severely under-cover (14.1%).

Conclusion: The hybrid Bayesian-conformal framework addresses fundamental limitations in healthcare uncertainty quantification, enabling risk-stratified clinical protocols, efficient resource planning for high-confidence predictions, and conservative allocation with enhanced oversight for uncertain cases across diverse healthcare settings.

Abstract: Clinical decision-making demands uncertainty quantification that provides both distribution-free coverage guarantees and risk-adaptive precision, requirements that existing methods fail to jointly satisfy. We present a hybrid Bayesian-conformal framework that addresses this fundamental limitation in healthcare predictions. Our approach integrates Bayesian hierarchical random forests with group-aware conformal calibration, using posterior uncertainties to weight conformity scores while maintaining rigorous coverage validity. Evaluated on 61,538 admissions across 3,793 U.S. hospitals and 4 regions, our method achieves target coverage (94.3% vs 95% target) with adaptive precision: 21% narrower intervals for low-uncertainty cases while appropriately widening for high-risk predictions. Critically, we demonstrate that well-calibrated Bayesian uncertainties alone severely under-cover (14.1%), highlighting the necessity of our hybrid approach. This framework enables risk-stratified clinical protocols, efficient resource planning for high-confidence predictions, and conservative allocation with enhanced oversight for uncertain cases, providing uncertainty-aware decision support across diverse healthcare settings.

[561] The Dependency Divide: An Interpretable Machine Learning Framework for Profiling Student Digital Satisfaction in the Bangladesh Context

Md Muhtasim Munif Fahim, Humyra Ankona, Md Monimul Huq, Md Rezaul Karim

Main category: cs.LG

TL;DR: The study introduces “Dependency Divide” framework showing that highly engaged students become vulnerable to infrastructure failures, challenging assumptions that engagement always benefits learners in digital environments.

Details

Motivation: Traditional digital divide frameworks fail to explain why satisfaction with digital learning platforms varies among students with equal connectivity in resource-constrained contexts.

Method: Cross-sectional study of 396 Bangladeshi university students using three-stage analysis: K-prototypes clustering for student profiles, profile-specific Random Forest models with SHAP/ALE analysis, and interaction analysis with propensity score matching.

Result: Three student profiles identified: Casually Engaged (58%), Efficient Learners (35%), Hyper-Engaged (7%). Dependency Divide confirmed: engagement increases satisfaction only with reliable infrastructure. Hyper-Engaged students most vulnerable. Targeted reliability improvements yield 2.06x greater returns than uniform interventions.

Conclusion: In fragile infrastructure contexts, capability becomes liability. Policies must prioritize reliability for dependency-prone users, establish contingency systems, and educate about dependency risks rather than uniformly promoting engagement.

Abstract: Background: While digital access has expanded rapidly in resource-constrained contexts, satisfaction with digital learning platforms varies significantly among students with seemingly equal connectivity. Traditional digital divide frameworks fail to explain these variations. Purpose: This study introduces the “Dependency Divide”, a novel framework proposing that highly engaged students become conditionally vulnerable to infrastructure failures, challenging assumptions that engagement uniformly benefits learners in post-access environments. Methods: We conducted a cross-sectional study of 396 university students in Bangladesh using a three-stage analytical approach: (1) stability-validated K-prototypes clustering to identify student profiles, (2) profile-specific Random Forest models with SHAP and ALE analysis to determine satisfaction drivers, and (3) formal interaction analysis with propensity score matching to test the Dependency Divide hypothesis. Results: Three distinct profiles emerged: Casually Engaged (58%), Efficient Learners (35%), and Hyper-Engaged (7%). A significant interaction between educational device time and internet reliability (\b{eta} = 0.033, p = 0.028) confirmed the Dependency Divide: engagement increased satisfaction only when infrastructure remained reliable. Hyper-Engaged students showed greatest vulnerability despite or because of their sophisticated digital workflows. Policy simulations demonstrated that targeted reliability improvements for high-dependency users yielded 2.06 times greater returns than uniform interventions. Conclusions: In fragile infrastructure contexts, capability can become liability. Digital transformation policies must prioritize reliability for dependency-prone users, establish contingency systems, and educate students about dependency risks rather than uniformly promoting engagement.

[562] Benchmarking the Computational and Representational Efficiency of State Space Models against Transformers on Long-Context Dyadic Sessions

Abidemi Koledoye, Chinemerem Unachukwu, Gold Nwobu, Hasin Rana

Main category: cs.LG

TL;DR: Mamba SSM vs LLaMA Transformer benchmark on long-context sequences shows SSMs offer linear O(N) complexity vs Transformer’s quadratic O(N²), with comprehensive evaluation on computational and representational efficiency using therapy sessions as test case.

Details

Motivation: State Space Models (SSMs) have emerged as a promising alternative to Transformers for long-context sequence modeling due to their linear computational complexity, but there's a need for comprehensive benchmarking to establish precise conditions where SSMs offer advantages over Transformers.

Method: Comprehensive benchmarking study comparing Mamba SSM against LLaMA Transformer on long-context sequences using dyadic therapy sessions as test case. Evaluation across two dimensions: computational efficiency (memory usage and inference speed from 512 to 8,192 tokens) and representational efficiency (hidden state dynamics and attention patterns).

Result: The study provides actionable insights for practitioners working with long-context applications, establishing precise conditions under which SSMs offer advantages over Transformers, though specific numerical results aren’t detailed in the abstract.

Conclusion: SSMs present a viable alternative to Transformers for long-context modeling with linear computational complexity, and the benchmarking provides practical guidance for when to choose SSMs over Transformers based on specific application requirements and sequence lengths.

Abstract: State Space Models (SSMs) have emerged as a promising alternative to Transformers for long-context sequence modeling, offering linear $O(N)$ computational complexity compared to the Transformer’s quadratic $O(N^2)$ scaling. This paper presents a comprehensive benchmarking study comparing the Mamba SSM against the LLaMA Transformer on long-context sequences, using dyadic therapy sessions as a representative test case. We evaluate both architectures across two dimensions: (1) computational efficiency, where we measure memory usage and inference speed from 512 to 8,192 tokens, and (2) representational efficiency, where we analyze hidden state dynamics and attention patterns. Our findings provide actionable insights for practitioners working with long-context applications, establishing precise conditions under which SSMs offer advantages over Transformers.

[563] Accelerated Full Waveform Inversion by Deep Compressed Learning

Maayan Gelboim, Amir Adler, Mauricio Araya-Polo

Main category: cs.LG

TL;DR: A deep learning method reduces FWI computational cost by hierarchically selecting the most relevant seismic data through compressed learning, autoencoder representation, and clustering, achieving better results than random sampling with only 10% of data.

Details

Motivation: Industrial-scale Full Waveform Inversion requires teraflop-level data storage, making complex subsurface analysis and multi-scenario exploration computationally prohibitive. There's a need to reduce FWI dimensionality while maintaining accuracy.

Method: Uses deep neural network with binarized sensing layer for compressed learning to select optimal seismic acquisition layout. Then applies autoencoder for latent representation learning, followed by K-means clustering to hierarchically select the most relevant data for FWI.

Result: The approach consistently outperforms random data sampling, achieving effective 2D FWI results using only 10% of the original data. This demonstrates potential for accelerating large-scale 3D inversion.

Conclusion: The hierarchical data selection method enables significant computational cost reduction for FWI while maintaining accuracy, paving the way for practical large-scale 3D seismic inversion applications.

Abstract: We propose and test a method to reduce the dimensionality of Full Waveform Inversion (FWI) inputs as computational cost mitigation approach. Given modern seismic acquisition systems, the data (as input for FWI) required for an industrial-strength case is in the teraflop level of storage, therefore solving complex subsurface cases or exploring multiple scenarios with FWI become prohibitive. The proposed method utilizes a deep neural network with a binarized sensing layer that learns by compressed learning a succinct but consequential seismic acquisition layout from a large corpus of subsurface models. Thus, given a large seismic data set to invert, the trained network selects a smaller subset of the data, then by using representation learning, an autoencoder computes latent representations of the data, followed by K-means clustering of the latent representations to further select the most relevant data for FWI. Effectively, this approach can be seen as a hierarchical selection. The proposed approach consistently outperforms random data sampling, even when utilizing only 10% of the data for 2D FWI, these results pave the way to accelerating FWI in large scale 3D inversion.

[564] The Alchemy of Thought: Understanding In-Context Learning Through Supervised Classification

Harshita Narnoli, Mihai Surdeanu

Main category: cs.LG

TL;DR: ICL behaves similarly to supervised classifiers when demonstration relevance is high, closer to kNN than logistic regression, but outperforms them when relevance is low due to LLMs’ parametric memory.

Details

Motivation: Despite empirical evidence of ICL's usefulness, there's limited understanding of how it actually works. The paper aims to investigate whether LLMs with ICL behave similarly to supervised classifiers trained on the same demonstrations, and under what conditions.

Method: Compare ICL behavior with supervised classifiers (gradient descent-based logistic regression and k-nearest neighbors) trained on ICL demonstrations. Use text classification as a use case with six datasets and three LLMs to analyze three research questions about similarity between ICL and classifiers.

Result: LLMs behave similarly to classifiers when demonstration relevance is high, with ICL being closer to kNN than logistic regression. When demonstration relevance is low, LLMs perform better than classifiers due to their ability to rely on parametric memory.

Conclusion: ICL’s attention mechanism behaves more similarly to kNN than gradient descent, and LLMs’ advantage over simple classifiers comes from their ability to fall back on parametric knowledge when demonstrations are less relevant.

Abstract: In-context learning (ICL) has become a prominent paradigm to rapidly customize LLMs to new tasks without fine-tuning. However, despite the empirical evidence of its usefulness, we still do not truly understand how ICL works. In this paper, we compare the behavior of in-context learning with supervised classifiers trained on ICL demonstrations to investigate three research questions: (1) Do LLMs with ICL behave similarly to classifiers trained on the same examples? (2) If so, which classifiers are closer, those based on gradient descent (GD) or those based on k-nearest neighbors (kNN)? (3) When they do not behave similarly, what conditions are associated with differences in behavior? Using text classification as a use case, with six datasets and three LLMs, we observe that LLMs behave similarly to these classifiers when the relevance of demonstrations is high. On average, ICL is closer to kNN than logistic regression, giving empirical evidence that the attention mechanism behaves more similarly to kNN than GD. However, when demonstration relevance is low, LLMs perform better than these classifiers, likely because LLMs can back off to their parametric memory, a luxury these classifiers do not have.

[565] Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space

Changhoon Song, Seungchan Ko, Youngjoon Hong

Main category: cs.LG

TL;DR: The paper introduces a new log-weighted Barron space that requires weaker regularity assumptions than classical Barron spaces, and shows how deep ReLU networks can efficiently approximate functions in this space with explicit depth dependence.

Details

Motivation: Classical Barron space theory explains neural network approximation but requires stronger regularity than Sobolev spaces, and existing depth-sensitive results have restrictive constraints. The paper aims to develop a more realistic theory that better explains practical success of deep models on high-dimensional data.

Method: Introduces log-weighted Barron space ℬ^log with weaker assumptions than classical ℬ^s spaces. Studies embedding properties, statistical analysis via Rademacher complexity, and proves approximation bounds for deep ReLU networks with explicit depth dependence. Defines family ℬ^{s,log} and establishes H^1 norm approximation bounds.

Result: Shows functions in ℬ^log can be approximated by deep ReLU networks with explicit depth dependence. Identifies maximal depth scales preserving approximation rates. Demonstrates how depth reduces regularity requirements for efficient representation.

Conclusion: The new log-weighted Barron space provides a more precise explanation for deep architecture performance beyond classical Barron setting, offering theoretical insight into why deep networks work well on high-dimensional problems despite weaker regularity requirements.

Abstract: Universal approximation theorems show that neural networks can approximate any continuous function; however, the number of parameters may grow exponentially with the ambient dimension, so these results do not fully explain the practical success of deep models on high-dimensional data. Barron space theory addresses this: if a target function belongs to a Barron space, a two-layer network with $n$ parameters achieves an $O(n^{-1/2})$ approximation error in $L^2$. Yet classical Barron spaces $\mathscr{B}^{s+1}$ still require stronger regularity than Sobolev spaces $H^s$, and existing depth-sensitive results often assume constraints such as $sL \le 1/2$. In this paper, we introduce a log-weighted Barron space $\mathscr{B}^{\log}$, which requires a strictly weaker assumption than $\mathscr{B}^s$ for any $s>0$. For this new function space, we first study embedding properties and carry out a statistical analysis via the Rademacher complexity. Then we prove that functions in $\mathscr{B}^{\log}$ can be approximated by deep ReLU networks with explicit depth dependence. We then define a family $\mathscr{B}^{s,\log}$, establish approximation bounds in the $H^1$ norm, and identify maximal depth scales under which these rates are preserved. Our results clarify how depth reduces regularity requirements for efficient representation, offering a more precise explanation for the performance of deep architectures beyond the classical Barron setting, and for their stable use in high-dimensional problems used today.

[566] ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System

Anantha Sharma

Main category: cs.LG

TL;DR: Argus is a framework for detecting distributional drift in high-dimensional data streams using fixed spatial partitions (Voronoi tessellations) to track local statistics, achieving O(N) complexity while preserving geometric structure and providing spatial localization of changes.

Details

Motivation: Existing drift detection methods face three key challenges: global comparison methods scale poorly with high dimensions, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability. There's a need for a method that preserves high-dimensional structure while being computationally efficient.

Method: Argus reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. It uses Voronoi tessellations over canonical orthonormal frames, which provides drift metrics invariant to orthogonal transformations. The framework includes: 1) Voronoi tessellations for spatial partitioning, 2) O(N) complexity per snapshot, 3) graph-theoretic characterization of drift propagation, and 4) product quantization tessellation for scaling to very high dimensions (d>500) by decomposing space into independent subspaces.

Result: Theoretical foundations are formalized with proven invariance properties. Experimental validation shows the framework correctly identifies drift under coordinate rotation while existing methods produce false positives. The approach preserves high-dimensional structure without the computational burden of pairwise comparisons and provides cell-level spatial localization of distributional change.

Conclusion: Argus offers a principled geometric foundation for distribution monitoring that addresses the fundamental challenges of high-dimensional drift detection. It combines computational efficiency (O(N) complexity), preservation of geometric structure, invariance to orthogonal transformations, and scalability to very high dimensions through product quantization tessellation.

Abstract: Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability. This paper introduces Argus, A framework that reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. The key contributions are fourfold. First, it is proved that Voronoi tessellations over canonical orthonormal frames yield drift metrics that are invariant to orthogonal transformations. The rotations and reflections that preserve Euclidean geometry. Second, it is established that this framework achieves O(N) complexity per snapshot while providing cell-level spatial localization of distributional change. Third, a graph-theoretic characterization of drift propagation is developed that distinguishes coherent distributional shifts from isolated perturbations. Fourth, product quantization tessellation is introduced for scaling to very high dimensions (d>500) by decomposing the space into independent subspaces and aggregating drift signals across subspaces. This paper formalizes the theoretical foundations, proves invariance properties, and presents experimental validation demonstrating that the framework correctly identifies drift under coordinate rotation while existing methods produce false positives. The tessellated approach offers a principled geometric foundation for distribution monitoring that preserves high-dimensional structure without the computational burden of pairwise comparisons.

[567] Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training

John Zhao

Main category: cs.LG

TL;DR: Muon++: A variant of Muon optimizer that guarantees μP spectral conditions throughout training by controlling optimizer updates instead of explicit weight normalization, enabling practical μP-compatible scaling for LLMs.

Details

Motivation: Existing matrix-based optimizers like Muon don't reliably maintain μP spectral conditions throughout training, requiring repeated spectral normalization that causes computational overhead and reduces practicality.

Method: Develop Muon++ by showing that for moderately large models, maintaining spectral control at optimizer updates alone is sufficient to preserve μP-compatible scaling, eliminating need for explicit weight normalization.

Result: Muon++ satisfies μP spectral conditions throughout training, bridging gap between μP theory and practical deployment of matrix-based optimizers in long-horizon LLM training.

Conclusion: The work enables practical μP-compatible training with matrix-based optimizers and takes first step toward adaptive spectral conditions incorporating data-dependent effects for long-horizon LLM training.

Abstract: The $μ$-parameterization ($μ$P) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of $μ$P is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon is still under studied. Existing works that study Muon under $μ$P exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by $μ$P for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve $μ$P-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies spectral condition throughout the training process. Our results bridge the gap between the theoretical promises of $μ$P and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.

[568] Spectral-Window Hybrid (SWH)

Vladimer Khasia

Main category: cs.LG

TL;DR: SWH is a hybrid architecture that combines global spectral modeling with local window attention for efficient long-sequence processing, achieving Transformer-level performance with linear scaling.

Details

Motivation: Transformers have quadratic complexity that limits their application to long sequences, creating a need for architectures that balance computational efficiency with representational expressivity for extreme-context sequence modeling.

Method: Decouples sequence modeling into two parallel streams: (1) global branch using Convolution Theorem for long-range decay dynamics in O(T log T) time, and (2) local branch using sliding-window attention for token interactions within bounded context, then aggregates these representations.

Result: Matches perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences, avoiding computational bottleneck of global attention while retaining local precision.

Conclusion: SWH provides an effective solution for scaling sequence modeling to extreme contexts by combining the efficiency of spectral methods with the precision of local attention mechanisms.

Abstract: Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. While Transformers provide precise retrieval via the attention mechanism, their quadratic $\mathcal{O}(T^2)$ complexity limits their application to long-horizon tasks. In this work, we propose the \textbf{Spectral-Window Hybrid (SWH)}, an architecture that decouples sequence modeling into two \textit{parallel} streams: a global branch utilizing the Convolution Theorem to model long-range decay dynamics in $\mathcal{O}(T \log T)$ time, and a local branch employing sliding-window attention for token interactions within a bounded context. By aggregating these representations, SWH avoids the computational bottleneck of global attention while retaining local precision. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences. The code is available at https://github.com/VladimerKhasia/SWH

[569] From Classification to Generation: An Open-Ended Paradigm for Adverse Drug Reaction Prediction Based on Graph-Motif Feature Fusion

Yuyan Pi, Min Jin, Wentao Xie, Xinhua Liu

Main category: cs.LG

TL;DR: GM-MLG: Open-ended ADR prediction using graph-motif fusion and Transformer-based multi-label generation to overcome cold-start challenges and expand prediction space from 200 to 10,000+ ADR types.

Details

Motivation: Current ADR prediction methods face three main limitations: 1) cold-start challenge due to drug data scarcity, 2) closed label sets restricting prediction scope, and 3) inadequate modeling of label dependencies and co-occurrence relationships.

Method: Proposes GM-MLG with dual-graph representation spanning atomic, local molecular (using BRICS algorithm for motif extraction), and global molecular levels. Transforms ADR prediction from classification to Transformer Decoder-based multi-label generation, treating ADR labels as token sequences with positional embeddings to capture dependencies, using autoregressive decoding.

Result: Achieves up to 38% improvement with average 20% gain over baselines. Expands prediction space from 200 to over 10,000 ADR types. Provides interpretable structure-activity relationships through retrosynthetic motif analysis.

Conclusion: GM-MLG offers an open-ended paradigm for ADR prediction that overcomes cold-start challenges, captures label dependencies, and significantly expands prediction capabilities while providing interpretable insights for drug safety risk reduction.

Abstract: Computational biology offers immense potential for reducing the high costs and protracted cycles of new drug development through adverse drug reaction (ADR) prediction. However, current methods remain impeded by drug data scarcity-induced cold-start challenge, closed label sets, and inadequate modeling of label dependencies. Here we propose an open-ended ADR prediction paradigm based on Graph-Motif feature fusion and Multi-Label Generation (GM-MLG). Leveraging molecular structure as an intrinsic and inherent feature, GM-MLG constructs a dual-graph representation architecture spanning the atomic level, the local molecular level (utilizing fine-grained motifs dynamically extracted via the BRICS algorithm combined with additional fragmentation rules), and the global molecular level. Uniquely, GM-MLG pioneers transforming ADR prediction from multi-label classification into Transformer Decoder-based multi-label generation. By treating ADR labels as discrete token sequences, it employs positional embeddings to explicitly capture dependencies and co-occurrence relationships within large-scale label spaces, generating predictions via autoregressive decoding to dynamically expand the prediction space. Experiments demonstrate GM-MLG achieves up to 38% improvement and an average gain of 20%, expanding the prediction space from 200 to over 10,000 types. Furthermore, it elucidates non-linear structure-activity relationships between ADRs and motifs via retrosynthetic motif analysis, providing interpretable and innovative support for systematic risk reduction in drug safety.

[570] Towards LLM-enabled autonomous combustion research: A literature-aware agent for self-corrective modeling workflows

Ke Xiao, Haoze Zhang, Runze Mao, Han Li, Zhi X. Chen

Main category: cs.LG

TL;DR: FlamePilot is an LLM agent that automates combustion modeling workflows by integrating domain knowledge from scientific literature with robust CFD simulation execution, achieving state-of-the-art performance in autonomous research assistance.

Details

Motivation: Current LLMs lack integration with domain-specific scientific tools like CFD codes for complex fields like combustion modeling, creating a gap between AI's potential as research partners and practical application in expertise-intensive domains.

Method: FlamePilot uses an architecture with atomic tools for robust CFD simulation setup/execution in OpenFOAM and DeepFlame, learns from scientific articles to extract key information, and implements self-corrective workflows with transparent, interpretable operations.

Result: Achieved perfect 1.0 executability score and 0.438 success rate on public benchmark, surpassing prior best agent scores of 0.625 and 0.250. Successfully demonstrated autonomous translation of research papers into configured simulations, execution, post-processing, and parameter studies for MILD combustion.

Conclusion: FlamePilot establishes a foundational framework for AI-empowered combustion modeling, enabling collaborative partnerships where AI manages workflow orchestration while researchers focus on high-level analysis, bridging the gap between AI capabilities and practical scientific research needs.

Abstract: The rapid evolution of large language models (LLMs) is transforming artificial intelligence into autonomous research partners, yet a critical gap persists in complex scientific domains such as combustion modeling. Here, practical AI assistance requires the seamless integration of domain literature knowledge with robust execution capabilities for expertise-intensive tools such as computational fluid dynamics (CFD) codes. To bridge this gap, we introduce FlamePilot, an LLM agent designed to empower combustion modeling research through automated and self-corrective CFD workflows. FlamePilot differentiates itself through an architecture that leverages atomic tools to ensure the robust setup and execution of complex simulations in both OpenFOAM and extended frameworks such as DeepFlame. The system is also capable of learning from scientific articles, extracting key information to guide the simulation from initial setup to optimized results. Validation on a public benchmark shows FlamePilot achieved a perfect 1.0 executability score and a 0.438 success rate, surpassing the prior best reported agent scores of 0.625 and 0.250, respectively. Furthermore, a detailed case study on Moderate or Intense Low-oxygen Dilution (MILD) combustion simulation demonstrates its efficacy as a collaborative research copilot, where FlamePilot autonomously translated a research paper into a configured simulation, conducted the simulation, post-processed the results, proposed evidence-based refinements, and managed a multi-step parameter study to convergence under minimal human intervention. By adopting a transparent and interpretable paradigm, FlamePilot establishes a foundational framework for AI-empowered combustion modeling, fostering a collaborative partnership where the agent manages workflow orchestration, freeing the researcher for high-level analysis.

[571] Causal discovery for linear causal model with correlated noise: an Adversarial Learning Approach

Mujin Zhou, Junzhe Zhang

Main category: cs.LG

TL;DR: Proposes f-GAN-based causal discovery method that learns binary causal structure independent of specific weights, reformulating structure learning as minimizing Bayesian free energy equivalent to f-divergence minimization.

Details

Motivation: Causal discovery from data with unmeasured confounding factors is challenging. Existing methods often rely on specific weight values or have limitations in handling unmeasured confounding.

Method: Reformulates structure learning as minimizing Bayesian free energy, proves equivalence to f-divergence minimization, uses f-GAN framework for min-max adversarial optimization, implements gradient search in discrete graph space with Gumbel-Softmax relaxation.

Result: Theoretical framework established showing equivalence between Bayesian free energy minimization and f-divergence minimization. Method enables learning causal structure independent of specific weight values.

Conclusion: Proposed f-GAN-based approach provides a principled framework for causal discovery with unmeasured confounding, enabling gradient-based optimization in discrete graph space through Gumbel-Softmax relaxation.

Abstract: Causal discovery from data with unmeasured confounding factors is a challenging problem. This paper proposes an approach based on the f-GAN framework, learning the binary causal structure independent of specific weight values. We reformulate the structure learning problem as minimizing Bayesian free energy and prove that this problem is equivalent to minimizing the f-divergence between the true data distribution and the model-generated distribution. Using the f-GAN framework, we transform this objective into a min-max adversarial optimization problem. We implement the gradient search in the discrete graph space using Gumbel-Softmax relaxation.

[572] Data Complexity-aware Deep Model Performance Forecasting

Yen-Chia Chen, Hsing-Kuo Pao, Hanjuan Huang

Main category: cs.LG

TL;DR: A lightweight two-stage framework predicts deep learning model performance before training using dataset properties and model architecture details, enabling performance forecasting, architecture guidance, and data quality assessment.

Details

Motivation: Current model selection relies on time-consuming trial-and-error procedures that are resource-intensive and difficult to automate. Existing performance prediction methods require significant computational overhead or lack generalizability.

Method: A two-stage framework: Stage 1 predicts baseline performance using measurable dataset properties; Stage 2 adjusts estimation with model architectural and hyperparameter details. Uses features like dataset variance for prediction.

Result: The framework generalizes across datasets and model types. Dataset features used for prediction (like dataset variance) provide practical guidance for model selection and serve as early indicators of data quality.

Conclusion: The framework enables forecasting model performance before training, guiding architecture choices, informing preprocessing procedures, and detecting problematic datasets early, reducing trial-and-error overhead in model induction.

Abstract: Deep learning models are widely used across computer vision and other domains. When working on the model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure is time-consuming, resource-intensive, and difficult to automate. While previous work has explored performance prediction using partial training or complex simulations, these methods often require significant computational overhead or lack generalizability. In this work, we propose an alternative approach: a lightweight, two-stage framework that can estimate model performance before training given the understanding of the dataset and the focused deep model structures. The first stage predicts a baseline based on the analysis of some measurable properties of the dataset, while the second stage adjusts the estimation with additional information on the model’s architectural and hyperparameter details. The setup allows the framework to generalize across datasets and model types. Moreover, we find that some of the underlying features used for prediction - such as dataset variance - can offer practical guidance for model selection, and can serve as early indicators of data quality. As a result, the framework can be used not only to forecast model performance, but also to guide architecture choices, inform necessary preprocessing procedures, and detect potentially problematic datasets before training begins.

[573] Scale-Adaptive Power Flow Analysis with Local Topology Slicing and Multi-Task Graph Learning

Yongzhe Li, Lin Guan, Zihan Cai, Zuxian Lin, Jiyu Huang, Liukai Chen

Main category: cs.LG

TL;DR: SaMPFA framework enhances power flow analysis with scale adaptability using local topology slicing and multi-task graph learning to predict voltages and powers directly, improving robustness and physical consistency.

Details

Motivation: Need for deep learning models with strong adaptability to topological variations in power flow analysis, especially for variable system scales and robust branch power prediction.

Method: Proposes SaMPFA framework with: 1) Local Topology Slicing (LTS) sampling to extract subgraphs of different scales for cross-scale learning, 2) Reference-free Multi-task Graph Learning (RMGL) model that predicts bus voltages and branch powers (instead of phase angles) to avoid error amplification and learn physical relationships, 3) Loss function with extra terms to capture physical patterns of angle differences and power transmission.

Result: Superior adaptability and generalization under variable system scales on IEEE 39-bus system and real provincial grid in China, with accuracy improvements of 4.47% and 36.82% respectively.

Conclusion: The proposed SaMPFA framework effectively enhances model performance for power flow analysis under topological variations, achieving better scale adaptability and physical consistency than existing approaches.

Abstract: Developing deep learning models with strong adaptability to topological variations is of great practical significance for power flow analysis. To enhance model performance under variable system scales and improve robustness in branch power prediction, this paper proposes a Scale-adaptive Multi-task Power Flow Analysis (SaMPFA) framework. SaMPFA introduces a Local Topology Slicing (LTS) sampling technique that extracts subgraphs of different scales from the complete power network to strengthen the model’s cross-scale learning capability. Furthermore, a Reference-free Multi-task Graph Learning (RMGL) model is designed for robust power flow prediction. Unlike existing approaches, RMGL predicts bus voltages and branch powers instead of phase angles. This design not only avoids the risk of error amplification in branch power calculation but also guides the model to learn the physical relationships of phase angle differences. In addition, the loss function incorporates extra terms that encourage the model to capture the physical patterns of angle differences and power transmission, further improving consistency between predictions and physical laws. Simulations on the IEEE 39-bus system and a real provincial grid in China demonstrate that the proposed model achieves superior adaptability and generalization under variable system scales, with accuracy improvements of 4.47% and 36.82%, respectively.

[574] A Graph-based Framework for Online Time Series Anomaly Detection Using Model Ensemble

Zewei Yu, Jianqiu Xu, Caimin Li

Main category: cs.LG

TL;DR: GDME is an unsupervised graph-based framework for online time series anomaly detection that uses dynamic model ensembles and community detection to adapt to evolving streaming data patterns.

Details

Motivation: With increasing streaming data in industrial systems, existing anomaly detection methods struggle with heterogeneous, rapidly evolving data patterns in online settings. Most methods are designed for offline use or can't effectively handle diverse streaming data.

Method: GDME maintains a dynamic model pool that’s continuously updated by pruning underperforming models and adding new ones. It uses a dynamic graph structure to represent model relationships, employs community detection to select ensemble subsets, and monitors graph structural changes to detect concept drift for adaptation.

Result: Experiments on seven heterogeneous time series show GDME outperforms existing online anomaly detection methods by up to 24%. Its ensemble strategy provides superior detection performance compared to individual models and average ensembles, with competitive computational efficiency.

Conclusion: GDME is an effective unsupervised framework for online time series anomaly detection that adapts to evolving streaming data through dynamic model ensembles and graph-based community detection, offering significant performance improvements over existing methods.

Abstract: With the increasing volume of streaming data in industrial systems, online anomaly detection has become a critical task. The diverse and rapidly evolving data patterns pose significant challenges for online anomaly detection. Many existing anomaly detection methods are designed for offline settings or have difficulty in handling heterogeneous streaming data effectively. This paper proposes GDME, an unsupervised graph-based framework for online time series anomaly detection using model ensemble. GDME maintains a dynamic model pool that is continuously updated by pruning underperforming models and introducing new ones. It utilizes a dynamic graph structure to represent relationships among models and employs community detection on the graph to select an appropriate subset for ensemble. The graph structure is also used to detect concept drift by monitoring structural changes, allowing the framework to adapt to evolving streaming data. Experiments on seven heterogeneous time series demonstrate that GDME outperforms existing online anomaly detection methods, achieving improvements of up to 24%. In addition, its ensemble strategy provides superior detection performance compared with both individual models and average ensembles, with competitive computational efficiency.

[575] A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory

Itay Safran

Main category: cs.LG

TL;DR: First unconditional super-linear lower bound for computing maximum function with ReLU networks: width Ω(d^(1+1/(2^(k-2)-1))) needed for depth 3≤k≤log₂(log₂(d)).

Details

Motivation: Understanding the inherent complexity of fundamental operations like maximum function in neural networks, and establishing lower bounds for deep networks.

Method: Combinatorial argument associating non-differentiable ridges of maximum with cliques in graph induced by first hidden layer, using Turán’s theorem from extremal graph theory.

Result: Proved depth hierarchy: width Ω(d^(1+1/(2^(k-2)-1))) necessary to represent maximum function for depth 3≤k≤log₂(log₂(d)), even if depth scales with d.

Conclusion: Maximum function has inherent complexity from geometric structure of its non-differentiable hyperplanes; provides novel approach for proving lower bounds for deep neural networks.

Abstract: We consider the problem of exact computation of the maximum function over $d$ real inputs using ReLU neural networks. We prove a depth hierarchy, wherein width $Ω\big(d^{1+\frac{1}{2^{k-2}-1}}\big)$ is necessary to represent the maximum for any depth $3\le k\le \log_2(\log_2(d))$. This is the first unconditional super-linear lower bound for this fundamental operator at depths $k\ge3$, and it holds even if the depth scales with $d$. Our proof technique is based on a combinatorial argument and associates the non-differentiable ridges of the maximum with cliques in a graph induced by the first hidden layer of the computing network, utilizing Turán’s theorem from extremal graph theory to show that a sufficiently narrow network cannot capture the non-linearities of the maximum. This suggests that despite its simple nature, the maximum function possesses an inherent complexity that stems from the geometric structure of its non-differentiable hyperplanes, and provides a novel approach for proving lower bounds for deep neural networks.

[576] Unveiling the Heart-Brain Connection: An Analysis of ECG in Cognitive Performance

Akshay Sasi, Malavika Pradeep, Nusaibah Farrukh, Rahul Venugopal, Elizabeth Sherly

Main category: cs.LG

TL;DR: ECG signals can effectively reflect cognitive load and serve as proxies for EEG-based indicators using a cross-modal XGBoost framework that projects ECG features onto EEG-representative cognitive spaces.

Details

Motivation: EEG is the gold standard for mental workload assessment but has limited portability for real-world use. ECG from wearable devices offers a more practical alternative for everyday cognitive monitoring.

Method: Collected multimodal data from working-memory and passive-listening tasks. Extracted ECG time-domain HRV metrics and Catch22 descriptors, and EEG spectral/Catch22 features. Developed a cross-modal XGBoost framework to project ECG features onto EEG-representative cognitive spaces for workload inference.

Result: ECG-derived projections effectively capture cognitive state variations and support accurate classification. The approach demonstrates ECG can serve as a reliable proxy for EEG-based cognitive load indicators.

Conclusion: ECG provides an interpretable, real-time, wearable solution for everyday cognitive monitoring, offering a practical alternative to EEG for assessing mental workload in real-world settings.

Abstract: Understanding the interaction of neural and cardiac systems during cognitive activity is critical to advancing physiological computing. Although EEG has been the gold standard for assessing mental workload, its limited portability restricts its real-world use. Widely available ECG through wearable devices proposes a pragmatic alternative. This research investigates whether ECG signals can reliably reflect cognitive load and serve as proxies for EEG-based indicators. In this work, we present multimodal data acquired from two different paradigms involving working-memory and passive-listening tasks. For each modality, we extracted ECG time-domain HRV metrics and Catch22 descriptors against EEG spectral and Catch22 features, respectively. We propose a cross-modal XGBoost framework to project the ECG features onto EEG-representative cognitive spaces, thereby allowing workload inferences using only ECG. Our results show that ECG-derived projections expressively capture variation in cognitive states and provide good support for accurate classification. Our findings underpin ECG as an interpretable, real-time, wearable solution for everyday cognitive monitoring.

[577] Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models

Jian Feng, Zhihong Huang

Main category: cs.LG

TL;DR: BSZO introduces Bayesian subspace zeroth-order optimization that uses Kalman filtering to combine gradient information across multiple perturbation directions, improving convergence rates while maintaining low memory usage comparable to inference-only baselines.

Details

Motivation: Existing zeroth-order optimization methods for fine-tuning LLMs rely on one-step gradient estimates from random perturbations, which limits their efficiency and convergence performance.

Method: BSZO applies Kalman filtering to combine finite-difference information across multiple perturbation directions, treating each measurement as noisy observation to build posterior distribution over projected gradient, with residual-based adaptive mechanism for perturbation scales.

Result: Theoretical analysis shows BSZO improves convergence rate by factor of k/γ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show BSZO outperforms MeZO, MeZO-Adam, and HiZOO across tasks, achieving up to 6.67% absolute average improvement on OPT-13B while keeping memory usage at 1.00×-1.08× of MeZO.

Conclusion: BSZO provides an effective Bayesian approach to zeroth-order optimization that significantly improves performance while maintaining the memory efficiency advantages of ZO methods for LLM fine-tuning.

Abstract: Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations, but existing methods rely on one-step gradient estimates from random perturbations. We introduce Bayesian Subspace Zeroth-Order optimization (BSZO), a ZO optimizer that applies Kalman filtering to combine finite-difference information across multiple perturbation directions. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adjust perturbation scales. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/γ$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms MeZO, MeZO-Adam, and HiZOO across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while keeping memory usage close to inference-only baselines (1.00$\times$–1.08$\times$ of MeZO).

[578] Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD

Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao

Main category: cs.LG

TL;DR: The paper proposes a new information-theoretic generalization bound that better captures SGD’s flatness bias, showing improved generalization under better flatness and tighter numerical bounds.

Details

Motivation: Existing information-theoretic generalization bounds fail to capture SGD's improved generalization under better flatness (flatness bias) and are numerically loose, due to inadequate leverage of SGD's flatness bias in current bounds.

Method: Derives a new information-theoretic bound that better leverages SGD’s flatness bias using a flexible technique called “omniscient trajectory”. The bound connects generalization to the relationship between weight covariance directions and local curvatures in the loss landscape.

Result: The new bound correctly reflects better generalization when flatness is improved, is numerically much tighter on deep neural networks, improves representative IT bounds’ rates from Ω(1) to O(1/√n) for Gradient Descent, and implies a bypass of memorization-generalization trade-offs.

Conclusion: The proposed flatness-leveraging information-theoretic bound successfully captures SGD’s flatness bias, providing both theoretical and empirical improvements over existing bounds, with implications for understanding generalization in deep learning.

Abstract: Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD’s generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD’s flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved by a flexible technique called “omniscient trajectory”. When applied to Gradient Descent’s minimax excess risk on convex-Lipschitz-Bounded problems, it improves representative IT bounds’ $Ω(1)$ rates to $O(1/\sqrt{n})$. It also implies a by-pass of memorization-generalization trade-offs.

[579] Accelerating Storage-Based Training for Graph Neural Networks

Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim

Main category: cs.LG

TL;DR: AGNES is a storage-based GNN training framework that addresses I/O bottlenecks by using block-wise storage I/O processing and hyperbatch-based processing to handle web-scale graphs efficiently.

Details

Motivation: Storage-based GNN training methods face severe bottlenecks in data preparation due to handling large numbers of small storage I/Os when dealing with web-scale graphs on single machines with external storage like NVMe SSDs.

Method: AGNES employs block-wise storage I/O processing to fully utilize I/O bandwidth of high-performance storage devices, and hyperbatch-based processing that leverages characteristics of real-world graphs to enhance storage I/O efficiency.

Result: Comprehensive experiments on five real-world graphs show AGNES consistently outperforms four state-of-the-art methods, achieving up to 4.1× faster training than the best competitor.

Conclusion: AGNES effectively addresses the I/O bottleneck in storage-based GNN training through optimized block-wise and hyperbatch-based processing, enabling efficient training of web-scale graphs on single machines.

Abstract: Graph neural networks (GNNs) have achieved breakthroughs in various real-world downstream tasks due to their powerful expressiveness. As the scale of real-world graphs has been continuously growing, \textit{a storage-based approach to GNN training} has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web-scale graphs on a single machine. Although such storage-based GNN training methods have shown promising potential in large-scale GNN training, we observed that they suffer from a severe bottleneck in data preparation since they overlook a critical challenge: \textit{how to handle a large number of small storage I/Os}. To address the challenge, in this paper, we propose a novel storage-based GNN training framework, named \textsf{AGNES}, that employs a method of \textit{block-wise storage I/O processing} to fully utilize the I/O bandwidth of high-performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, \textsf{AGNES} employs a simple yet effective strategy, \textit{hyperbatch-based processing} based on the characteristics of real-world graphs. Comprehensive experiments on five real-world graphs reveal that \textsf{AGNES} consistently outperforms four state-of-the-art methods, by up to 4.1$\times$ faster than the best competitor. Our code is available at https://github.com/Bigdasgit/agnes-kdd26.

Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

Main category: cs.LG

TL;DR: Proposes MoLR-MoG modeling for diffusion models to escape the curse of dimensionality by capturing multi-modal latent structure, achieving better generation with fewer parameters and faster optimization.

Details

Motivation: Current diffusion models suffer from curse of dimensionality (n^{-1/D} error) and Gaussian latent fails to capture multi-modal properties of real-world data manifolds. Need better modeling that reflects both multi-manifold and multi-modal nature of images.

Method: Proposes Mixture of Low-Rank Mixture of Gaussian (MoLR-MoG) modeling: data as union of K linear subspaces, each with mixture of Gaussian latent (n_k modals, d_k dimension). Score function has mixture of expert (MoE) structure to capture multi-modal information and nonlinear properties.

Result: MoE-latent MoG NN outperforms MoE-latent Gaussian score, achieves comparable performance to MoE-latent Unet with 10× fewer parameters. Provides estimation error bound R^4√(Σn_k)√(Σn_kd_k)/√n that escapes dimensionality curse. Proves convergence guarantee for optimization process.

Conclusion: MoLR-MoG modeling explains why diffusion models work well with small datasets and fast optimization - by properly capturing data structure, it escapes dimensionality curse while maintaining strong generation performance with efficient parameterization.

Abstract: Recently, diffusion models have achieved a great performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality $n^{-1/D}$ with the data dimension $D$. Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latent and achieve a $1/\sqrt{n}$ bound. Though this modeling reflects the multi-manifold property, the Gaussian latent can not capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, and each subspace admits a mixture of Gaussian latent ($n_k$ modals with dimension $d_k$). With this modeling, the corresponding score function naturally has a mixture of expert (MoE) structure, captures the multi-modal information, and contains nonlinear property. We first conduct real-world experiments to show that the generation results of MoE-latent MoG NN are much better than MoE-latent Gaussian score. Furthermore, MoE-latent MoG NN achieves a comparable performance with MoE-latent Unet with $10 \times$ parameters. These results indicate that the MoLR-MoG modeling is reasonable and suitable for real-world data. After that, based on such MoE-latent MoG score, we provide a $R^4\sqrt{Σ_{k=1}^Kn_k}\sqrt{Σ_{k=1}^Kn_kd_k}/\sqrt{n}$ estimation error, which escapes the curse of dimensionality by using data structure. Finally, we study the optimization process and prove the convergence guarantee under the MoLR-MoG modeling. Combined with these results, under a setting close to real-world data, this work explains why diffusion models only require a small training sample and enjoy a fast optimization process to achieve a great performance.

[581] SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

Itai Morad, Nir Shlezinger, Yonina C. Eldar

Main category: cs.LG

TL;DR: Bayesian analysis of Knowledge Distillation shows that using Bayesian Class Probabilities as teacher outputs reduces variance and improves student convergence compared to one-hot labels, with Bayesian teachers providing better performance.

Details

Motivation: Knowledge Distillation has strong empirical success but lacks theoretical understanding. The paper aims to provide rigorous theoretical analysis of KD convergence behavior from a Bayesian perspective.

Method: Adopts Bayesian perspective to analyze student convergence with SGD. Studies two regimes: (1) exact Bayes Class Probabilities from teacher, (2) noisy approximations of BCPs. Uses theoretical analysis and experimental validation with Bayesian deep learning models as teachers.

Result: Learning from BCPs yields variance reduction and removes neighborhood terms in convergence bounds. Bayesian teachers provide better BCP estimates, leading to students with higher accuracies (up to +4.27%) and more stable convergence (up to 30% less noise) compared to deterministic teachers.

Conclusion: Bayesian perspective provides theoretical foundation for KD, showing benefits of using Bayesian teachers. Advocates for Bayesian deep learning models as teachers in KD due to improved BCP estimates and better student performance.

Abstract: Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ when the teacher provides the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.

[582] Entropy-Aligned Decoding of LMs for Better Writing and Reasoning

Kareem Ahmed, Sameer Singh

Main category: cs.LG

TL;DR: EPIC is a hyperparameter-free decoding method that incorporates future trajectory entropy into language model sampling, aligning generation uncertainty with data uncertainty to produce higher quality, more diverse outputs.

Details

Motivation: Current LM decoding algorithms use greedy heuristics that introduce myopic distortions, resulting in homogeneous, repetitive, and incoherent generations. There's a need for decoding methods that better align with the true language distribution.

Method: EPIC uses Entropy-Aware Lazy Gumbel-Max sampling to incorporate future trajectory entropy into LM decoding. It explicitly regulates uncertainty at each generation step by aligning sampling distribution entropy with aleatoric (data) uncertainty, requiring only sublinear entropy evaluations per step.

Result: EPIC consistently improves LM-as-judge preference win-rates over widely used decoding strategies in creative writing and summarization. It produces more diverse generations and more faithful summaries, and outperforms all baselines on mathematical reasoning tasks.

Conclusion: EPIC provides an effective, hyperparameter-free decoding approach that better aligns LM sampling with the true data distribution, addressing limitations of current greedy decoding methods and improving generation quality across multiple domains.

Abstract: Language models (LMs) are trained on billions of tokens in an attempt to recover the true language distribution. Still, vanilla random sampling from LMs yields low quality generations. Decoding algorithms attempt to restrict the LM distribution to a set of high-probability continuations, but rely on greedy heuristics that introduce myopic distortions, yielding sentences that are homogeneous, repetitive and incoherent. In this paper, we introduce EPIC, a hyperparameter-free decoding approach that incorporates the entropy of future trajectories into LM decoding. EPIC explicitly regulates the amount of uncertainty expressed at every step of generation, aligning the sampling distribution’s entropy to the aleatoric (data) uncertainty. Through Entropy-Aware Lazy Gumbel-Max sampling, EPIC manages to be exact, while also being efficient, requiring only a sublinear number of entropy evaluations per step. Unlike current baselines, EPIC yields sampling distributions that are empirically well-aligned with the entropy of the underlying data distribution. Across creative writing and summarization tasks, EPIC consistently improves LM-as-judge preference win-rates over widely used decoding strategies. These preference gains are complemented by automatic metrics, showing that EPIC produces more diverse generations and more faithful summaries. We also evaluate EPIC on mathematical reasoning, where it outperforms all baselines.

[583] Accelerating Decentralized Optimization via Overlapping Local Steps

Yijie Zhou, Shi Pu

Main category: cs.LG

TL;DR: OLDSGD is a decentralized optimization method that overlaps computation and communication to reduce network idle time while preserving the same average update as Local SGD.

Details

Motivation: Decentralized optimization enables scalable distributed learning with data privacy, but existing methods suffer from communication bottlenecks due to frequent synchronization between nodes.

Method: Overlapping Local Decentralized SGD (OLDSGD) uses computation-communication overlapping with a deliberately designed update to avoid communication-induced stalls while preserving the same average update as Local SGD.

Result: Theoretically, OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirically, it shows consistent improvements in wall-clock time convergence under different communication delays.

Conclusion: OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees, requiring minimal modifications to existing frameworks.

Abstract: Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication bottlenecks due to frequent synchronization between nodes. We present Overlapping Local Decentralized SGD (OLDSGD), a novel approach to accelerate decentralized training by computation-communication overlapping, significantly reducing network idle time. With a deliberately designed update, OLDSGD preserves the same average update as Local SGD while avoiding communication-induced stalls. Theoretically, we establish non-asymptotic convergence rates for smooth non-convex objectives, showing that OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirical results demonstrate OLDSGD’s consistent improvements in wall-clock time convergence under different levels of communication delays. With minimal modifications to existing frameworks, OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees.

[584] Advanced Global Wildfire Activity Modeling with Hierarchical Graph ODE

Fan Xu, Wei Gong, Hao Wu, Lilan Peng, Nan Wang, Qingsong Wen, Xian Wu, Kun Wang, Xibin Zhao

Main category: cs.LG

TL;DR: HiGO is a novel deep learning framework for global wildfire forecasting that uses hierarchical graph neural networks with Neural ODEs to model multi-scale, continuous-time wildfire dynamics, outperforming existing methods on long-range predictions.

Details

Motivation: Wildfires are complex Earth system phenomena governed by multi-scale atmospheric, oceanic, and terrestrial processes. While deep learning has advanced global weather forecasting, its application to global wildfire behavior prediction remains underexplored despite being critical for understanding and managing wildfire risks.

Method: HiGO represents the Earth system as a multi-level graph hierarchy and uses adaptive filtering message passing for intra- and inter-level information flow. It incorporates GNN-parameterized Neural ODE modules at multiple levels to explicitly learn continuous dynamics at each scale, enabling effective feature extraction and fusion across spatiotemporal scales.

Result: HiGO significantly outperforms state-of-the-art baselines on long-range wildfire forecasting using the SeasFire Cube dataset. The continuous-time predictions demonstrate strong observational consistency, highlighting the framework’s potential for real-world applications.

Conclusion: The HiGO framework successfully addresses the challenge of modeling global wildfire activity by capturing multi-scale, continuous-time dynamics through hierarchical graph representations and Neural ODEs, offering a promising approach for improving wildfire forecasting capabilities.

Abstract: Wildfires, as an integral component of the Earth system, are governed by a complex interplay of atmospheric, oceanic, and terrestrial processes spanning a vast range of spatiotemporal scales. Modeling their global activity on large timescales is therefore a critical yet challenging task. While deep learning has recently achieved significant breakthroughs in global weather forecasting, its potential for global wildfire behavior prediction remains underexplored. In this work, we reframe this problem and introduce the Hierarchical Graph ODE (HiGO), a novel framework designed to learn the multi-scale, continuous-time dynamics of wildfires. Specifically, we represent the Earth system as a multi-level graph hierarchy and propose an adaptive filtering message passing mechanism for both intra- and inter-level information flow, enabling more effective feature extraction and fusion. Furthermore, we incorporate GNN-parameterized Neural ODE modules at multiple levels to explicitly learn the continuous dynamics inherent to each scale. Through extensive experiments on the SeasFire Cube dataset, we demonstrate that HiGO significantly outperforms state-of-the-art baselines on long-range wildfire forecasting. Moreover, its continuous-time predictions exhibit strong observational consistency, highlighting its potential for real-world applications.

[585] Context-Free Recognition with Transformers

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

Main category: cs.LG

TL;DR: Transformers with O(log n) looping layers and O(n^6) padding tokens can recognize all context-free languages, but O(n^3) padding suffices for unambiguous CFLs.

Details

Motivation: Transformers excel at processing well-formed inputs like natural language and code, but it's unclear how they process grammatical syntax. Prior work showed transformers can't recognize context-free languages (CFLs) under standard complexity conjectures, and while O(log n) looping enables regular language recognition, CFL recognition remained open.

Method: The paper proposes looped transformers with O(log n) looping layers and padding tokens. For general CFL recognition, O(n^6) padding tokens are required. For unambiguous CFLs, the method becomes more tractable with O(n^3) padding tokens.

Result: Theoretical results show transformers with O(log n) looping layers and O(n^6) padding can recognize all CFLs. For unambiguous CFLs, only O(n^3) padding is needed. Empirical validation shows looping helps on languages that provably require logarithmic depth.

Conclusion: While general CFL recognition may require intractable O(n^6) padding, natural constraints like unambiguity yield efficient O(n^3) padding algorithms. The results clarify the complexity of CFL recognition by transformers, showing practical recognition is possible for important subclasses.

Abstract: Transformers excel on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs (Merrill et al., 2022). Merrill & Sabharwal (2024) show that $\mathcal{O}(\log n)$ looping layers (w.r.t. input length $n$) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $\mathcal{O}(\log n)$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

[586] Utilizing Earth Foundation Models to Enhance the Simulation Performance of Hydrological Models with AlphaEarth Embeddings

Pengfei Qu, Wenyu Ouyang, Chi Zhang, Yikai Chai, Shuolong Xu, Lei Ye, Yongri Piao, Miao Zhang, Huchuan Lu

Main category: cs.LG

TL;DR: Satellite image embeddings outperform traditional basin attributes for predicting river flow in ungauged basins by better capturing environmental complexity and improving donor basin selection.

Details

Motivation: Traditional basin attributes fail to fully represent the complex interactions between climate, terrain, vegetation, and soils that affect river flow, making predictions in ungauged basins challenging. There's a need for more informative representations of basin characteristics.

Method: Use AlphaEarth Foundation embeddings learned from large satellite image collections to describe basin characteristics. These embeddings capture vegetation patterns, land surface properties, and long-term environmental dynamics. Compare models using these embeddings against traditional basin attributes for predicting river flow in ungauged basins.

Result: Models using satellite embeddings achieve higher accuracy for predicting flows in basins not used for training. Embedding-based similarity helps identify appropriate donor basins with comparable environmental and hydrological behavior, improving performance, while adding dissimilar basins reduces accuracy.

Conclusion: Satellite-informed environmental representations can strengthen hydrological forecasting and support the development of more adaptable models that work across different landscapes by better capturing key physical differences between basins.

Abstract: Predicting river flow in places without streamflow records is challenging because basins respond differently to climate, terrain, vegetation, and soils. Traditional basin attributes describe some of these differences, but they cannot fully represent the complexity of natural environments. This study examines whether AlphaEarth Foundation embeddings, which are learned from large collections of satellite images rather than designed by experts, offer a more informative way to describe basin characteristics. These embeddings summarize patterns in vegetation, land surface properties, and long-term environmental dynamics. We find that models using them achieve higher accuracy when predicting flows in basins not used for training, suggesting that they capture key physical differences more effectively than traditional attributes. We further investigate how selecting appropriate donor basins influences prediction in ungauged regions. Similarity based on the embeddings helps identify basins with comparable environmental and hydrological behavior, improving performance, whereas adding many dissimilar basins can reduce accuracy. The results show that satellite-informed environmental representations can strengthen hydrological forecasting and support the development of models that adapt more easily to different landscapes.

[587] The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Zibo Zhao, Yuanting Zha, Haipeng Zhang, Xingcheng Xu

Main category: cs.LG

TL;DR: RL post-training enables self-reflection in LLMs through balanced gradient attribution across policy components, explaining why RL outperforms SFT in self-correction tasks.

Details

Motivation: The paper aims to understand the opaque mechanism behind how unified RL optimization objectives give rise to functionally distinct capabilities in LLMs - generating solutions and evaluating when to revise them (self-reflection).

Method: Introduces Gradient Attribution Property to characterize reward gradient distribution across policy components, formalized through Two-Stage Decision-Sampling Hypothesis that decomposes policy into sampling (π_sample) for generation and decision (π_d) for verification. Theoretically analyzes different training regimes and empirically validates on arithmetic reasoning tasks.

Result: Proves that surrogate rewards exhibit Balanced Gradient Attribution while SFT and KL penalties exhibit Unbalanced Gradient Attribution. Length-weighting creates asymmetric regularization that constrains π_sample while leaving π_d under-optimized. Empirical validation shows RL’s superior generalization stems primarily from improved decision-making (π_d) rather than sampling capabilities.

Conclusion: Provides a first-principles mechanistic explanation for self-correction in thinking models, explaining why RL succeeds where SFT fails through balanced gradient attribution across policy components that enables proper optimization of both generation and verification capabilities.

Abstract: Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism of how a unified optimization objective gives rise to functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ($π_{sample}$) for generation and decision ($π_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains $π_{sample}$ while leaving $π_{d}$ under-optimized, providing an theoretical explanation of why RL succeeds where SFT fails. We also empirically validate our theoretical predictions on arithmetic reasoning demonstrates that RL’s superior generalization stems primarily from improved decision-making ($π_{d}$) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.

[588] REE-TTT: Highly Adaptive Radar Echo Extrapolation Based on Test-Time Training

Xin Di, Xinglin Piao, Fei Wang, Guodong Jing, Yong Zhang

Main category: cs.LG

TL;DR: REE-TTT introduces a Test-Time Training mechanism with spatio-temporal attention blocks to improve precipitation nowcasting generalization across regions and extreme events.

Details

Motivation: Current deep learning-based Radar Echo Extrapolation (REE) methods suffer from poor generalization due to reliance on local training data and static parameters, limiting applicability across diverse regions and extreme precipitation events.

Method: Proposes REE-TTT with Spatio-temporal Test-Time Training (ST-TTT) blocks that replace standard linear projections in TTT layers with task-specific attention mechanisms, enabling adaptation to non-stationary meteorological distributions.

Result: Experiments under cross-regional extreme precipitation scenarios show REE-TTT substantially outperforms state-of-the-art baselines in prediction accuracy and generalization, demonstrating remarkable adaptability to data distribution shifts.

Conclusion: REE-TTT successfully addresses generalization limitations in precipitation nowcasting through adaptive test-time training, making it more applicable across diverse meteorological conditions and regions.

Abstract: Precipitation nowcasting is critically important for meteorological forecasting. Deep learning-based Radar Echo Extrapolation (REE) has become a predominant nowcasting approach, yet it suffers from poor generalization due to its reliance on high-quality local training data and static model parameters, limiting its applicability across diverse regions and extreme events. To overcome this, we propose REE-TTT, a novel model that incorporates an adaptive Test-Time Training (TTT) mechanism. The core of our model lies in the newly designed Spatio-temporal Test-Time Training (ST-TTT) block, which replaces the standard linear projections in TTT layers with task-specific attention mechanisms, enabling robust adaptation to non-stationary meteorological distributions and thereby significantly enhancing the feature representation of precipitation. Experiments under cross-regional extreme precipitation scenarios demonstrate that REE-TTT substantially outperforms state-of-the-art baseline models in prediction accuracy and generalization, exhibiting remarkable adaptability to data distribution shifts.

[589] Real Time NILM Based Power Monitoring of Identical Induction Motors Representing Cutting Machines in Textile Industry

Md Istiauk Hossain Rifat, Moin Khan, Mohammad Zunaed

Main category: cs.LG

TL;DR: Real-time NILM framework for textile industry using identical motor loads, with hardware setup and new dataset showing aggregate energy estimation works but disaggregation struggles with identical machines.

Details

Motivation: Textile industry in Bangladesh is energy-intensive with outdated monitoring practices, leading to inefficient power usage and high operational costs. Need for real-time monitoring solutions tailored for industrial applications.

Method: Developed NILM-based framework with hardware setup (voltage/current sensors, Arduino Mega, ESP8266) to capture aggregate and individual load data. Created new dataset from three identical induction motors and auxiliary loads (180k+ samples). Evaluated state-of-the-art MATNILM model under industrial conditions.

Result: Aggregate energy estimation was reasonably accurate, but per-appliance disaggregation faced difficulties, especially when multiple identical machines operated simultaneously. Integrated system demonstrated practical real-time monitoring with remote accessibility via Blynk app.

Conclusion: Highlights both potential and limitations of NILM in industrial contexts. Future improvements needed: higher-frequency data collection, larger-scale datasets, and advanced deep learning approaches for handling identical loads.

Abstract: The textile industry in Bangladesh is one of the most energy-intensive sectors, yet its monitoring practices remain largely outdated, resulting in inefficient power usage and high operational costs. To address this, we propose a real-time Non-Intrusive Load Monitoring (NILM)-based framework tailored for industrial applications, with a focus on identical motor-driven loads representing textile cutting machines. A hardware setup comprising voltage and current sensors, Arduino Mega and ESP8266 was developed to capture aggregate and individual load data, which was stored and processed on cloud platforms. A new dataset was created from three identical induction motors and auxiliary loads, totaling over 180,000 samples, to evaluate the state-of-the-art MATNILM model under challenging industrial conditions. Results indicate that while aggregate energy estimation was reasonably accurate, per-appliance disaggregation faced difficulties, particularly when multiple identical machines operated simultaneously. Despite these challenges, the integrated system demonstrated practical real-time monitoring with remote accessibility through the Blynk application. This work highlights both the potential and limitations of NILM in industrial contexts, offering insights into future improvements such as higher-frequency data collection, larger-scale datasets and advanced deep learning approaches for handling identical loads.

[590] Communication-Efficient Federated AUC Maximization with Cyclic Client Participation

Umesh Vangapally, Wenhan Wu, Chen Chen, Zhishuai Guo

Main category: cs.LG

TL;DR: This paper develops communication-efficient algorithms for federated AUC maximization under cyclic client participation, achieving improved complexity bounds for both squared surrogate loss and general pairwise AUC losses.

Details

Motivation: Existing federated AUC maximization methods assume full client availability, which is unrealistic in practice. Real-world FL systems have clients participating cyclically according to fixed schedules, creating unique optimization challenges for the non-decomposable AUC objective.

Method: The paper develops algorithms for two settings: (1) AUC maximization with squared surrogate loss, reformulated as nonconvex-strongly-concave minimax optimization leveraging the Polyak-Łojasiewicz (PL) condition; (2) General pairwise AUC losses with improved complexity under PL condition.

Result: For squared surrogate loss: achieved state-of-the-art communication complexity of $\widetilde{O}(1/ε^{1/2})$ and iteration complexity of $\widetilde{O}(1/ε)$. For general pairwise AUC losses: communication complexity of $O(1/ε^3)$ and iteration complexity of $O(1/ε^4)$, improving to $\widetilde{O}(1/ε^{1/2})$ and $\widetilde{O}(1/ε)$ under PL condition.

Conclusion: The proposed methods demonstrate superior efficiency and effectiveness on benchmark tasks in image classification, medical imaging, and fraud detection, addressing the practical challenge of cyclic client participation in federated AUC maximization.

Abstract: Federated AUC maximization is a powerful approach for learning from imbalanced data in federated learning (FL). However, existing methods typically assume full client availability, which is rarely practical. In real-world FL systems, clients often participate in a cyclic manner: joining training according to a fixed, repeating schedule. This setting poses unique optimization challenges for the non-decomposable AUC objective. This paper addresses these challenges by developing and analyzing communication-efficient algorithms for federated AUC maximization under cyclic client participation. We investigate two key settings: First, we study AUC maximization with a squared surrogate loss, which reformulates the problem as a nonconvex-strongly-concave minimax optimization. By leveraging the Polyak-Łojasiewicz (PL) condition, we establish a state-of-the-art communication complexity of $\widetilde{O}(1/ε^{1/2})$ and iteration complexity of $\widetilde{O}(1/ε)$. Second, we consider general pairwise AUC losses. We establish a communication complexity of $O(1/ε^3)$ and an iteration complexity of $O(1/ε^4)$. Further, under the PL condition, these bounds improve to communication complexity of $\widetilde{O}(1/ε^{1/2})$ and iteration complexity of $\widetilde{O}(1/ε)$. Extensive experiments on benchmark tasks in image classification, medical imaging, and fraud detection demonstrate the superior efficiency and effectiveness of our proposed methods.

[591] Output Embedding Centering for Stable LLM Pretraining

Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg

Main category: cs.LG

TL;DR: Output embedding centering (OEC) addresses training instability in large language models by fixing output logit divergence through geometric analysis of embeddings, outperforming z-loss.

Details

Motivation: Large language model pretraining is expensive and prone to training instabilities, particularly output logit divergence at large learning rates. Current mitigation (z-loss) only addresses symptoms, not the root cause.

Method: Proposes output embedding centering (OEC) based on geometric analysis of output embeddings. Two implementations: deterministic μ-centering operation and regularization-based μ-loss method.

Result: Both OEC variants outperform z-loss in training stability and learning rate sensitivity. They enable convergence at large learning rates where z-loss fails, with μ-loss being less sensitive to hyperparameter tuning.

Conclusion: OEC effectively addresses the root cause of output logit divergence, providing superior training stability compared to z-loss, with practical advantages in convergence and hyperparameter sensitivity.

Abstract: Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs for large learning rates at the end of training is output logit divergence. The most widely used mitigation strategy, z-loss, merely addresses the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings’ geometry and identify its cause. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and prove that it suppresses output logit divergence. OEC can be implemented in two different ways, as a deterministic operation called μ-centering, or a regularization method called μ-loss. Our experiments show that both variants outperform z-loss in terms of training stability and learning rate sensitivity. In particular, they ensure that training converges even for large learning rates when z-loss fails. Furthermore, we find that μ-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.

[592] Length-Aware Adversarial Training for Variable-Length Trajectories: Digital Twins for Mall Shopper Paths

He Sun, Jiwoong Shin, Ravi Dhar

Main category: cs.LG

TL;DR: Proposes length-aware sampling (LAS) for stable training of variable-length trajectory models, improving distribution matching of derived statistics.

Details

Motivation: Standard mini-batch training is unstable for variable-length trajectories due to length heterogeneity, which degrades distribution matching for trajectory-derived statistics needed for simulation and counterfactual analysis.

Method: Length-aware sampling (LAS) groups trajectories by length and samples batches from single length buckets, reducing within-batch heterogeneity. Integrated into conditional trajectory GAN with auxiliary time-alignment losses.

Result: LAS consistently improves matching of derived-variable distributions across diverse datasets (shopper trajectories, GPS, education, e-commerce, movies), outperforming random sampling on dataset-specific metrics.

Conclusion: LAS is a simple but effective batching strategy that improves generative modeling of variable-length trajectories by stabilizing training and enhancing distribution matching without changing model architecture.

Abstract: We study generative modeling of \emph{variable-length trajectories} – sequences of visited locations/items with associated timestamps – for downstream simulation and counterfactual analysis. A recurring practical issue is that standard mini-batch training can be unstable when trajectory lengths are highly heterogeneous, which in turn degrades \emph{distribution matching} for trajectory-derived statistics. We propose \textbf{length-aware sampling (LAS)}, a simple batching strategy that groups trajectories by length and samples batches from a single length bucket, reducing within-batch length heterogeneity (and making updates more consistent) without changing the model class. We integrate LAS into a conditional trajectory GAN with auxiliary time-alignment losses and provide (i) a distribution-level guarantee for derived variables under mild boundedness assumptions, and (ii) an IPM/Wasserstein mechanism explaining why LAS improves distribution matching by removing length-only shortcut critics and targeting within-bucket discrepancies. Empirically, LAS consistently improves matching of derived-variable distributions on a multi-mall dataset of shopper trajectories and on diverse public sequence datasets (GPS, education, e-commerce, and movies), outperforming random sampling across dataset-specific metrics.

[593] Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma

Main category: cs.LG

TL;DR: EAFT uses token-level entropy as a gating mechanism to distinguish epistemic uncertainty from knowledge conflicts during fine-tuning, preventing catastrophic forgetting while maintaining downstream performance.

Details

Motivation: Standard SFT causes catastrophic forgetting by forcing models to fit external supervision that conflicts with their internal beliefs, while RL preserves general capabilities by aligning with internal beliefs. The paper identifies "Confident Conflicts" tokens as the root cause.

Method: Proposes Entropy-Adaptive Fine-Tuning (EAFT) that uses token-level entropy to gate gradient updates: learns from uncertain samples (epistemic uncertainty) but suppresses gradients on conflicting data (low probability, low entropy tokens).

Result: Extensive experiments on Qwen and GLM series (4B to 32B parameters) across mathematical, medical, and agentic domains show EAFT matches SFT’s downstream performance while significantly mitigating degradation of general capabilities.

Conclusion: EAFT effectively addresses the distributional gap between model beliefs and external supervision, preventing catastrophic forgetting by selectively learning from uncertain tokens while avoiding destructive updates on confident conflicts.

Abstract: Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model’s internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as “Confident Conflicts” tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.

[594] Who is the Winning Algorithm? Rank Aggregation for Comparative Studies

Amichai Painsky

Main category: cs.LG

TL;DR: A framework for estimating win probabilities of ML algorithms using complete ranking data instead of just win counts, improving accuracy over existing methods.

Details

Motivation: Standard maximum likelihood approach only uses win counts, ignoring valuable information in complete rankings (2nd, 3rd places, etc.). There's a need to better utilize this complete ranking data to more accurately predict which algorithm will perform best on future datasets.

Method: Introduces a novel conceptual framework that leverages complete rankings (not just win counts) to estimate win probabilities for each algorithm. The framework uses the full ranking information across benchmark datasets to better model algorithm performance.

Result: The proposed framework significantly improves upon currently known methods in both synthetic and real-world examples, demonstrating better accuracy in predicting which algorithm will win on unseen datasets.

Conclusion: Complete ranking data contains valuable information beyond simple win counts, and the proposed framework effectively utilizes this information to provide more accurate win probability estimates for machine learning algorithms.

Abstract: Consider a collection of m competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, which algorithm is most likely to ``win’’ (rank highest) on a future, unseen dataset. The standard maximum likelihood approach suggests counting the number of wins per each algorithm. In this work, we argue that there is much more information in the complete rankings. That is, the number of times that each algorithm finished second, third and so forth. Yet, it is not entirely clear how to effectively utilize this information for our purpose. In this work we introduce a novel conceptual framework for estimating the win probability for each of the m algorithms, given their complete rankings over a benchmark of datasets. Our proposed framework significantly improves upon currently known methods in synthetic and real-world examples.

[595] Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan

Main category: cs.LG

TL;DR: Proposed a robustness framework for multi-objective combinatorial optimization DRL solvers with adversarial attacks to expose weaknesses and defense via hardness-aware training.

Details

Motivation: DRL shows promise for MOCOPs but robustness of learning-based solvers remains insufficiently explored, especially across diverse problem distributions.

Method: Unified robustness-oriented framework with preference-based adversarial attacks to generate hard instances, and defense strategy integrating hardness-aware preference selection into adversarial training.

Result: Attack method successfully learns hard instances for different solvers; defense method significantly strengthens robustness and generalizability, delivering superior performance on hard/out-of-distribution instances.

Conclusion: The framework effectively exposes solver weaknesses and improves neural solver robustness through adversarial training with hardness-aware preference selection.

Abstract: Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

[596] HeurekaBench: A Benchmarking Framework for AI Co-scientist

Siba Smarak Panigrahi, Jovana Videnović, Maria Brbić

Main category: cs.LG

TL;DR: HeurekaBench is a framework for creating benchmarks with open-ended research questions grounded in real scientific studies to evaluate LLM-based scientific agents, demonstrated in single-cell biology.

Details

Motivation: Current evaluation of LLM-based scientific agents is challenging due to the need for realistic, end-to-end research scenarios that integrate data analysis, interpretation, and insight generation from experimental data.

Method: Semi-automated pipeline using multiple LLMs to extract insights and generate candidate workflows from scientific studies and code repositories, verified against reported findings. Instantiated in single-cell biology as sc-HeurekaBench.

Result: Addition of critic module improves ill-formed responses for open-source LLM-based agents by up to 22%, closing gap with closed-source counterparts. Benchmark enables quantitative analysis of agent design choices.

Conclusion: HeurekaBench provides rigorous, end-to-end evaluation of scientific agents by grounding benchmark construction in real scientific workflows, setting a path for better assessment of co-scientist systems.

Abstract: LLM-based reasoning models have enabled the development of agentic systems that act as co-scientists, assisting in multi-step scientific analysis. However, evaluating these systems is challenging, as it requires realistic, end-to-end research scenarios that integrate data analysis, interpretation, and the generation of new insights from the experimental data. To address this limitation, we introduce HeurekaBench, a framework to create benchmarks with exploratory, open-ended research questions for experimental datasets. Each such question is grounded in a scientific study and its corresponding code repository, and is created using a semi-automated pipeline that leverages multiple LLMs to extract insights and generate candidate workflows, which are then verified against reported findings. We instantiate the framework in single-cell biology to obtain sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We further showcase the benefits of our benchmark for quantitatively analyzing current design choices in agentic systems. We find that the addition of a critic module can improve ill-formed responses for open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts. Overall, HeurekaBench sets a path toward rigorous, end-to-end evaluation of scientific agents, grounding benchmark construction in real scientific workflows.

[597] Digital Twin-Driven Communication-Efficient Federated Anomaly Detection for Industrial IoT

Mohammed Ayalew Belay, Adil Rasheed, Pierluigi Salvo Rossi

Main category: cs.LG

TL;DR: Proposes digital twin-integrated federated learning methods for IIoT anomaly detection that combine synthetic and real-world knowledge to improve privacy, communication efficiency, and convergence speed.

Details

Motivation: Existing anomaly detection methods face challenges including dependence on real sensor data only, limited labeled data, high false alarm rates, and privacy concerns in industrial systems.

Method: Five novel DTFL approaches: Digital Twin-Based Meta-Learning (DTML), Federated Parameter Fusion (FPF), Layer-wise Parameter Exchange (LPE), Cyclic Weight Adaptation (CWA), and Digital Twin Knowledge Distillation (DTKD). These methods integrate synthetic knowledge from digital twins with real-world data in federated learning framework.

Result: For 80% target accuracy, CWA reached target in 33 rounds, FPF in 41 rounds, LPE in 48 rounds, DTML in 87 rounds, while standard FedAvg and DTKD didn’t reach target within 100 rounds. CWA achieved up to 62% fewer rounds than DTML and 31% fewer than LPE.

Conclusion: Integrating digital twin knowledge into federated learning accelerates convergence to operationally meaningful accuracy thresholds for IIoT anomaly detection while preserving data privacy and improving communication efficiency.

Abstract: Anomaly detection is increasingly becoming crucial for maintaining the safety, reliability, and efficiency of industrial systems. Recently, with the advent of digital twins and data-driven decision-making, several statistical and machine-learning methods have been proposed. However, these methods face several challenges, such as dependence on only real sensor datasets, limited labeled data, high false alarm rates, and privacy concerns. To address these problems, we propose a suite of digital twin-integrated federated learning (DTFL) methods that enhance global model performance while preserving data privacy and communication efficiency. Specifically, we present five novel approaches: Digital Twin-Based Meta-Learning (DTML), Federated Parameter Fusion (FPF), Layer-wise Parameter Exchange (LPE), Cyclic Weight Adaptation (CWA), and Digital Twin Knowledge Distillation (DTKD). Each method introduces a unique mechanism to combine synthetic and real-world knowledge, balancing generalization with communication overhead. We conduct an extensive experiment using a publicly available cyber-physical anomaly detection dataset. For a target accuracy of 80%, CWA reaches the target in 33 rounds, FPF in 41 rounds, LPE in 48 rounds, and DTML in 87 rounds, whereas the standard FedAvg baseline and DTKD do not reach the target within 100 rounds. These results highlight substantial communication-efficiency gains (up to 62% fewer rounds than DTML and 31% fewer than LPE) and demonstrate that integrating DT knowledge into FL accelerates convergence to operationally meaningful accuracy thresholds for IIoT anomaly detection.

[598] DiMEx: Breaking the Cold Start Barrier in Data-Free Model Extraction via Latent Diffusion Priors

Yash Thesia, Meera Suthar

Main category: cs.LG

TL;DR: DiMEx uses Latent Diffusion Models and Bayesian Optimization to perform efficient data-free model extraction, while HSE defense detects latent-space attacks via their optimization trajectory.

Details

Motivation: Model stealing attacks threaten MLaaS by allowing replication of proprietary models cheaply. Current DFME methods suffer from "Cold Start" problem where GAN-based approaches waste queries converging from random noise.

Method: DiMEx weaponizes pre-trained Latent Diffusion Models’ semantic priors and uses Random Embedding Bayesian Optimization (REMBO) in the latent space to synthesize high-fidelity queries immediately. HSE defense identifies the unique “optimization trajectory” of latent-space attacks.

Result: DiMEx achieves 52.1% agreement on SVHN with just 2,000 queries, outperforming GAN baselines by over 16%. HSE suppresses attack success rates to 21.6% with negligible latency while DiMEx evades static distribution detectors.

Conclusion: Latent Diffusion Models enable efficient model extraction bypassing cold start, but their temporal optimization signatures can be exploited for defense via ensemble methods that track attack trajectories.

Abstract: Model stealing attacks pose an existential threat to Machine Learning as a Service (MLaaS), allowing adversaries to replicate proprietary models for a fraction of their training cost. While Data-Free Model Extraction (DFME) has emerged as a stealthy vector, it remains fundamentally constrained by the “Cold Start” problem: GAN-based adversaries waste thousands of queries converging from random noise to meaningful data. We propose DiMEx, a framework that weaponizes the rich semantic priors of pre-trained Latent Diffusion Models to bypass this initialization barrier entirely. By employing Random Embedding Bayesian Optimization (REMBO) within the generator’s latent space, DiMEx synthesizes high-fidelity queries immediately, achieving 52.1 percent agreement on SVHN with just 2,000 queries - outperforming state-of-the-art GAN baselines by over 16 percent. To counter this highly semantic threat, we introduce the Hybrid Stateful Ensemble (HSE) defense, which identifies the unique “optimization trajectory” of latent-space attacks. Our results demonstrate that while DiMEx evades static distribution detectors, HSE exploits this temporal signature to suppress attack success rates to 21.6 percent with negligible latency.

[599] Enhanced Multi-model Online Conformal Prediction

Erfan Hajihashemi, Yanning Shen

Main category: cs.LG

TL;DR: Novel multi-model online conformal prediction algorithm that uses bipartite graphs to select effective models, reducing computational complexity while improving prediction set efficiency.

Details

Motivation: Traditional conformal prediction with a single fixed model leads to suboptimal performance in online environments, as no single model performs consistently well across all time steps. Existing multi-model approaches become computationally expensive with many candidates, and poorly performing models can hinder effectiveness.

Method: Develops a multi-model online conformal prediction algorithm that generates a bipartite graph at each time step to identify a subset of effective models, then selects one model from this subset to construct the prediction set.

Result: Experiments show the method outperforms existing multi-model conformal prediction techniques in both prediction set size (efficiency) and computational efficiency.

Conclusion: The proposed bipartite graph-based model selection approach successfully addresses computational complexity and prediction efficiency challenges in multi-model online conformal prediction.

Abstract: Conformal prediction is a framework for uncertainty quantification that constructs prediction sets for previously unseen data, guaranteeing coverage of the true label with a specified probability. However, the efficiency of these prediction sets, measured by their size, depends on the choice of the underlying learning model. Relying on a single fixed model may lead to suboptimal performance in online environments, as a single model may not consistently perform well across all time steps. To mitigate this, prior work has explored selecting a model from a set of candidates. However, this approach becomes computationally expensive as the number of candidate models increases. Moreover, poorly performing models in the set may also hinder the effectiveness. To tackle this challenge, this work develops a novel multi-model online conformal prediction algorithm that reduces computational complexity and improves prediction efficiency. At each time step, a bipartite graph is generated to identify a subset of effective models, from which a model is selected to construct the prediction set. Experiments demonstrate that our method outperforms existing multi-model conformal prediction techniques in terms of both prediction set size and computational efficiency.

[600] UnPII: Unlearning Personally Identifiable Information with Quantifiable Exposure Risk

Intae Jeon, Yujeong Kwon, Hyungjoon Koo

Main category: cs.LG

TL;DR: UnPII is a PII-centric machine unlearning approach that prioritizes forgetting based on individual PII attribute risks using a composite PII Risk Index, achieving better accuracy, utility, and generalizability with modest overhead.

Details

Motivation: Growing privacy concerns with LLMs handling sensitive PII in critical sectors, combined with regulations like GDPR requiring PII deletion, create a need for reliable and cost-effective data removal solutions that account for varying privacy risks of different PII attributes.

Method: Proposes UnPII with a PII Risk Index (PRI) that evaluates multiple risk dimensions (identifiability, sensitivity, usability, linkability, permanency, exposability, compliancy). Uses synthetic PII dataset with realistic exposure scenarios, integrates with existing unlearning algorithms (Gradient Ascent, Negative Preference Optimization, Direct Preference Optimization) without modifying their core principles.

Result: UnPII achieves improvements of accuracy up to 11.8%, utility up to 6.3%, and generalizability up to 12.4%, while incurring modest fine-tuning overhead of 27.5% on average during unlearning.

Conclusion: UnPII provides a practical, risk-aware unlearning solution for PII removal that outperforms uniform forgetting strategies, offering organizations a way to comply with privacy regulations while maintaining model performance.

Abstract: The ever-increasing adoption of Large Language Models in critical sectors like finance, healthcare, and government raises privacy concerns regarding the handling of sensitive Personally Identifiable Information (PII) during training. In response, regulations such as European Union’s General Data Protection Regulation (GDPR) mandate the deletion of PII upon requests, underscoring the need for reliable and cost-effective data removal solutions. Machine unlearning has emerged as a promising direction for selectively forgetting data points. However, existing unlearning techniques typically apply a uniform forgetting strategy that neither accounts for the varying privacy risks posed by different PII attributes nor reflects associated business risks. In this work, we propose UnPII, the first PII-centric unlearning approach that prioritizes forgetting based on the risk of individual or combined PII attributes. To this end, we introduce the PII risk index (PRI), a composite metric that incorporates multiple dimensions of risk factors: identifiability, sensitivity, usability, linkability, permanency, exposability, and compliancy. The PRI enables a nuanced evaluation of privacy risks associated with PII exposures and can be tailored to align with organizational privacy policies. To support realistic assessment, we systematically construct a synthetic PII dataset (e.g., 1,700 PII instances) that simulates realistic exposure scenarios. UnPII seamlessly integrates with established unlearning algorithms, such as Gradient Ascent, Negative Preference Optimization, and Direct Preference Optimization, without modifying their underlying principles. Our experimental results demonstrate that UnPII achieves the improvements of accuracy up to 11.8%, utility up to 6.3%, and generalizability up to 12.4%, respectively, while incurring a modest fine-tuning overhead of 27.5% on average during unlearning.

[601] Distributed Federated Learning by Alternating Periods of Training

Shamik Bhattacharyya, Rachel Kalpana Kalaimani

Main category: cs.LG

TL;DR: This paper proposes a distributed federated learning (DFL) approach with multiple servers to address scalability and fault-tolerance limitations of traditional single-server federated learning.

Details

Motivation: Traditional federated learning relies on a single central server, which creates scalability issues with large numbers of clients and poses a single point of failure risk. The paper aims to address these limitations of scalability and fault-tolerance.

Method: The authors design a distributed federated learning framework with multiple servers that have inter-server communication capabilities. Each server is associated with a disjoint set of clients. They propose a DFL algorithm that uses alternating periods of local training on client data followed by global training among servers.

Result: The DFL algorithm, with appropriate parameter choices, ensures all servers converge to a common model value within a small tolerance of the ideal model, effectively integrating local and global training. Theoretical claims are supported by numerical simulations.

Conclusion: The proposed distributed federated learning approach successfully addresses scalability and fault-tolerance limitations of traditional federated learning while maintaining the core federated learning structure and achieving convergence to a common model across all servers.

Abstract: Federated learning is a privacy-focused approach towards machine learning where models are trained on client devices with locally available data and aggregated at a central server. However, the dependence on a single central server is challenging in the case of a large number of clients and even poses the risk of a single point of failure. To address these critical limitations of scalability and fault-tolerance, we present a distributed approach to federated learning comprising multiple servers with inter-server communication capabilities. While providing a fully decentralized approach, the designed framework retains the core federated learning structure where each server is associated with a disjoint set of clients with server-client communication capabilities. We propose a novel DFL (Distributed Federated Learning) algorithm which uses alternating periods of local training on the client data followed by global training among servers. We show that the DFL algorithm, under a suitable choice of parameters, ensures that all the servers converge to a common model value within a small tolerance of the ideal model, thus exhibiting effective integration of local and global training models. Finally, we illustrate our theoretical claims through numerical simulations.

[602] Sparse Threats, Focused Defense: Criticality-Aware Robust Reinforcement Learning for Safe Autonomous Driving

Qi Wei, Junchao Fan, Zhao Yang, Jianhua Wang, Jingkai Mao, Xiaolin Chang

Main category: cs.LG

TL;DR: CARRL introduces criticality-aware robust RL for autonomous driving, using a general-sum game with risk-focused adversarial training to handle sparse safety-critical risks, reducing collision rates by at least 22.66%.

Details

Motivation: Current RL approaches for autonomous driving are vulnerable to perturbations, and existing adversarial training methods fail to address the inherent asymmetry between agent and adversary and the sparsity of safety-critical risks, making them inadequate for practical deployment.

Method: CARRL consists of two components: Risk Exposure Adversary (REA) that focuses on exposing safety-critical failures using decoupled optimization, and Risk-Targeted Robust Agent (RTRA) that learns to balance safety with efficiency using a dual replay buffer and policy consistency enforcement.

Result: The approach reduces collision rates by at least 22.66% across all cases compared to state-of-the-art baseline methods, demonstrating superior robustness in handling sparse safety-critical risks.

Conclusion: CARRL effectively addresses the limitations of existing adversarial training methods for autonomous driving by modeling the interaction as a general-sum game and focusing on sparse safety-critical moments, leading to significantly improved robustness and safety.

Abstract: Reinforcement learning (RL) has shown considerable potential in autonomous driving (AD), yet its vulnerability to perturbations remains a critical barrier to real-world deployment. As a primary countermeasure, adversarial training improves policy robustness by training the AD agent in the presence of an adversary that deliberately introduces perturbations. Existing approaches typically model the interaction as a zero-sum game with continuous attacks. However, such designs overlook the inherent asymmetry between the agent and the adversary and then fail to reflect the sparsity of safety-critical risks, rendering the achieved robustness inadequate for practical AD scenarios. To address these limitations, we introduce criticality-aware robust RL (CARRL), a novel adversarial training approach for handling sparse, safety-critical risks in autonomous driving. CARRL consists of two interacting components: a risk exposure adversary (REA) and a risk-targeted robust agent (RTRA). We model the interaction between the REA and RTRA as a general-sum game, allowing the REA to focus on exposing safety-critical failures (e.g., collisions) while the RTRA learns to balance safety with driving efficiency. The REA employs a decoupled optimization mechanism to better identify and exploit sparse safety-critical moments under a constrained budget. However, such focused attacks inevitably result in a scarcity of adversarial data. The RTRA copes with this scarcity by jointly leveraging benign and adversarial experiences via a dual replay buffer and enforces policy consistency under perturbations to stabilize behavior. Experimental results demonstrate that our approach reduces the collision rate by at least 22.66% across all cases compared to state-of-the-art baseline methods.

[603] Moments Matter:Stabilizing Policy Optimization using Return Distributions

Dennis Jabs, Aditya Mohan, Marius Lindauer

Main category: cs.LG

TL;DR: The paper proposes using higher-order moments (skewness and kurtosis) of distributional critic outputs to bias PPO’s advantage function, reducing policy instability caused by noisy updates while maintaining performance.

Details

Motivation: Deep RL agents often learn policies with the same episodic return but different behaviors due to environmental and algorithmic noise. In continuous control, small parameter shifts can cause unstable gaits, complicating algorithm comparison and real-world transfer. Previous work showed that the spread of post-update return distribution R(θ) indicates noise-induced instability, but directly estimating R(θ) is computationally expensive in high-dimensional settings.

Method: The method models state-action return distribution through a distributional critic and biases PPO’s advantage function using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, the approach discourages policies from entering parameter regimes prone to instability. This leverages environmental stochasticity to mitigate update-induced variability.

Result: The moment-based correction narrows the post-update return distribution R(θ), improving stability by up to 75% in Walker2D environment while preserving comparable evaluation returns. The method addresses cases where post-update critic values align poorly with post-update returns, which causes standard PPO to struggle with producing a narrow R(θ).

Conclusion: Using higher-order moments of distributional critic outputs to bias PPO’s advantage function effectively reduces policy instability in continuous control tasks. This approach provides a computationally efficient alternative to directly constraining R(θ) while maintaining performance, addressing the challenge of noisy policy updates in deep reinforcement learning.

Abstract: Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the spread of post-update return distribution $R(θ)$, obtained by repeatedly sampling minibatches, updating $θ$, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow $R(θ)$ can improve stability, directly estimating $R(θ)$ is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model state-action return distribution through a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow $R(θ)$. In such cases, our moment-based correction narrows $R(θ)$, improving stability by up to 75% in Walker2D, while preserving comparable evaluation returns.

[604] RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data

Peiyan Hu, Haodong Feng, Hongyuan Liu, Tongtong Yan, Wenhao Deng, Tianrun Gao, Rong Zheng, Haoren Zheng, Chenglei Yu, Chuanrui Wang, Kaiwen Li, Zhi-Ming Ma, Dezhi Zhou, Xingcai Lu, Dixia Fan, Tailin Wu

Main category: cs.LG

TL;DR: RealPDEBench is the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations across five datasets, three tasks, eight metrics, and ten baselines to address the sim-to-real gap.

Details

Motivation: Current scientific ML models are trained on simulated data due to lack of expensive real-world data, limiting development, evaluation, and sim-to-real transfer research.

Method: Introduces RealPDEBench with five real-world measured datasets paired with simulations, defines three comparison tasks, designs eight evaluation metrics (data-oriented and physics-oriented), and benchmarks ten representative baselines.

Result: Experiments show significant discrepancies between simulated and real-world data, while pretraining with simulated data consistently improves both accuracy and convergence.

Conclusion: The benchmark provides insights from real-world data to advance scientific ML toward bridging the sim-to-real gap and enabling real-world deployment.

Abstract: Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the lack of expensive real-world data, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real-world and simulated data, and facilitate the development of methods to bridge the two. Moreover, we design eight evaluation metrics, spanning data-oriented and physics-oriented metrics, and finally benchmark ten representative baselines, including state-of-the-art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence. In this work, we hope to provide insights from real-world data, advancing scientific ML toward bridging the sim-to-real gap and real-world deployment. Our benchmark, datasets, and instructions are available at https://realpdebench.github.io/.

[605] FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks

Chenyu Hu, Qiming Hu, Sinan Chen, Nianyu Li, Mingyue Zhang, Jialong Li

Main category: cs.LG

TL;DR: FAROS is an enhanced FL framework with adaptive defenses against backdoor attacks using ADS and RCC mechanisms.

Details

Motivation: Existing FL defenses against backdoor attacks rely on fixed parameters, making them vulnerable to sophisticated attackers and single-point-of-failure risks.

Method: Proposes FAROS framework with Adaptive Differential Scaling (ADS) that dynamically adjusts defense sensitivity based on gradient dispersion, and Robust Core-set Computing (RCC) that computes centroids from high-confidence clients to mitigate single-point failure.

Result: Extensive experiments show FAROS outperforms current defenses in both attack success rate reduction and main task accuracy preservation across various datasets, models, and attack scenarios.

Conclusion: FAROS provides robust defense against sophisticated backdoor attacks in FL through adaptive mechanisms that address limitations of fixed-parameter approaches.

Abstract: Federated Learning (FL) enables multiple clients to collaboratively train a shared model without exposing local data. However, backdoor attacks pose a significant threat to FL. These attacks aim to implant a stealthy trigger into the global model, causing it to mislead on inputs that possess a specific trigger while functioning normally on benign data. Although pre-aggregation detection is a main defense direction, existing state-of-the-art defenses often rely on fixed defense parameters. This reliance makes them vulnerable to single-point-of-failure risks, rendering them less effective against sophisticated attackers. To address these limitations, we propose FAROS, an enhanced FL framework that incorporates Adaptive Differential Scaling (ADS) and Robust Core-set Computing (RCC). The ADS mechanism adjusts the defense’s sensitivity dynamically, based on the dispersion of uploaded gradients by clients in each round. This allows it to counter attackers who strategically shift between stealthiness and effectiveness. Furthermore, the RCC effectively mitigates the risk of single-point failure by computing the centroid of a core set comprising clients with the highest confidence. We conducted extensive experiments across various datasets, models, and attack scenarios. The results demonstrate that our method outperforms current defenses in both attack success rate and main task accuracy.

[606] Tackling Resource-Constrained and Data-Heterogeneity in Federated Learning with Double-Weight Sparse Pack

Qiantao Yang, Liquan Chen, Mingfu Xue, Songze Li

Main category: cs.LG

TL;DR: FedCSPACK: A personalized federated learning method using cosine sparsification parameter packing and dual-weighted aggregation to address data heterogeneity while accommodating limited client resources.

Details

Motivation: Existing federated learning methods struggle to balance addressing data heterogeneity with accommodating limited client resources (insufficient communication bandwidth and computing power). Current approaches that split models and use knowledge distillation neglect these resource constraints.

Method: FedCSPACK uses cosine sparsification parameter packing where clients package model parameters and select the most contributing parameter packages based on cosine similarity to reduce bandwidth. Clients generate mask matrices anchored to shared parameter packages to improve sparse update alignment and aggregation efficiency. The method incorporates directional and distribution distance weights in the mask for weighted-guided aggregation.

Result: Extensive experiments across four datasets using ten state-of-the-art methods show that FedCSPACK effectively improves communication and computational efficiency while maintaining high model accuracy.

Conclusion: FedCSPACK successfully addresses the dual challenge of data heterogeneity and limited client resources in federated learning by combining parameter packing, sparsification, and dual-weighted aggregation mechanisms.

Abstract: Federated learning has drawn widespread interest from researchers, yet the data heterogeneity across edge clients remains a key challenge, often degrading model performance. Existing methods enhance model compatibility with data heterogeneity by splitting models and knowledge distillation. However, they neglect the insufficient communication bandwidth and computing power on the client, failing to strike an effective balance between addressing data heterogeneity and accommodating limited client resources. To tackle this limitation, we propose a personalized federated learning method based on cosine sparsification parameter packing and dual-weighted aggregation (FedCSPACK), which effectively leverages the limited client resources and reduces the impact of data heterogeneity on model performance. In FedCSPACK, the client packages model parameters and selects the most contributing parameter packages for sharing based on cosine similarity, effectively reducing bandwidth requirements. The client then generates a mask matrix anchored to the shared parameter package to improve the alignment and aggregation efficiency of sparse updates on the server. Furthermore, directional and distribution distance weights are embedded in the mask to implement a weighted-guided aggregation mechanism, enhancing the robustness and generalization performance of the global model. Extensive experiments across four datasets using ten state-of-the-art methods demonstrate that FedCSPACK effectively improves communication and computational efficiency while maintaining high model accuracy.

[607] High-Order Epistasis Detection Using Factorization Machine with Quadratic Optimization Annealing and MDR-Based Evaluation

Shuta Kikuchi, Shu Tanaka

Main category: cs.LG

TL;DR: FMQA-based method efficiently detects high-order epistasis by treating it as black-box optimization, outperforming exhaustive MDR searches in computational efficiency.

Details

Motivation: Exhaustive MDR searches become computationally infeasible for high-order epistasis detection due to combinatorial explosion of candidate locus combinations, necessitating more efficient methods.

Method: Formulate epistasis detection as black-box optimization problem solved with factorization machine with quadratic optimization annealing (FMQA), using MDR’s classification error rate as objective function.

Result: Method successfully identified ground-truth epistasis across various interaction orders and numbers of genetic loci within limited iterations, demonstrating effectiveness and computational efficiency.

Conclusion: Proposed FMQA-based approach provides effective and computationally efficient solution for high-order epistasis detection, overcoming limitations of exhaustive MDR searches.

Abstract: Detecting high-order epistasis is a fundamental challenge in genetic association studies due to the combinatorial explosion of candidate locus combinations. Although multifactor dimensionality reduction (MDR) is a widely used method for evaluating epistasis, exhaustive MDR-based searches become computationally infeasible as the number of loci or the interaction order increases. In this paper, we define the epistasis detection problem as a black-box optimization problem and solve it with a factorization machine with quadratic optimization annealing (FMQA). We propose an efficient epistasis detection method based on FMQA, in which the classification error rate (CER) computed by MDR is used as a black-box objective function. Experimental evaluations were conducted using simulated case-control datasets with predefined high-order epistasis. The results demonstrate that the proposed method successfully identified ground-truth epistasis across various interaction orders and the numbers of genetic loci within a limited number of iterations. These results indicate that the proposed method is effective and computationally efficient for high-order epistasis detection.

[608] Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia

Main category: cs.LG

TL;DR: Safety alignment in LLMs can be fully recovered with just one safety example, without utility degradation, regardless of model size or harmful examples used in fine-tuning.

Details

Motivation: Fine-tuning safety-aligned LLMs often compromises their safety, and previous approaches require many safety samples or calibration sets, leading to significant computational overhead and utility degradation.

Method: The paper shows that safety alignment can be recovered with only a single safety example, without sacrificing utility. The approach works regardless of the number of harmful examples used in fine-tuning or model size, and converges within few epochs. The authors uncover the low-rank structure of the safety gradient to explain why such efficient correction is possible.

Result: The method is validated across five safety-aligned LLMs and multiple datasets, demonstrating the generality of the approach. Safety recovery is effective and efficient with minimal computational cost.

Conclusion: Contrary to previous beliefs, safety alignment in LLMs can be efficiently recovered with minimal data (just one safety example) without compromising utility, due to the low-rank structure of safety gradients.

Abstract: Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.

[609] FedBiCross: A Bi-Level Optimization Framework to Tackle Non-IID Challenges in Data-Free One-Shot Federated Learning on Medical Data

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia

Main category: cs.LG

TL;DR: FedBiCross is a personalized one-shot federated learning framework that addresses the problem of conflicting predictions in non-IID medical data by clustering clients, using bi-level cross-cluster optimization, and personalized distillation.

Details

Motivation: Existing one-shot federated learning methods aggregate predictions from all clients to form a global teacher, but under non-IID data, conflicting predictions cancel out during averaging, yielding near-uniform soft labels that provide weak supervision for distillation.

Method: Three-stage framework: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation.

Result: Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.

Conclusion: FedBiCross effectively addresses the limitations of existing OSFL methods in non-IID medical data scenarios through personalized clustering and selective knowledge transfer.

Abstract: Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions cancel out during averaging, yielding near-uniform soft labels that provide weak supervision for distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.

[610] Evaluating Feature Dependent Noise in Preference-based Reinforcement Learning

Yuxuan Li, Harshith Reddy Kethireddy, Srijita Das

Main category: cs.LG

TL;DR: The paper formalizes targeted feature-dependent noise in preference-based RL, showing that current noise-robust methods fail against such noise, while basic methods surprisingly outperform them in most settings.

Details

Motivation: Existing PbRL methods assume uniform noise, but real-world preferences often have feature-dependent noise correlated with observations. Current noise-robust methods don't handle this realistic noise pattern effectively.

Method: Formalized feature-dependent noise concept with variants: trajectory feature noise, trajectory similarity noise, uncertainty-aware noise, and Language Model noise. Evaluated on DMControl and Meta-world continuous control tasks.

Result: State-of-the-art noise-robust PbRL methods significantly deteriorate under feature-dependent noise, while basic PbRL without explicit denoising surprisingly outperforms them in most settings. Language model noise exhibits similar characteristics to feature-dependent noise.

Conclusion: Feature-dependent noise is a realistic challenge that current noise-robust PbRL methods fail to address. Basic methods are surprisingly more robust, and language models can simulate realistic human noise patterns, calling for new robust learning approaches.

Abstract: Learning from Preferences in Reinforcement Learning (PbRL) has gained attention recently, as it serves as a natural fit for complicated tasks where the reward function is not easily available. However, preferences often come with uncertainty and noise if they are not from perfect teachers. Much prior literature aimed to detect noise, but with limited types of noise and most being uniformly distributed with no connection to observations. In this work, we formalize the notion of targeted feature-dependent noise and propose several variants like trajectory feature noise, trajectory similarity noise, uncertainty-aware noise, and Language Model noise. We evaluate feature-dependent noise, where noise is correlated with certain features in complex continuous control tasks from DMControl and Meta-world. Our experiments show that in some feature-dependent noise settings, the state-of-the-art noise-robust PbRL method’s learning performance is significantly deteriorated, while PbRL method with no explicit denoising can surprisingly outperform noise-robust PbRL in majority settings. We also find language model’s noise exhibits similar characteristics to feature-dependent noise, thereby simulating realistic humans and call for further study in learning with feature-dependent noise robustly.

[611] TT-FSI: Scalable Faithful Shapley Interactions via Tensor-Train

Ungsik Kim, Suwon Lee

Main category: cs.LG

TL;DR: TT-FSI is an efficient algorithm that computes the Faithful Shapley Interaction index using Matrix Product Operators, achieving exponential improvements in time and memory over existing methods.

Details

Motivation: The Faithful Shapley Interaction (FSI) index is theoretically desirable but computationally expensive, requiring O(d^ℓ·2^d) time and O(4^d) memory, making it impractical for high-dimensional problems.

Method: TT-FSI exploits FSI’s algebraic structure using Matrix Product Operators (MPO), proving that the linear operator v ↦ FSI(v) admits an MPO representation with TT-rank O(ℓd), enabling an efficient sweep algorithm with O(ℓ²d³·2^d) time and O(ℓd²) core storage.

Result: Experiments on six datasets (d=8 to d=20) show up to 280× speedup over baseline, 85× over SHAP-IQ, and 290× memory reduction. TT-FSI scales to d=20 (1M coalitions) where all competing methods fail.

Conclusion: TT-FSI enables practical computation of the theoretically desirable FSI index for high-dimensional problems by leveraging tensor network techniques to achieve exponential improvements in computational efficiency.

Abstract: The Faithful Shapley Interaction (FSI) index uniquely satisfies the faithfulness axiom among Shapley interaction indices, but computing FSI requires $O(d^\ell \cdot 2^d)$ time and existing implementations use $O(4^d)$ memory. We present TT-FSI, which exploits FSI’s algebraic structure via Matrix Product Operators (MPO). Our main theoretical contribution is proving that the linear operator $v \mapsto \text{FSI}(v)$ admits an MPO representation with TT-rank $O(\ell d)$, enabling an efficient sweep algorithm with $O(\ell^2 d^3 \cdot 2^d)$ time and $O(\ell d^2)$ core storage an exponential improvement over existing methods. Experiments on six datasets ($d=8$ to $d=20$) demonstrate up to 280$\times$ speedup over baseline, 85$\times$ over SHAP-IQ, and 290$\times$ memory reduction. TT-FSI scales to $d=20$ (1M coalitions) where all competing methods fail.

[612] Distorted Distributional Policy Evaluation for Offline Reinforcement Learning

Ryo Iwaki, Takayuki Osogami

Main category: cs.LG

TL;DR: Offline DRL methods suffer from uniform pessimism in return quantile estimation, leading to overly conservative value estimates. The paper introduces quantile distortion to enable non-uniform pessimism based on data availability.

Details

Motivation: Distributional RL works well online but struggles offline. Existing offline DRL methods uniformly underestimate return quantiles, causing overly conservative value estimates that hinder generalization and performance.

Method: Introduces quantile distortion concept to enable non-uniform pessimism by adjusting conservatism degree based on supporting data availability. The approach is theoretically grounded.

Result: Empirically validated approach demonstrates improved performance over uniform pessimism methods in offline DRL scenarios.

Conclusion: Quantile distortion addresses limitations of uniform pessimism in offline DRL, enabling better generalization and performance through data-adaptive conservatism.

Abstract: While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, its success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods lies in their approach to uniformly underestimate return quantiles. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism based on the availability of supporting data. Our approach is grounded in theoretical analysis and empirically validated, demonstrating improved performance over uniform pessimism.

[613] Theoretical Convergence of SMOTE-Generated Samples

Firuz Kamalov, Hana Sulieman, Witold Pedrycz

Main category: cs.LG

TL;DR: Theoretical proof that SMOTE’s synthetic data converges to the original distribution, with faster convergence using fewer nearest neighbors.

Details

Motivation: SMOTE is widely used for imbalanced data but lacks rigorous theoretical validation. The paper aims to provide foundational theoretical understanding of SMOTE's convergence properties to enhance data augmentation techniques.

Method: Rigorous mathematical analysis proving convergence properties: 1) synthetic variable Z converges in probability to original variable X, 2) stronger convergence in mean when X is compact, 3) analysis of convergence speed relative to nearest neighbor rank. Supported by numerical experiments with real and synthetic data.

Result: Proved theoretical convergence guarantees for SMOTE: probability convergence and mean convergence under compactness. Showed lower nearest neighbor rank leads to faster convergence, providing practical guidance for parameter selection.

Conclusion: The work establishes foundational theoretical understanding of SMOTE, validating its effectiveness and providing actionable insights for practitioners. The analysis extends beyond imbalanced data to enhance general data augmentation techniques.

Abstract: Imbalanced data affects a wide range of machine learning applications, from healthcare to network security. As SMOTE is one of the most popular approaches to addressing this issue, it is imperative to validate it not only empirically but also theoretically. In this paper, we provide a rigorous theoretical analysis of SMOTE’s convergence properties. Concretely, we prove that the synthetic random variable Z converges in probability to the underlying random variable X. We further prove a stronger convergence in mean when X is compact. Finally, we show that lower values of the nearest neighbor rank lead to faster convergence offering actionable guidance to practitioners. The theoretical results are supported by numerical experiments using both real-life and synthetic data. Our work provides a foundational understanding that enhances data augmentation techniques beyond imbalanced data scenarios.

[614] DéjàQ: Open-Ended Evolution of Diverse, Learnable and Verifiable Problems

Willem Röpke, Samuel Coward, Andrei Lupu, Thomas Foster, Tim Rocktäschel, Jakob Foerster

Main category: cs.LG

TL;DR: DéjàQ is a framework that evolves synthetic math problems alongside model training using LLM-driven mutation strategies to improve reasoning capabilities.

Details

Motivation: Current reasoning models rely on static datasets that encourage memorization and limit generalization. The authors aim to create a dynamic training approach that adapts to model capabilities throughout training.

Method: Joint evolution of diverse synthetic mathematical problems with model training using two LLM-driven mutation strategies: altering contextual details or directly modifying problem structure. The model itself mutates the training data in an evolutionary process.

Result: The model can generate novel and meaningful problems, and LLM-driven mutations improve RL training. The framework demonstrates potential for enhancing mathematical reasoning with manageable computational overhead.

Conclusion: Dynamically evolving training data enhances mathematical reasoning and has broader applicability. The authors will open-source their code to support further research.

Abstract: Recent advances in reasoning models have yielded impressive results in mathematics and coding. However, most approaches rely on static datasets, which have been suggested to encourage memorisation and limit generalisation. We introduce DéjàQ, a framework that departs from this paradigm by jointly evolving a diverse set of synthetic mathematical problems alongside model training. This evolutionary process adapts to the model’s ability throughout training, optimising problems for learnability. We propose two LLM-driven mutation strategies in which the model itself mutates the training data, either by altering contextual details or by directly modifying problem structure. We find that the model can generate novel and meaningful problems, and that these LLM-driven mutations improve RL training. We analyse key aspects of DéjàQ, including the validity of generated problems and computational overhead. Our results underscore the potential of dynamically evolving training data to enhance mathematical reasoning and indicate broader applicability, which we will support by open-sourcing our code.

[615] SynRXN: An Open Benchmark and Curated Dataset for Computational Reaction Modeling

Tieu-Long Phan, Nhu-Ngoc Nguyen Song, Peter F. Stadler

Main category: cs.LG

TL;DR: SynRXN is a unified benchmarking framework and open-data resource for computer-aided synthesis planning that decomposes synthesis planning into five task families with standardized datasets, evaluation workflows, and reproducible build processes.

Details

Motivation: To address dataset heterogeneity and enable fair comparison of CASP methods by providing standardized, leakage-aware evaluation frameworks with transparent data provenance and reproducible workflows.

Method: Decomposes synthesis planning into five task families (reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, synthesis route design), curates reaction corpora from public sources into harmonized representations with metadata, provides transparent splitting functions for leakage-aware partitions, and includes standardized evaluation workflows with tailored metrics.

Result: Creates a comprehensive open-data resource with versioned datasets, machine-readable manifests, reproducible build recipes, and sensitive benchmarking approach that combines public training/validation data with held-out gold-standard test sets.

Conclusion: SynRXN enables fair longitudinal comparison of CASP methods, supports rigorous ablations and stress tests, and lowers barriers for practitioners seeking robust performance estimates for real-world synthesis planning workloads through standardized, transparent evaluation scaffolding.

Abstract: We present SynRXN, a unified benchmarking framework and open-data resource for computer-aided synthesis planning (CASP). SynRXN decomposes end-to-end synthesis planning into five task families, covering reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis route design. Curated, provenance-tracked reaction corpora are assembled from heterogeneous public sources into a harmonized representation and packaged as versioned datasets for each task family, with explicit source metadata, licence tags, and machine-readable manifests that record checksums, and row counts. For every task, SynRXN provides transparent splitting functions that generate leakage-aware train, validation, and test partitions, together with standardized evaluation workflows and metric suites tailored to classification, regression, and structured prediction settings. For sensitive benchmarking, we combine public training and validation data with held-out gold-standard test sets, and contamination-prone tasks such as reaction rebalancing and atom-to-atom mapping are distributed only as evaluation sets and are explicitly not intended for model training. Scripted build recipes enable bitwise-reproducible regeneration of all corpora across machines and over time, and the entire resource is released under permissive open licences to support reuse and extension. By removing dataset heterogeneity and packaging transparent, reusable evaluation scaffolding, SynRXN enables fair longitudinal comparison of CASP methods, supports rigorous ablations and stress tests along the full reaction-informatics pipeline, and lowers the barrier for practitioners who seek robust and comparable performance estimates for real-world synthesis planning workloads.

Bo Yin, Qi Li, Runpeng Yu, Xinchao Wang

Main category: cs.LG

TL;DR: The paper introduces Refinement Provenance Inference (RPI) to audit whether a fine-tuned model was trained on original prompts or their LLM-refined versions, proposing RePro framework that detects distribution shifts in token likelihoods.

Details

Motivation: As instruction tuning increasingly uses LLM-based prompt refinement, there's a need for instance-level audit methods to determine whether models were trained on original or refined prompts, which is crucial for dataset governance and dispute resolution when training data are contested.

Method: Proposes RePro framework that detects stable distribution shifts in teacher-forced token distributions from prompt refinement. Uses logit-based provenance approach combining teacher-forced likelihood features with logit-ranking signals, learns transferable representations via shadow fine-tuning, and applies lightweight linear head for inference without training-data access.

Result: RePro consistently achieves strong performance and transfers well across different refiners, suggesting it exploits refiner-agnostic distribution shifts rather than rewrite-style artifacts.

Conclusion: Prompt refinement creates detectable distribution shifts that enable reliable provenance inference, and the proposed RePro framework provides an effective solution for the RPI audit task with good generalization across models and training setups.

Abstract: Instruction tuning increasingly relies on LLM-based prompt refinement, where prompts in the training corpus are selectively rewritten by an external refiner to improve clarity and instruction alignment. This motivates an instance-level audit problem: for a fine-tuned model and a training prompt-response pair, can we infer whether the model was trained on the original prompt or its LLM-refined version within a mixed corpus? This matters for dataset governance and dispute resolution when training data are contested. However, it is non-trivial in practice: refined and raw instances are interleaved in the training corpus with unknown, source-dependent mixture ratios, making it harder to develop provenance methods that generalize across models and training setups. In this paper, we formalize this audit task as Refinement Provenance Inference (RPI) and show that prompt refinement yields stable, detectable shifts in teacher-forced token distributions, even when semantic differences are not obvious. Building on this phenomenon, we propose RePro, a logit-based provenance framework that fuses teacher-forced likelihood features with logit-ranking signals. During training, RePro learns a transferable representation via shadow fine-tuning, and uses a lightweight linear head to infer provenance on unseen victims without training-data access. Empirically, RePro consistently attains strong performance and transfers well across refiners, suggesting that it exploits refiner-agnostic distribution shifts rather than rewrite-style artifacts.

[617] SerpentFlow: Generative Unpaired Domain Alignment via Shared-Structure Decomposition

Julie Keisler, Anastase Alexandre Charantonis, Yannig Goude, Boutheina Oueslati, Claire Monteleoni

Main category: cs.LG

TL;DR: SerpentFlow is a generative framework for unpaired domain alignment that decomposes data into shared and domain-specific components, enabling conditional generative models in unpaired settings through synthetic training pairs.

Details

Motivation: Domain alignment without paired observations is challenging as it removes direct supervision across domains. The paper addresses the need for methods that can learn correspondences between domains that share underlying structural patterns despite differences in their specific realizations.

Method: SerpentFlow decomposes data in latent space into shared (common to both domains) and domain-specific components. By isolating shared structure and replacing domain-specific components with stochastic noise, it constructs synthetic training pairs between shared representations and target-domain samples. This enables conditional generative models in unpaired settings. The method uses a classifier-based criterion to automatically determine the cutoff frequency separating low- and high-frequency components for data-driven decomposition.

Result: Experiments on synthetic images, physical process simulations, and climate downscaling tasks demonstrate that SerpentFlow effectively reconstructs high-frequency structures consistent with underlying low-frequency patterns, supporting shared-structure decomposition as an effective strategy for unpaired domain alignment.

Conclusion: Shared-structure decomposition provides an effective strategy for unpaired domain alignment, enabling conditional generative models traditionally restricted to paired settings. The framework is compatible with various conditional generative approaches and shows promise for applications like super-resolution tasks where low-frequency content corresponds to shared structure and high-frequency details capture domain-specific variability.

Abstract: Domain alignment refers broadly to learning correspondences between data distributions from distinct domains. In this work, we focus on a setting where domains share underlying structural patterns despite differences in their specific realizations. The task is particularly challenging in the absence of paired observations, which removes direct supervision across domains. We introduce a generative framework, called SerpentFlow (SharEd-structuRe decomPosition for gEnerative domaiN adapTation), for unpaired domain alignment. SerpentFlow decomposes data within a latent space into a shared component common to both domains and a domain-specific one. By isolating the shared structure and replacing the domain-specific component with stochastic noise, we construct synthetic training pairs between shared representations and target-domain samples, thereby enabling the use of conditional generative models that are traditionally restricted to paired settings. We apply this approach to super-resolution tasks, where the shared component naturally corresponds to low-frequency content while high-frequency details capture domain-specific variability. The cutoff frequency separating low- and high-frequency components is determined automatically using a classifier-based criterion, ensuring a data-driven and domain-adaptive decomposition. By generating pseudo-pairs that preserve low-frequency structures while injecting stochastic high-frequency realizations, we learn the conditional distribution of the target domain given the shared representation. We implement SerpentFlow using Flow Matching as the generative pipeline, although the framework is compatible with other conditional generative approaches. Experiments on synthetic images, physical process simulations, and a climate downscaling task demonstrate that the method effectively reconstructs high-frequency structures consistent with underlying low-frequency patterns, supporting shared-structure decomposition as an effective strategy for unpaired domain alignment.

[618] Prior Diffusiveness and Regret in the Linear-Gaussian Bandit

Yifan Zhu, John C. Duchi, Benjamin Van Roy

Main category: cs.LG

TL;DR: Thompson sampling achieves $\tilde{O}(σd \sqrt{T} + d r \sqrt{\mathrm{Tr}(Σ_0)})$ Bayesian regret in linear-Gaussian bandits, showing additive rather than multiplicative dependence between minimax regret and prior-dependent burn-in term.

Details

Motivation: Previous regret bounds for Thompson sampling in linear-Gaussian bandits show multiplicative dependence between the minimax regret term and the prior-dependent "burn-in" term. The authors aim to prove that these terms actually decouple additively, providing a more refined understanding of Thompson sampling's performance.

Method: The authors prove their regret bound using a new “elliptical potential” lemma. They also provide a lower bound to demonstrate that the burn-in term is unavoidable, establishing the tightness of their result.

Result: Thompson sampling achieves $\tilde{O}(σd \sqrt{T} + d r \sqrt{\mathrm{Tr}(Σ_0)})$ Bayesian regret, where the prior-dependent burn-in term $d r \sqrt{\mathrm{Tr}(Σ_0)}$ adds to the minimax term $σd \sqrt{T}$ rather than multiplying it. This represents an improvement over previous bounds that showed multiplicative dependence.

Conclusion: The paper establishes that Thompson sampling’s regret in linear-Gaussian bandits consists of two additive components: a minimax term and a prior-dependent burn-in term. The new elliptical potential lemma enables this refined analysis, and the lower bound confirms the necessity of the burn-in term.

Abstract: We prove that Thompson sampling exhibits $\tilde{O}(σd \sqrt{T} + d r \sqrt{\mathrm{Tr}(Σ_0)})$ Bayesian regret in the linear-Gaussian bandit with a $\mathcal{N}(μ_0, Σ_0)$ prior distribution on the coefficients, where $d$ is the dimension, $T$ is the time horizon, $r$ is the maximum $\ell_2$ norm of the actions, and $σ^2$ is the noise variance. In contrast to existing regret bounds, this shows that to within logarithmic factors, the prior-dependent burn-in'' term $d r \sqrt{\mathrm{Tr}(Σ_0)}$ decouples additively from the minimax (long run) regret $σd \sqrt{T}$. Previous regret bounds exhibit a multiplicative dependence on these terms. We establish these results via a new elliptical potential’’ lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.

[619] GDRO: Group-level Reward Post-training Suitable for Diffusion Models

Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao

Main category: cs.LG

TL;DR: GDRO is a new offline post-training method for aligning rectified flow diffusion models with group-level rewards, addressing efficiency, stochasticity, and reward hacking issues in existing online RL approaches.

Details

Motivation: Existing online RL methods for aligning text-to-image rectified flow models with rewards face three key problems: 1) Low efficiency due to time-consuming online image sampling, 2) Dependency on stochastic samplers (rectified flow is deterministic once noise is fixed), and 3) Reward hacking issues that can mislead evaluation.

Method: Group-level Direct Reward Optimization (GDRO) - a post-training paradigm that combines group-level reward alignment with rectified flow model characteristics. It supports full offline training (eliminating image rollout sampling time), is diffusion-sampler-independent (no need for ODE-to-SDE approximation), and includes corrected evaluation metrics to account for reward hacking.

Result: GDRO effectively improves reward scores across OCR and GenEval tasks through group-wise offline optimization. It demonstrates strong stability and robustness in mitigating reward hacking while being more efficient than online RL approaches.

Conclusion: GDRO provides an efficient, stable, and robust solution for aligning rectified flow diffusion models with group-level rewards, overcoming the limitations of online RL methods through offline optimization and proper handling of reward hacking.

Abstract: Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.

[620] Multivariate Time-series Anomaly Detection via Dynamic Model Pool & Ensembling

Wei Hu, Zewei Yu, Jianqiu Xu

Main category: cs.LG

TL;DR: DMPEAD: Dynamic Model Pool and Ensembling framework for MTS anomaly detection that addresses limitations of existing multi-model methods through adaptive pool construction, updating, and ensembling.

Details

Motivation: Current multi-model methods for multivariate time-series anomaly detection have three main limitations: (1) selection methods rely on single models and are sensitive to strategy choices, (2) ensembling methods either combine all models or are restricted to univariate data, and (3) most methods depend on fixed data dimensionality, limiting scalability.

Method: Three-step framework: (1) Construct diverse model pool using parameter transfer and diversity metrics, (2) Update pool with meta-model and similarity-based strategy for adaptive expansion, subset selection, and merging, (3) Ensemble top-ranked models through proxy metric ranking and top-k aggregation in selected subset.

Result: Extensive experiments on 8 real-world datasets show DMPEAD outperforms all baselines, demonstrating superior adaptability and scalability.

Conclusion: DMPEAD effectively addresses limitations of existing multi-model methods for MTS anomaly detection through dynamic model pool construction and adaptive ensembling, achieving state-of-the-art performance.

Abstract: Multivariate time-series (MTS) anomaly detection is critical in domains such as service monitor, IoT, and network security. While multi-model methods based on selection or ensembling outperform single-model ones, they still face limitations: (i) selection methods rely on a single chosen model and are sensitive to the strategy; (ii) ensembling methods often combine all models or are restricted to univariate data; and (iii) most methods depend on fixed data dimensionality, limiting scalability. To address these, we propose DMPEAD, a Dynamic Model Pool and Ensembling framework for MTS Anomaly Detection. The framework first (i) constructs a diverse model pool via parameter transfer and diversity metric, then (ii) updates it with a meta-model and similarity-based strategy for adaptive pool expansion, subset selection, and pool merging, finally (iii) ensembles top-ranked models through proxy metric ranking and top-k aggregation in the selected subset, outputting the final anomaly detection result. Extensive experiments on 8 real-world datasets show that our model outperforms all baselines, demonstrating superior adaptability and scalability.

[621] Explore the Ideology of Deep Learning in ENSO Forecasts

Yanhai Gan, Yipeng Chen, Ning Li, Xingguo Liu, Junyu Dong, Xianyao Chen

Main category: cs.LG

TL;DR: A deep learning framework with mathematical interpretability improves ENSO prediction by rescuing saturated neurons and analyzing predictability sources, revealing tropical Pacific dominance and Spring Predictability Barrier limitations.

Details

Motivation: While deep learning has improved ENSO forecasting, the opacity of these models limits scientific trust and operational deployment, creating a need for interpretable frameworks that maintain predictive performance while providing physical insights.

Method: Introduces a mathematically grounded interpretability framework based on bounded variation function that rescues “dead” neurons from activation function saturation zones, enhancing model expressive capacity while maintaining interpretability.

Result: Analysis reveals ENSO predictability primarily originates from tropical Pacific, with contributions from Indian and Atlantic Oceans, aligning with physical understanding. Controlled experiments confirm method robustness and established predictor alignment. Spring Predictability Barrier persists despite expanded sensitivity, suggesting suboptimal variable selection limits performance.

Conclusion: The interpretable deep learning framework successfully identifies ENSO predictability sources consistent with physical understanding. The persistent Spring Predictability Barrier suggests incorporating additional ocean-atmosphere variables could transcend current limitations and advance long-range ENSO prediction.

Abstract: The El Ni{~n}o-Southern Oscillation (ENSO) exerts profound influence on global climate variability, yet its prediction remains a grand challenge. Recent advances in deep learning have significantly improved forecasting skill, but the opacity of these models hampers scientific trust and operational deployment. Here, we introduce a mathematically grounded interpretability framework based on bounded variation function. By rescuing the “dead” neurons from the saturation zone of the activation function, we enhance the model’s expressive capacity. Our analysis reveals that ENSO predictability emerges dominantly from the tropical Pacific, with contributions from the Indian and Atlantic Oceans, consistent with physical understanding. Controlled experiments affirm the robustness of our method and its alignment with established predictors. Notably, we probe the persistent Spring Predictability Barrier (SPB), finding that despite expanded sensitivity during spring, predictive performance declines-likely due to suboptimal variable selection. These results suggest that incorporating additional ocean-atmosphere variables may help transcend SPB limitations and advance long-range ENSO prediction.

[622] The Homogeneity Trap: Spectral Collapse in Doubly-Stochastic Deep Networks

Yizhi Liu

Main category: cs.LG

TL;DR: DSM constraints in deep architectures cause spectral degradation (Homogeneity Trap) where high-entropy bias suppresses feature diversity, limiting effective depth and causing irreversible noise-induced collapse.

Details

Motivation: Doubly-stochastic matrices are widely used in deep architectures for stability and interpretability, but their spectral properties and limitations need investigation to understand fundamental trade-offs between stability and expressivity.

Method: Analyzed spectral degradation in DSM-constrained networks, derived spectral bounds linking subdominant singular values to effective depth, and formally demonstrated limitations of Layer Normalization in noise-dominated regimes.

Result: Identified Homogeneity Trap where maximum-entropy bias drives mixing operators toward uniform barycenter, suppressing feature diversity and restricting effective receptive field. Layer Normalization fails when SNR drops below critical threshold, leading to irreversible orthogonal collapse.

Conclusion: There’s a fundamental trade-off between entropic stability and spectral expressivity in DSM-constrained networks - high entropy constraints limit feature transformation capacity and can cause irreversible structural loss in noisy conditions.

Abstract: Doubly-stochastic matrices (DSM) are increasingly utilized in structure-preserving deep architectures – such as Optimal Transport layers and Sinkhorn-based attention – to enforce numerical stability and probabilistic interpretability. In this work, we identify a critical spectral degradation phenomenon inherent to these constraints, termed the Homogeneity Trap. We demonstrate that the maximum-entropy bias, typical of Sinkhorn-based projections, drives the mixing operator towards the uniform barycenter, thereby suppressing the subdominant singular value σ_2 and filtering out high-frequency feature components. We derive a spectral bound linking σ_2 to the network’s effective depth, showing that high-entropy constraints restrict feature transformation to a shallow effective receptive field. Furthermore, we formally demonstrate that Layer Normalization fails to mitigate this collapse in noise-dominated regimes; specifically, when spectral filtering degrades the Signal-to-Noise Ratio (SNR) below a critical threshold, geometric structure is irreversibly lost to noise-induced orthogonal collapse. Our findings highlight a fundamental trade-off between entropic stability and spectral expressivity in DSM-constrained networks.

[623] A Differentiable Adversarial Framework for Task-Aware Data Subsampling

Jiacheng Lyu, Bihua Bao

Main category: cs.LG

TL;DR: ASSS is a differentiable, task-aware data subsampling framework that uses adversarial learning between selector and task networks to assign importance weights to samples, outperforming heuristic methods while sometimes exceeding full dataset performance.

Details

Motivation: Traditional data subsampling methods are static and task-independent, discarding potentially critical information for downstream tasks. There's a need for intelligent, task-aware data reduction that preserves maximum relevant information.

Method: ASSS uses an adversarial game between a selector network and a task network. The selector assigns continuous importance weights to samples via Gumbel-Softmax relaxation, directly optimizing for samples with maximum task-relevant information while balancing fidelity and sparsity.

Result: ASSS consistently outperforms heuristic baselines (clustering, nearest neighbor thinning) on four large-scale real-world datasets. Notably, it can sometimes exceed the training performance of using the entire dataset, demonstrating intelligent denoising effects.

Conclusion: ASSS establishes task-aware data subsampling as a learnable component, providing a principled solution for effective large-scale data learning through differentiable end-to-end optimization guided by information bottleneck principles.

Abstract: The proliferation of large-scale datasets poses a major computational challenge to model training. The traditional data subsampling method works as a static, task independent preprocessing step which usually discards information that is critical to downstream prediction. In this paper, we introduces the antagonistic soft selection subsampling (ASSS) framework as is a novel paradigm that reconstructs data reduction into a differentiable end-to-end learning problem. ASSS uses the adversarial game between selector network and task network, and selector network learning assigns continuous importance weights to samples. This direct optimization implemented by Gumbel-Softmax relaxation allows the selector to identify and retain samples with the maximum amount of information for a specific task target under the guidance of the loss function that balances the fidelity and sparsity of the prediction. Theoretical analysis links this framework with the information bottleneck principle. Comprehensive experiments on four large-scale real world datasets show that ASSS has always been better than heuristic subsampling baselines such as clustering and nearest neighbor thinning in maintaining model performance. It is worth noting that ASSS can not only match, but also sometimes exceed the training performance of the entire dataset, showcasing the effect of intelligent denoising. This work establishes task aware data subsampling as a learnable component, providing a principled solution for effective large-scale data learning.

[624] LION-DG: Layer-Informed Initialization with Deep Gradient Protocols for Accelerated Neural Network Training

Hyunjun Kim

Main category: cs.LG

TL;DR: LION-DG is a layer-informed initialization method that zero-initializes auxiliary classifier heads while using standard He-initialization for the backbone, enabling gradient awakening without hyperparameters.

Details

Motivation: Existing weight initialization methods are layer-agnostic and don't address the problem of untrained auxiliary classifier heads in deeply-supervised architectures, which can destabilize early training through gradient interference.

Method: Proposes LION-DG (Layer-Informed Initialization for Deeply-supervised architectures with Gradient awakening) that zero-initializes auxiliary classifier heads while applying standard He-initialization to the backbone network.

Result: Experiments show: DenseNet-DS achieves +8.3% faster convergence on CIFAR-10; Hybrid LSUV+LION-DG achieves best accuracy (81.92% on CIFAR-10); ResNet-DS shows +11.3% speedup on CIFAR-100 with side-tap auxiliary design.

Conclusion: LION-DG provides a simple, hyperparameter-free solution that enables gradient awakening for auxiliary heads, improves convergence speed and accuracy, and adds no computational overhead.

Abstract: Weight initialization remains decisive for neural network optimization, yet existing methods are largely layer-agnostic. We study initialization for deeply-supervised architectures with auxiliary classifiers, where untrained auxiliary heads can destabilize early training through gradient interference. We propose LION-DG, a layer-informed initialization that zero-initializes auxiliary classifier heads while applying standard He-initialization to the backbone. We prove that this implements Gradient Awakening: auxiliary gradients are exactly zero at initialization, then phase in naturally as weights grow – providing an implicit warmup without hyperparameters. Experiments on CIFAR-10 and CIFAR-100 with DenseNet-DS and ResNet-DS architectures demonstrate: (1) DenseNet-DS: +8.3% faster convergence on CIFAR-10 with comparable accuracy, (2) Hybrid approach: Combining LSUV with LION-DG achieves best accuracy (81.92% on CIFAR-10), (3) ResNet-DS: Positive speedup on CIFAR-100 (+11.3%) with side-tap auxiliary design. We identify architecture-specific trade-offs and provide clear guidelines for practitioners. LION-DG is simple, requires zero hyperparameters, and adds no computational overhead.

[625] Horizon Activation Mapping for Neural Networks in Time Series Forecasting

Hans Krupakar, V A Kandappan

Main category: cs.LG

TL;DR: HAM (Horizon Activation Mapping) is a visual interpretability technique for time series forecasting models that uses gradient norm averages to analyze subseries importance across different horizon timesteps, enabling model-agnostic comparison and selection.

Details

Motivation: Existing interpretability approaches for time series forecasting are model-specific and don't work across different neural network families, making it difficult to compare and select models consistently.

Method: HAM adapts grad-CAM from computer vision to time series by using gradient norm averages across subseries at each timestep. It introduces causal and anti-causal modes, and studies optimization landscapes with respect to batch sizes, early stopping, train-val-test splits, and dropouts.

Result: HAM successfully visualizes gradient activities across various state-of-the-art models (CycleNet, N-Linear, N-HITS, FEDformer, Pyraformer, SpaceTime, Multi-Resolution DDPM). Batch size differences reveal potential exponential approximations, and specific patterns are observed for NHITS and SpaceTime models.

Conclusion: HAM provides a unified interpretability framework for time series forecasting models across different architectures, enabling granular model selection, validation set choices, and cross-family comparisons.

Abstract: Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don’t apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon’s subseries where grad-CAM studies attention maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape studies with respect to changes in batch sizes, early stopping, train-val-test splits, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch size based differences in activities seem to indicate potential for existence of an exponential approximation across them per epoch relative to each other. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS’ neural approximation theorem and SpaceTime’s exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.

[626] Prototype-Based Learning for Healthcare: A Demonstration of Interpretable AI

Ashish Rana, Ammar Shaker, Sascha Saralajew, Takashi Suzuki, Kosuke Yasuda, Shintaro Kato, Toshikazu Wada, Toshiyuki Fujikawa, Toru Kikutsuji

Main category: cs.LG

TL;DR: ProtoPal framework uses prototype-based learning for personalized preventive healthcare, offering both high performance and intuitive explanations of interventions and outcomes.

Details

Motivation: There's a gap in personalized preventive healthcare where predictions and interventions need to be both understandable and verifiable for all healthcare stakeholders, despite advances in ML and explainable AI.

Method: ProtoPal framework uses prototype-based learning with both front-end and back-end modes to provide intuitive presentation of interventions and their simulated outcomes.

Result: The framework achieves superior quantitative performance while providing understandable explanations of healthcare interventions.

Conclusion: Prototype-based learning can effectively address the need for both high performance and explainability in personalized preventive healthcare systems.

Abstract: Despite recent advances in machine learning and explainable AI, a gap remains in personalized preventive healthcare: predictions, interventions, and recommendations should be both understandable and verifiable for all stakeholders in the healthcare sector. We present a demonstration of how prototype-based learning can address these needs. Our proposed framework, ProtoPal, features both front- and back-end modes; it achieves superior quantitative performance while also providing an intuitive presentation of interventions and their simulated outcomes.

[627] Edge-aware GAT-based protein binding site prediction

Weisen Yang, Hanqing Zhang, Wangren Qiu, Xuan Xiao, Weizhong Lin

Main category: cs.LG

TL;DR: Edge-aware Graph Attention Network for accurate and efficient prediction of protein binding sites across various biomolecules, achieving state-of-the-art performance with ROC-AUC of 0.93.

Details

Motivation: Traditional methods struggle to balance prediction accuracy with computational efficiency when capturing complex spatial conformations of protein binding sites, which is crucial for understanding biomolecular interactions and drug design.

Method: Proposes Edge-aware Graph Attention Network (Edge-aware GAT) that constructs atom-level graphs with multidimensional structural features (geometric descriptors, DSSP secondary structure, RSA). Incorporates interatomic distances and directional vectors as edge features in attention mechanism, and uses directional tensor propagation with residue-level attention pooling.

Result: Achieves ROC-AUC of 0.93 for protein-protein binding site prediction on benchmark datasets, outperforming several state-of-the-art methods. Visualizations confirm practical utility and interpretability. Public web server deployed at http://119.45.201.89:5000/.

Conclusion: The approach offers a novel and efficient solution that balances prediction accuracy, generalization, and interpretability for identifying functional sites in proteins across various biomolecules.

Abstract: Accurate identification of protein binding sites is crucial for understanding biomolecular interaction mechanisms and for the rational design of drug targets. Traditional predictive methods often struggle to balance prediction accuracy with computational efficiency when capturing complex spatial conformations. To address this challenge, we propose an Edge-aware Graph Attention Network (Edge-aware GAT) model for the fine-grained prediction of binding sites across various biomolecules, including proteins, DNA/RNA, ions, ligands, and lipids. Our method constructs atom-level graphs and integrates multidimensional structural features, including geometric descriptors, DSSP-derived secondary structure, and relative solvent accessibility (RSA), to generate spatially aware embedding vectors. By incorporating interatomic distances and directional vectors as edge features within the attention mechanism, the model significantly enhances its representation capacity. On benchmark datasets, our model achieves an ROC-AUC of 0.93 for protein-protein binding site prediction, outperforming several state-of-the-art methods. The use of directional tensor propagation and residue-level attention pooling further improves both binding site localization and the capture of local structural details. Visualizations using PyMOL confirm the model’s practical utility and interpretability. To facilitate community access and application, we have deployed a publicly accessible web server at http://119.45.201.89:5000/. In summary, our approach offers a novel and efficient solution that balances prediction accuracy, generalization, and interpretability for identifying functional sites in proteins.

[628] Learning with Monotone Adversarial Corruptions

Kasper Green Larsen, Chirag Pabbaraju, Abhishek Shetty

Main category: cs.LG

TL;DR: Standard ML algorithms fail under monotone adversarial corruption despite helpful-looking corruptions, while uniform convergence methods remain robust.

Details

Motivation: To investigate how standard ML algorithms rely on exchangeability and independence assumptions by testing them against monotone adversarial corruptions that appear helpful but break optimal learning.

Method: Introduce a monotone adversarial corruption model where an adversary inserts corrupted points that are labeled according to the ground-truth target function. Test known optimal binary classification algorithms and uniform convergence-based methods under this corruption.

Result: All known optimal binary classification algorithms achieve suboptimal expected error under monotone corruptions, while uniform convergence-based algorithms maintain their performance guarantees.

Conclusion: Optimal learning algorithms over-rely on exchangeability assumptions and break down under seemingly helpful monotone corruptions, exposing fundamental limitations in their design.

Abstract: We study the extent to which standard machine learning algorithms rely on exchangeability and independence of data by introducing a monotone adversarial corruption model. In this model, an adversary, upon looking at a “clean” i.i.d. dataset, inserts additional “corrupted” points of their choice into the dataset. These added points are constrained to be monotone corruptions, in that they get labeled according to the ground-truth target function. Perhaps surprisingly, we demonstrate that in this setting, all known optimal learning algorithms for binary classification can be made to achieve suboptimal expected error on a new independent test point drawn from the same distribution as the clean dataset. On the other hand, we show that uniform convergence-based algorithms do not degrade in their guarantees. Our results showcase how optimal learning algorithms break down in the face of seemingly helpful monotone corruptions, exposing their overreliance on exchangeability.

[629] ACDZero: Graph-Embedding-Based Tree Search for Mastering Automated Cyber Defense

Yu Li, Sizhe Tang, Rongqian Chen, Fei Xu Yu, Guangyu Jiang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

Main category: cs.LG

TL;DR: The paper proposes a planning-centric defense policy using Monte Carlo Tree Search (MCTS) with graph neural networks for automated cyber defense, achieving better performance than RL baselines in complex network scenarios.

Details

Motivation: Existing deep reinforcement learning approaches for automated cyber defense face difficult exploration in complex networks with large decision/state spaces, requiring expensive amounts of samples. There's a need for sample-efficient defense policies.

Method: Frames automated cyber defense as a context-based partially observable Markov decision problem. Uses Monte Carlo Tree Search (MCTS) for planning, with graph neural networks to embed network observations as attributed graphs for permutation-invariant reasoning. Combines learned graph embeddings and priors over graph-edit actions with MCTS, integrating model-free generalization and policy distillation with look-ahead planning.

Result: The proposed search-guided, graph-embedding-based planning agent improves defense reward and robustness relative to state-of-the-art RL baselines on CAGE Challenge 4 scenarios involving diverse network structures and adversary behaviors.

Conclusion: The planning-centric approach using MCTS with graph neural networks provides a more sample-efficient and effective solution for automated cyber defense in complex network environments compared to traditional reinforcement learning methods.

Abstract: Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require an expensive amount of samples. Inspired by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.

[630] CORE: Code-based Inverse Self-Training Framework with Graph Expansion for Virtual Agents

Keyu Wang, Bingchen Miao, Wendong Bu, Yu Wu, Juncheng Li, Shengyu Zhang, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

Main category: cs.LG

TL;DR: CORE is a code-based inverse self-training framework that bridges imitation learning and reinforcement learning for multimodal virtual agents, eliminating manual reward design while enhancing behavioral diversity through semantic code abstraction and graph expansion.

Details

Motivation: Current training paradigms for multimodal virtual agents face limitations: Behavior Cloning suffers from low behavioral diversity, while Reinforcement Learning requires manually designed reward functions. There's a need to bridge imitation and exploration without manual reward design.

Method: CORE introduces: 1) Semantic Code Abstraction to automatically infer reward functions (Label Functions) from expert demonstrations as executable code; 2) Strategy Graph Expansion to construct multi-path graphs capturing diverse valid solutions; 3) Trajectory-Guided Extrapolation using both successful and failed trajectories to expand task space.

Result: Experiments on Web and Android platforms show CORE significantly improves both overall performance and generalization, demonstrating its effectiveness as a robust training paradigm for virtual agents.

Conclusion: CORE presents a novel framework that bridges imitation and exploration, eliminating manual reward design while promoting behavioral diversity, offering a promising approach for building powerful and generalizable multimodal virtual agents.

Abstract: The development of Multimodal Virtual Agents has made significant progress through the integration of Multimodal Large Language Models. However, mainstream training paradigms face key challenges: Behavior Cloning is simple and effective through imitation but suffers from low behavioral diversity, while Reinforcement Learning is capable of discovering novel strategies through exploration but heavily relies on manually designed reward functions. To address the conflict between these two methods, we present CORE, a Code-based Inverse Self-Training Framework with Graph Expansion that bridges imitation and exploration, offering a novel training framework that promotes behavioral diversity while eliminating the reliance on manually reward design. Specifically, we introduce Semantic Code Abstraction to automatically infers reward functions from expert demonstrations without manual design. The inferred reward function, referred to as the Label Function, is executable code that verifies one key step within a task. Building on this, we propose Strategy Graph Expansion to enhance in-domain behavioral diversity, which constructs a multi-path graph called Strategy Graph that captures diverse valid solutions beyond expert demonstrations. Furthermore, we introduce Trajectory-Guided Extrapolation, which enriches out-of-domain behavioral diversity by utilizing both successful and failed trajectories to expand the task space. Experiments on Web and Android platforms demonstrate that CORE significantly improves both overall performance and generalization, highlighting its potential as a robust and generalizable training paradigm for building powerful virtual agents.

[631] Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction

Haoyu Zhou, Ping Xue, Tianfan Fu, Hao Zhang

Main category: cs.LG

TL;DR: This paper introduces quantization techniques to compress and accelerate SO(3)-equivariant graph neural networks for deployment on edge devices, achieving comparable accuracy with 2.37-2.73x faster inference and 4x smaller models.

Details

Motivation: 3D graph neural networks that are equivariant to 3D rotations (SO(3)) have high computational costs that make deployment on edge devices challenging, limiting their practical use in chemistry applications.

Method: Three innovations: (1) magnitude-direction decoupled quantization that separately quantizes norm and orientation of equivariant features, (2) branch-separated quantization-aware training that treats invariant and equivariant channels differently in attention-based SO(3)-GNNs, and (3) robustness-enhancing attention normalization to stabilize low-precision attention computations.

Result: 8-bit models achieve accuracy on energy and force predictions comparable to full-precision baselines on QM9 and rMD17 molecular benchmarks, with 2.37-2.73x faster inference and 4x smaller model size while maintaining equivariance properties.

Conclusion: The proposed quantization techniques enable practical deployment of symmetry-aware GNNs in chemistry applications without sacrificing accuracy or physical symmetry, making SO(3)-equivariant models viable for edge devices.

Abstract: Deploying 3D graph neural networks (GNNs) that are equivariant to 3D rotations (the group SO(3)) on edge devices is challenging due to their high computational cost. This paper addresses the problem by compressing and accelerating an SO(3)-equivariant GNN using low-bit quantization techniques. Specifically, we introduce three innovations for quantized equivariant transformers: (1) a magnitude-direction decoupled quantization scheme that separately quantizes the norm and orientation of equivariant (vector) features, (2) a branch-separated quantization-aware training strategy that treats invariant and equivariant feature channels differently in an attention-based $SO(3)$-GNN, and (3) a robustness-enhancing attention normalization mechanism that stabilizes low-precision attention computations. Experiments on the QM9 and rMD17 molecular benchmarks demonstrate that our 8-bit models achieve accuracy on energy and force predictions comparable to full-precision baselines with markedly improved efficiency. We also conduct ablation studies to quantify the contribution of each component to maintain accuracy and equivariance under quantization, using the Local error of equivariance (LEE) metric. The proposed techniques enable the deployment of symmetry-aware GNNs in practical chemistry applications with 2.37–2.73x faster inference and 4x smaller model size, without sacrificing accuracy or physical symmetry.

[632] ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy

Main category: cs.LG

TL;DR: ELLA is a continual learning framework for LLMs that prevents catastrophic forgetting without data replay by selectively decorrelating task-specific subspaces while preserving shared representations for forward transfer.

Details

Motivation: LLMs suffer severe catastrophic forgetting when adapted sequentially to new tasks. Existing approaches are limited: replay-based methods are impractical and violate privacy, while strict orthogonality methods reduce degrees of freedom and eliminate forward transfer by forbidding all overlap in representations.

Method: ELLA uses selective subspace de-correlation - it characterizes past update structures and penalizes alignments along high-energy, task-specific directions while preserving freedom in low-energy residual subspaces. This is implemented via a lightweight regularizer on a single aggregated update matrix, corresponding to an anisotropic shrinkage operator that bounds interference.

Result: ELLA achieves state-of-the-art CL performance on three benchmarks with up to 9.6% relative accuracy gains and 35× smaller memory footprint. It requires no data replay, no architectural expansion, and negligible storage. It scales robustly across architectures and enhances zero-shot generalization on unseen tasks.

Conclusion: ELLA provides a principled and scalable solution for constructive lifelong LLM adaptation through selective subspace de-correlation, enabling forward transfer while preventing catastrophic forgetting without privacy-violating replay methods.

Abstract: Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model’s zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.

[633] Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission

Emrah Mete, Emin Erkan Korkmaz

Main category: cs.LG

TL;DR: Neuro-Channel Networks (NCN) is a multiplication-free neural architecture inspired by biological synapses that replaces weights with channel widths and neurotransmitters, using only addition/subtraction operations to enable efficient AI on commodity hardware.

Details

Motivation: Current deep learning is constrained by expensive, energy-intensive GPUs due to reliance on matrix multiplications. Biological nervous systems achieve efficiency without arithmetic intensity, using physical ion channel limits and neurotransmitter regulation instead.

Method: Propose Neuro-Channel Networks (NCN) that replace weights with Channel Widths (physical signal limits) and Neurotransmitter parameters (regulate signal transmission based on sign logic). Forward pass uses only addition, subtraction, and bitwise operations (minimum, sign), eliminating floating-point multiplication entirely.

Result: NCNs can solve non-linearly separable problems like XOR and Majority function with 100% accuracy using standard backpropagation, demonstrating capability to form complex decision boundaries without multiplicative weights.

Conclusion: NCN architecture offers highly efficient alternative for next-generation neuromorphic hardware, enabling complex models to run on commodity CPUs or ultra-low-power chips without relying on costly GPU clusters.

Abstract: The rapid proliferation of Deep Learning is increasingly constrained by its heavy reliance on high-performance hardware, particularly Graphics Processing Units (GPUs). These specialized accelerators are not only prohibitively expensive and energy-intensive but also suffer from significant supply scarcity, limiting the ubiquity of Artificial Intelligence (AI) deployment on edge devices. The core of this inefficiency stems from the standard artificial perceptron’s dependence on intensive matrix multiplications. However, biological nervous systems achieve unparalleled efficiency without such arithmetic intensity; synaptic signal transmission is regulated by physical ion channel limits and chemical neurotransmitter levels rather than a process that can be analogous to arithmetic multiplication. Inspired by this biological mechanism, we propose Neuro-Channel Networks (NCN), a novel multiplication-free architecture designed to decouple AI from expensive hardware dependencies. In our model, weights are replaced with Channel Widths that physically limit the signal magnitude, while a secondary parameter acts as a Neurotransmitter to regulate Signal Transmission based on sign logic. The forward pass relies exclusively on addition, subtraction, and bitwise operations (minimum, sign), eliminating floating-point multiplication entirely. In this proof-of-concept study, we demonstrate that NCNs can solve non-linearly separable problems like XOR and the Majority function with 100% accuracy using standard backpropagation, proving their capability to form complex decision boundaries without multiplicative weights. This architecture offers a highly efficient alternative for next-generation neuromorphic hardware, paving the way for running complex models on commodity CPUs or ultra-low-power chips without relying on costly GPU clusters.

[634] POSEIDON: Physics-Optimized Seismic Energy Inference and Detection Operating Network

Boris Kriuk, Fedor Kriuk

Main category: cs.LG

TL;DR: POSEIDON is a physics-informed energy-based model for multi-task seismic prediction that embeds seismological laws as learnable constraints and achieves state-of-the-art performance across aftershock identification, tsunami potential, and foreshock detection.

Details

Motivation: Existing machine learning approaches for earthquake prediction often operate as black boxes that ignore established physical laws, creating a need for physics-informed models that can leverage both data and fundamental seismological principles.

Method: POSEIDON embeds fundamental seismological principles (Gutenberg-Richter magnitude-frequency relationship and Omori-Utsu aftershock decay law) as learnable constraints within an energy-based modeling framework. It uses the Poseidon dataset - the largest open-source global earthquake catalog with 2.8 million events spanning 30 years.

Result: POSEIDON achieves state-of-the-art performance across all three tasks (aftershock sequence identification, tsunami generation potential, and foreshock detection), outperforming gradient boosting, random forest, and CNN baselines with the highest average F1 score. The learned physics parameters converge to scientifically interpretable values within established seismological ranges.

Conclusion: Physics-informed machine learning models like POSEIDON can enhance predictive accuracy while maintaining scientific interpretability, and the publicly available Poseidon dataset provides valuable resources for advancing physics-informed seismic research.

Abstract: Earthquake prediction and seismic hazard assessment remain fundamental challenges in geophysics, with existing machine learning approaches often operating as black boxes that ignore established physical laws. We introduce POSEIDON (Physics-Optimized Seismic Energy Inference and Detection Operating Network), a physics-informed energy-based model for unified multi-task seismic event prediction, alongside the Poseidon dataset – the largest open-source global earthquake catalog comprising 2.8 million events spanning 30 years. POSEIDON embeds fundamental seismological principles, including the Gutenberg-Richter magnitude-frequency relationship and Omori-Utsu aftershock decay law, as learnable constraints within an energy-based modeling framework. The architecture simultaneously addresses three interconnected prediction tasks: aftershock sequence identification, tsunami generation potential, and foreshock detection. Extensive experiments demonstrate that POSEIDON achieves state-of-the-art performance across all tasks, outperforming gradient boosting, random forest, and CNN baselines with the highest average F1 score among all compared methods. Crucially, the learned physics parameters converge to scientifically interpretable values – Gutenberg-Richter b-value of 0.752 and Omori-Utsu parameters p=0.835, c=0.1948 days – falling within established seismological ranges while enhancing rather than compromising predictive accuracy. The Poseidon dataset is publicly available at https://huggingface.co/datasets/BorisKriuk/Poseidon, providing pre-computed energy features, spatial grid indices, and standardized quality metrics to advance physics-informed seismic research.

[635] Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck

Dina El Zein, James Henderson

Main category: cs.LG

TL;DR: NVDP: A privacy-preserving method that adds noise to transformer embeddings using a Nonparametric Variational Information Bottleneck layer to achieve differential privacy while maintaining utility.

Details

Motivation: Transformer embeddings can encode sensitive information from input text, making it vulnerable to adversarial recovery attacks. The multi-vector nature of transformer embeddings exacerbates this privacy risk.

Method: Proposes Nonparametric Variational Differential Privacy (NVDP) which integrates a Nonparametric Variational Information Bottleneck (NVIB) layer into transformer architecture to inject calibrated noise into multi-vector embeddings. Uses Rényi divergence and Bayesian Differential Privacy (BDP) for privacy measurement.

Result: Tested on GLUE benchmark, NVDP shows a useful tradeoff between privacy and accuracy. With lower noise levels, the model maintains high accuracy while offering strong privacy guarantees.

Conclusion: NVDP effectively balances privacy and utility for text data sharing by providing differential privacy protection for transformer embeddings through calibrated noise injection.

Abstract: We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings. It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy approach, integrating a Nonparametric Variational Information Bottleneck (NVIB) layer into the transformer architecture to inject noise into its multi-vector embeddings and thereby hide information, and measuring privacy protection with Rényi divergence and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to utility. We test NVDP on the GLUE benchmark and show that varying the noise level gives us a useful tradeoff between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.

[636] Temporal Kolmogorov-Arnold Networks (T-KAN) for High-Frequency Limit Order Book Forecasting: Efficiency, Interpretability, and Alpha Decay

Ahmad Makinde

Main category: cs.LG

TL;DR: T-KAN networks improve HFT predictions by replacing LSTM linear weights with learnable B-spline activations, achieving 19.1% F1-score improvement and 132.48% returns vs DeepLOB’s -82.76% drawdown.

Details

Motivation: High-frequency trading environments have noisy, non-linear limit order book data where traditional models like DeepLOB suffer from alpha decay and lose predictive power over longer time horizons.

Method: Introduces Temporal Kolmogorov-Arnold Networks (T-KAN) that replace fixed linear weights in standard LSTMs with learnable B-spline activation functions, allowing the model to learn the ‘shape’ of market signals rather than just magnitude.

Result: Achieved 19.1% relative improvement in F1-score at k=100 horizon, produced 132.48% return compared to DeepLOB’s -82.76% drawdown under 1.0 bps transaction costs, and demonstrated interpretability with visible ‘dead-zones’ in splines.

Conclusion: T-KAN networks effectively address alpha decay in HFT, provide superior predictive performance and returns, offer interpretability, and are optimized for low-latency FPGA implementation via High Level Synthesis.

Abstract: High-Frequency trading (HFT) environments are characterised by large volumes of limit order book (LOB) data, which is notoriously noisy and non-linear. Alpha decay represents a significant challenge, with traditional models such as DeepLOB losing predictive power as the time horizon (k) increases. In this paper, using data from the FI-2010 dataset, we introduce Temporal Kolmogorov-Arnold Networks (T-KAN) to replace the fixed, linear weights of standard LSTMs with learnable B-spline activation functions. This allows the model to learn the ‘shape’ of market signals as opposed to just their magnitude. This resulted in a 19.1% relative improvement in the F1-score at the k = 100 horizon. The efficacy of T-KAN networks cannot be understated, producing a 132.48% return compared to the -82.76% DeepLOB drawdown under 1.0 bps transaction costs. In addition to this, the T-KAN model proves quite interpretable, with the ‘dead-zones’ being clearly visible in the splines. The T-KAN architecture is also uniquely optimized for low-latency FPGA implementation via High level Synthesis (HLS). The code for the experiments in this project can be found at https://github.com/AhmadMak/Temporal-Kolmogorov-Arnold-Networks-T-KAN-for-High-Frequency-Limit-Order-Book-Forecasting.

[637] Game of Coding: Coding Theory in the Presence of Rational Adversaries, Motivated by Decentralized Machine Learning

Hanzaleh Akbari Nodehi, Viveck R. Cadambe, Mohammad Ali Maddah-Ali

Main category: cs.LG

TL;DR: This paper introduces a game-theoretic framework called “game of coding” that extends coding theory to handle rational adversaries in decentralized systems, particularly for decentralized machine learning, achieving data recovery even when adversarial nodes outnumber honest ones.

Details

Motivation: Traditional coding theory assumes worst-case adversarial models requiring honest nodes to outnumber adversarial ones. However, in emerging decentralized applications like DeML with incentive structures, adversaries behave rationally and strategically rather than purely maliciously, creating a need for new approaches.

Method: The paper introduces a novel game-theoretic framework called “game of coding” that extends coding theory to trust-minimized settings. It focuses on repetition coding and analyzes strategic interactions between rational adversaries and honest nodes in decentralized systems.

Result: The framework achieves two key features: (1) non-zero probability of data recovery even when adversarial nodes are in the majority, and (2) Sybil resistance where the equilibrium remains unchanged as adversarial nodes increase. The paper also explores scenarios with unknown adversary strategies.

Conclusion: The game of coding framework addresses limitations of classical coding theory in decentralized systems with rational adversaries, enabling reliable communication and computation even when honest nodes are not in the majority, with several open problems identified for future research.

Abstract: Coding theory plays a crucial role in enabling reliable communication, storage, and computation. Classical approaches assume a worst-case adversarial model and ensure error correction and data recovery only when the number of honest nodes exceeds the number of adversarial ones by some margin. However, in some emerging decentralized applications, particularly in decentralized machine learning (DeML), participating nodes are rewarded for accepted contributions. This incentive structure naturally gives rise to rational adversaries who act strategically rather than behaving in purely malicious ways. In this paper, we first motivate the need for coding in the presence of rational adversaries, particularly in the context of outsourced computation in decentralized systems. We contrast this need with existing approaches and highlight their limitations. We then introduce the game of coding, a novel game-theoretic framework that extends coding theory to trust-minimized settings where honest nodes are not in the majority. Focusing on repetition coding, we highlight two key features of this framework: (1) the ability to achieve a non-zero probability of data recovery even when adversarial nodes are in the majority, and (2) Sybil resistance, i.e., the equilibrium remains unchanged even as the number of adversarial nodes increases. Finally, we explore scenarios in which the adversary’s strategy is unknown and outline several open problems for future research.

[638] DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.LG

TL;DR: The paper proposes DatBench, a cleaned evaluation suite for vision-language models that addresses critical failures in current benchmarks by improving faithfulness, discriminability, and efficiency.

Details

Motivation: Current evaluation methods for vision-language models have critical flaws: multiple-choice formats reward guessing, benchmarks contain blindly solvable questions (up to 70%), mislabeled samples (up to 42%), and evaluation consumes excessive compute (up to 20% of development resources).

Method: The authors propose three evaluation desiderata (faithfulness, discriminability, efficiency) and curate existing benchmarks through transformation (converting multiple-choice to generative tasks) and filtering (removing blindly solvable and mislabeled samples).

Result: Converting multiple-choice to generative tasks reveals capability drops up to 35%. Filtering improves discriminative power while reducing computational cost. DatBench-Full includes 33 cleaned datasets, while DatBench subset achieves 13x average speedup (up to 50x) while maintaining discriminative power.

Conclusion: The work provides a path toward rigorous and sustainable evaluation practices for scaling VLMs, demonstrating that careful curation of existing benchmarks can dramatically improve evaluation quality and efficiency.

Abstract: Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.

[639] Heterogeneous Low-Bandwidth Pre-Training of LLMs

Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky

Main category: cs.LG

TL;DR: SparseLoCo (low-communication data parallelism) can be combined with pipeline model parallelism using activation compression, enabling heterogeneous distributed training with both high-bandwidth and low-bandwidth participants.

Details

Motivation: Scaling LLM pre-training is limited by bandwidth constraints, especially when model parallelism requires frequent large inter-device communications. There's a need to incorporate low-bandwidth participants and combine different parallelism strategies efficiently.

Method: Proposes heterogeneous distributed training framework: high-bandwidth participants host full replicas, while resource-limited participants use pipeline parallelism with subspace-projected inter-stage communication. Adapts subspace pipeline compression to work with SparseLoCo’s sparse pseudo-gradient exchange and infrequent synchronization.

Result: Activation compression composes with SparseLoCo at modest cost. Selective (heterogeneous) compression consistently improves loss-communication tradeoff compared to compressing all replicas, especially at aggressive compression ratios. Tested on 178M-1B parameter models on standard pretraining corpora.

Conclusion: Combining SparseLoCo with low-bandwidth pipeline model parallelism via activation compression provides a practical path to incorporate heterogeneous participants into LLM pre-training, enabling scaling beyond well-provisioned datacenters.

Abstract: Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters-especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas-especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

[640] Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models

Nouar AlDahoul, Aznul Qalid Md Sabri, Ali Mohammed Mansoor

Main category: cs.LG

TL;DR: The paper proposes using automatic feature learning methods combining optical flow with three deep models (S-CNN, pretrained CNN, and H-ELM) for human detection in aerial videos, achieving high accuracy on the challenging UCF-ARG dataset.

Details

Motivation: Traditional human detection approaches rely on handcrafted features that are problem-dependent, task-specific, and sensitive to dynamic events like illumination changes and camera jitter. Automatic feature learning offers cheaper, easier alternatives that produce abstract discriminative features without expert knowledge.

Method: Combines optical flow with three deep learning models: 1) Supervised Convolutional Neural Network (S-CNN), 2) Pretrained CNN feature extractor, and 3) Hierarchical Extreme Learning Machine (H-ELM). Models are trained and tested on the UCF-ARG aerial dataset with five human actions (digging, waving, throwing, walking, running).

Result: Pretrained CNN achieved highest average accuracy of 98.09%. S-CNN produced 95.6% with softmax and 91.7% with SVM. H-ELM achieved 95.9% accuracy. H-ELM training took 445 seconds on CPU, while S-CNN took 770 seconds on high-performance GPU.

Conclusion: The proposed automatic feature learning methods are successful for human detection in aerial videos, with pretrained CNN performing best. The approaches overcome limitations of traditional handcrafted features and handle challenges of nonstatic cameras and varying altitudes.

Abstract: Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need of expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM’s training time takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).

[641] LABOR-LLM: Language-Based Occupational Representations with Large Language Models

Susan Athey, Herman Brunborg, Tianyu Du, Ayush Kanodia, Keyon Vafa

Main category: cs.LG

TL;DR: This paper develops an empirical model for predicting workers’ next occupations using occupational history sequences. It fine-tunes a foundation LLM on resume-like text data from representative surveys to estimate transition probabilities, achieving superior predictive performance over prior models.

Details

Motivation: Predicting occupational transitions is challenging due to high-dimensional covariate spaces (occupational histories) and discrete outcomes with many possible values. Traditional models struggle with this complexity, creating a need for more sophisticated approaches to better understand labor market dynamics and career paths.

Method: The method converts tabular survey data about careers into text files resembling resumes, then fine-tunes a large language model foundation model on this data using next-token prediction. The fine-tuned LLM is used to calculate worker transition probabilities, with comparisons between different model sizes and data sources.

Result: The fine-tuned LLM surpasses all prior models in predictive performance for both granular next-occupation prediction and specific tasks like predicting occupation changes or labor force participation. Fine-tuning smaller LLMs with additional career data from different populations can outperform fine-tuning larger models. Predictive performance declines when occupational titles are replaced with unique codes instead of English text.

Conclusion: Fine-tuning foundation LLMs on resume-like career data provides a powerful approach for modeling occupational transitions, outperforming traditional methods. The approach benefits from natural language occupational titles and demonstrates that smaller models with more diverse training data can outperform larger models with less data.

Abstract: This paper builds an empirical model that predicts a worker’s next occupation as a function of the worker’s occupational history. Because histories are sequences of occupations, the covariate space is high-dimensional, and further, the outcome (the next occupation) is a discrete choice that can take on many values. To estimate the parameters of the model, we leverage an approach from generative artificial intelligence. Estimation begins from a foundation model'' trained on non-representative data and then fine-tunes’’ the estimation using data about careers from a representative survey. We convert tabular data from the survey into text files that resemble resumes and fine-tune the parameters of the foundation model, a large language model (LLM), using these text files with the objective of predicting the next token (word). The resulting fine-tuned LLM is used to calculate estimates of worker transition probabilities. Its predictive performance surpasses all prior models, both for the task of granularly predicting the next occupation as well as for specific tasks such as predicting whether the worker changes occupations or stays in the labor force. We quantify the value of fine-tuning and further show that by adding more career data from a different population, fine-tuning smaller LLMs (fewer parameters) surpasses the performance of fine-tuning larger models. When we omit the English language occupational title and replace it with a unique code, predictive performance declines.

[642] A Systematic Survey on Large Language Models for Algorithm Design

Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Kun Mao, Zhichao Lu, Zhenkun Wang, Mingxuan Yuan, Qingfu Zhang

Main category: cs.LG

TL;DR: This paper provides a systematic review of algorithm design with Large Language Models (LLMs), categorizing LLM roles and analyzing progress across algorithmic applications.

Details

Motivation: The rapid integration of LLMs into algorithm design has produced significant progress but lacks a comprehensive systematic review. Existing surveys are limited to narrow sub-fields or have different objectives, hindering holistic understanding of the field.

Method: The authors introduce a taxonomy categorizing LLM roles as optimizers, predictors, extractors, and designers. They systematically review literature across three phases of the algorithm design pipeline and diverse algorithmic applications.

Result: The review provides a structured analysis of progress, advantages, and limitations within each LLM role category, synthesizing the current landscape of algorithm design with LLMs across various domains.

Conclusion: The paper outlines key open challenges and opportunities to guide future research, and provides an accompanying repository (https://github.com/FeiLiu36/LLM4AlgorithmDesign) to support future research and collaboration.

Abstract: Algorithm design is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and promising solutions. In just a few years, this integration has yielded remarkable progress in areas ranging from combinatorial optimization to scientific discovery. Despite this rapid expansion, a holistic understanding of the field is hindered by the lack of a systematic review, as existing surveys either remain limited to narrow sub-fields or with different objectives. This paper seeks to provide a systematic review of algorithm design with LLMs. We introduce a taxonomy that categorises the roles of LLMs as optimizers, predictors, extractors and designers, analyzing the progress, advantages, and limitations within each category. We further synthesize literature across the three phases of the algorithm design pipeline and across diverse algorithmic applications that define the current landscape. Finally, we outline key open challenges and opportunities to guide future research. To support future research and collaboration, we provide an accompanying repository at: https://github.com/FeiLiu36/LLM4AlgorithmDesign.

[643] ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zheng Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu

Main category: cs.LG

TL;DR: Differentiable dynamic pruning method converts dense LLM MLP layers to Mixture of Experts architecture to maintain fixed active parameters without permanent removal, outperforming previous structural pruning techniques.

Details

Motivation: LLMs have huge computational and memory costs that make deployment on resource-constrained devices challenging. Previous pruning methods permanently remove model structures, causing substantial performance degradation due to permanent parameter deletion.

Method: Introduces a differentiable dynamic pruning method that converts dense model MLP layers into Mixture of Experts (MoE) architecture. This approach reduces active parameters without permanently removing them, maintaining a fixed number of active parameters during inference.

Result: The method consistently outperforms previous structural pruning techniques across diverse model families including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5, even without fine-tuning.

Conclusion: Dynamic pruning via MoE conversion provides an effective alternative to permanent structural pruning, reducing computational costs while maintaining better performance compared to previous approaches.

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs raise significant challenges in deploying these models on resource-constrained devices or efficiently serving them. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet these methods often result in substantial performance degradation due to the permanent deletion of model parameters. In this work, we tried to mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture of Experts (MoE) architecture. Our method, even without fine-tuning, consistently outperforms previous structural pruning techniques across diverse model families, including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5.

[644] Reasoning Beyond Limits: Advances and Open Problems for LLMs

Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah

Main category: cs.LG

TL;DR: Comprehensive review of 27 top LLMs (2023-2025) covering generative reasoning advancements, multilingual models, state space architectures, and future research challenges.

Details

Motivation: To provide a systematic analysis of recent breakthroughs in generative reasoning LLMs, multilingual capabilities, and efficient state space architectures, identifying core innovations and performance improvements across major models.

Method: Comprehensive review methodology analyzing 27 top LLMs (2023-2025) including DeepSeek-R1, OpenAI o1/o3, GPT-4o, Qwen-32B, Llama variants, Mistral AI Small 3 24B, Search-o1, QwQ-32B, and Phi-4. Covers training strategies: optimization techniques, MoE configurations, RAG, chain-of-thought prompting, self-improvement methods, test-time compute scaling, and distillation frameworks.

Result: Analysis reveals significant enhancements in reasoning capabilities through techniques like inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation. Identifies advancements in multilingual models for cross-lingual reasoning and state space models (Mamba) for efficient long-context processing compared to transformers.

Conclusion: Identifies key future challenges: enabling multi-step reasoning without human supervision, improving robustness in chained task execution, balancing structured prompting with generative flexibility, and enhancing integration of long-context retrieval and external tools.

Abstract: Recent breakthroughs in generative reasoning have fundamentally reshaped how large language models (LLMs) address complex tasks, enabling them to dynamically retrieve, refine, and organize information into coherent multi-step reasoning chains. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been effectively applied to state-of-the-art models, including DeepSeek-R1, OpenAI o1 and o3, GPT-4o, Qwen-32B, and various Llama variants, significantly enhancing their reasoning capabilities. In this paper, we present a comprehensive review of the top 27 LLMs released between 2023 and 2025, such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and Phi-4, and analyze their core innovations and performance improvements. We also provide a detailed overview of recent advancements in multilingual large language models (MLLMs), emphasizing methods that improve cross-lingual reasoning and address the limitations of English-centric training. In parallel, we present a comprehensive review of progress in state space model (SSM)-based architectures, including models such as Mamba, which demonstrate improved efficiency for long-context processing compared to transformer-based approaches. Our analysis covers training strategies including general optimization techniques, mixture-of-experts (MoE) configurations, retrieval-augmented generation (RAG), chain-of-thought prompting, self-improvement methods, and test-time compute scaling and distillation frameworks. Finally, we identify key challenges for future research, including enabling multi-step reasoning without human supervision, improving robustness in chained task execution, balancing structured prompting with generative flexibility, and enhancing the integration of long-context retrieval and external tools.

[645] CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Yoshihiro Yamada

Main category: cs.LG

TL;DR: Circular-convolutional Attention (CAT) is a Fourier-based Transformer attention mechanism that reduces complexity from O(N²) to O(NlogN) while maintaining accuracy, achieving ~10% speedup with fewer parameters.

Details

Motivation: Standard Transformer attention has O(N²) complexity that limits scalability to longer sequences, creating a need for more efficient attention mechanisms without sacrificing representational power.

Method: CAT uses circular convolutions in the Fourier domain to efficiently compute attention, streamlining fully connected layers to reduce parameters, and operates within the Engineering-Isomorphic Transformers framework.

Result: CAT achieves O(NlogN) complexity, reduces learnable parameters, provides ~10% speedup in naive PyTorch implementations, and maintains or improves accuracy compared to standard attention.

Conclusion: CAT offers practical efficiency with easy implementation, provides insights for future Transformer architectures, and ablation studies reveal key conditions for scalable attention mechanisms.

Abstract: Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes O(N^2) complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves O(NlogN) computations, requires fewer learnable parameters by streamlining fully connected layers, and introduces no additional heavy operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations. Based on the Engineering-Isomorphic Transformers (EITs) framework, CAT’s design not only offers practical efficiency and ease of implementation, but also provides insights to guide the development of future high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT’s success, shedding light on broader principles for scalable attention mechanisms.

[646] On Pitfalls of $\textit{RemOve-And-Retrain}$: Data Processing Inequality Perspective

Junhwa Song, Keumgang Cha, Junghoon Seo

Main category: cs.LG

TL;DR: ROAR and ROAD attribution evaluation metrics have a blurriness bias where less informative attributions can score better, undermining their reliability for assessing feature importance methods.

Details

Motivation: ROAR is widely used to benchmark feature importance/attribution methods in explainable AI, but its reliability needs scrutiny to ensure proper evaluation of explanation quality.

Method: Theoretical analysis and empirical investigations examining the ROAR (RemOve-And-Retrain) procedure and its variant ROAD (RemOve-And-Debias) to assess their reliability in evaluating attribution methods.

Result: Attributions with less information about the decision function can yield superior ROAR benchmark scores, contradicting ROAR’s original intent. This blurriness bias also affects ROAD, suggesting systematic issues with these evaluation metrics.

Conclusion: ROAR metrics have fundamental flaws and should not be used indiscriminately for evaluating attribution methods due to persistent blurriness bias that rewards less informative explanations.

Abstract: Approaches for appraising feature importance approximations, alternatively referred to as attribution methods, have been established across an extensive array of contexts. The development of resilient techniques for performance benchmarking constitutes a critical concern in the sphere of explainable deep learning. This study scrutinizes the dependability of the RemOve-And-Retrain (ROAR) procedure, which is prevalently employed for gauging the performance of feature importance estimates. The insights gleaned from our theoretical foundation and empirical investigations reveal that attributions containing lesser information about the decision function may yield superior results in ROAR benchmarks, contradicting the original intent of ROAR. This occurrence is similarly observed in the recently introduced variant RemOve-And-Debias (ROAD), and we posit a persistent pattern of blurriness bias in ROAR attribution metrics. Our findings serve as a warning against indiscriminate use on ROAR metrics.

[647] Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training

Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud

Main category: cs.LG

TL;DR: BBoxER is an evolutionary black-box optimization method for LLM post-training that provides privacy guarantees, robustness to attacks, and non-vacuous generalization bounds through implicit data compression.

Details

Motivation: Gradient-based optimization in deep learning can leak sensitive information and is vulnerable to data poisoning attacks. Black-box methods offer privacy and security advantages when data access is restricted or adversarial risks are high.

Method: BBoxER uses evolutionary black-box optimization for LLM post-training, creating an information bottleneck via implicit compression of training data. It relies solely on function evaluations rather than gradient access.

Result: BBoxER improves LLM performance with few iterations, generalizes well on reasoning benchmarks, and shows robustness to membership inference attacks. It provides theoretical guarantees for privacy, robustness to data poisoning, and extraction attacks.

Conclusion: BBoxER serves as an attractive add-on to gradient-based optimization, suitable for privacy-sensitive environments while offering non-vacuous generalization guarantees despite scalability challenges of black-box approaches.

Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for privacy, robustness to data poisoning attacks, and extraction attacks. In experiments with LLMs, we demonstrate empirically that black-box optimization methods, despite the scalability and computational challenges inherent to black-box approaches, are able to learn, showing how a few iterations of BBoxER improve performance, generalize well on a benchmark of reasoning datasets, and are robust to membership inference attacks. This positions BBoxER as an attractive add-on on top of gradient-based optimization, offering suitability for deployment in restricted or privacy-sensitive environments while also providing non-vacuous generalization guarantees.

[648] Beyond Expectations: Learning with Stochastic Dominance Made Practical

Shicong Cen, Jincheng Mei, Hanjun Dai, Dale Schuurmans, Yuejie Chi, Bo Dai

Main category: cs.LG

TL;DR: The paper proposes a general framework for machine learning with stochastic dominance, addressing challenges of partial ordering and computational complexity to enable practical applications across various learning tasks.

Details

Motivation: Stochastic dominance provides a powerful framework for modeling decision preferences under uncertainty (including risk aversion), but has seen limited application in machine learning due to two main challenges: 1) it only provides a partial order, making it unsuitable as a general optimality criterion, and 2) computational difficulties due to the continuum nature of evaluating stochastic dominance.

Method: The authors first generalize the stochastic dominance concept to enable feasible comparisons between any arbitrary pair of random variables. They then develop a simple and computationally efficient approach for finding optimal solutions in terms of stochastic dominance that can be seamlessly integrated into various learning tasks.

Result: Numerical experiments show the proposed method achieves comparable performance to standard risk-neutral strategies while obtaining better trade-offs against risk across diverse applications including supervised learning, reinforcement learning, and portfolio optimization.

Conclusion: The work establishes the first general framework for learning with stochastic dominance, overcoming previous limitations and demonstrating practical utility across multiple machine learning domains with improved risk management capabilities.

Abstract: Stochastic dominance serves as a general framework for modeling a broad spectrum of decision preferences under uncertainty, with risk aversion as one notable example, as it naturally captures the intrinsic structure of the underlying uncertainty, in contrast to simply resorting to the expectations. Despite theoretical appeal, the application of stochastic dominance in machine learning has been scarce, due to the following challenges: $\textbf{i)}$, the original concept of stochastic dominance only provides a $\textit{partial order}$, and therefore, is not amenable to serve as a general optimality criterion; and $\textbf{ii)}$, an efficient computational recipe remains lacking due to the continuum nature of evaluating stochastic dominance. In this work, we make the first attempt towards establishing a general framework of learning with stochastic dominance. We first generalize the stochastic dominance concept to enable feasible comparisons between any arbitrary pair of random variables. We next develop a simple and computationally efficient approach for finding the optimal solution in terms of stochastic dominance, which can be seamlessly plugged into many learning tasks. Numerical experiments demonstrate that the proposed method achieves comparable performance as standard risk-neutral strategies and obtains better trade-offs against risk across a variety of applications including supervised learning, reinforcement learning, and portfolio optimization.

[649] Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Main category: cs.LG

TL;DR: Klear-Reasoner is a long-reasoning model that achieves state-of-the-art performance on math and coding benchmarks through careful training methodology including high-quality SFT data and a novel GPPO RL approach.

Details

Motivation: The paper addresses the reproducibility issues in high-performance reasoning models due to incomplete training details disclosure. It aims to provide comprehensive analysis of the full post-training workflow and solve key problems in current RL clipping mechanisms.

Method: The method involves: 1) Data preparation focusing on small number of high-quality sources over diverse sources, 2) Long Chain-of-Thought supervised fine-tuning without accuracy filtering for difficult samples, 3) Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens to address exploration suppression and suboptimal trajectory issues in RL.

Result: Klear-Reasoner achieves exceptional performance: 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6. The GPPO approach enhances exploration capacity and improves learning efficiency from negative samples.

Conclusion: The paper demonstrates that careful deliberation in problem-solving, combined with high-quality data curation and improved RL techniques (GPPO), leads to outstanding reasoning capabilities in mathematics and programming. The comprehensive workflow analysis addresses reproducibility concerns in the community.

Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model’s exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.

[650] Generation of Geodesics with Actor-Critic Reinforcement Learning to Predict Midpoints

Kazumi Kasaura

Main category: cs.LG

TL;DR: A framework for generating shortest paths on manifolds using recursive midpoint prediction with actor-critic learning, outperforming existing methods on complex planning tasks.

Details

Motivation: Need to find shortest paths for all pairs on manifolds with infinitesimally defined metrics, which is challenging for complex kinematics and multi-DOF robot arms.

Method: Introduce framework to generate shortest paths by predicting midpoints recursively, using actor-critic approach to learn midpoint prediction.

Result: Proved soundness of approach and experimental results show method outperforms existing methods on path planning for agents with complex kinematics and motion planning for multi-DOF robot arms.

Conclusion: The proposed recursive midpoint prediction framework with actor-critic learning is effective for shortest path planning on manifolds with complex constraints.

Abstract: To find the shortest paths for all pairs on manifolds with infinitesimally defined metrics, we introduce a framework to generate them by predicting midpoints recursively. To learn midpoint prediction, we propose an actor-critic approach. We prove the soundness of our approach and show experimentally that the proposed method outperforms existing methods on several planning tasks, including path planning for agents with complex kinematics and motion planning for multi-degree-of-freedom robot arms.

[651] Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space

Xiang Zhang, Kun Wei, Xu Yang, Jiahua Li, Su Yan, Cheng Deng

Main category: cs.LG

TL;DR: RCU is a novel machine unlearning method that uses rotational salience weights and orthogonal rotation axes to enable continuous unlearning without needing retained datasets, preventing cumulative utility loss.

Details

Motivation: LLMs have security vulnerabilities, and existing machine unlearning methods require retained datasets and suffer from cumulative catastrophic utility loss under continuous unlearning requests.

Method: Rotation Control Unlearning (RCU) uses rotational salience weights to quantify unlearning degree, skew symmetric loss to create cognitive rotation space, and orthogonal rotation axes regularization to enforce perpendicular rotation directions for continuous unlearning.

Result: Experiments on multiple datasets show RCU achieves state-of-the-art performance without requiring retained datasets.

Conclusion: RCU effectively addresses the limitations of existing machine unlearning methods by enabling continuous unlearning without retained datasets while preventing cumulative utility loss.

Abstract: As Large Language Models (LLMs) become increasingly prevalent, their security vulnerabilities have already drawn attention. Machine unlearning is introduced to seek to mitigate these risks by removing the influence of undesirable data. However, existing methods not only rely on the retained dataset to preserve model utility, but also suffer from cumulative catastrophic utility loss under continuous unlearning requests. To solve this dilemma, we propose a novel method, called Rotation Control Unlearning (RCU), which leverages the rotational salience weight of RCU to quantify and control the unlearning degree in the continuous unlearning process. The skew symmetric loss is designed to construct the existence of the cognitive rotation space, where the changes of rotational angle can simulate the continuous unlearning process. Furthermore, we design an orthogonal rotation axes regularization to enforce mutually perpendicular rotation directions for continuous unlearning requests, effectively minimizing interference and addressing cumulative catastrophic utility loss. Experiments on multiple datasets confirm that our method without retained dataset achieves SOTA performance.

[652] Posets and Bounded Probabilities for Discovering Order-inducing Features in Event Knowledge Graphs

Christoffer Olling Back, Jakob Grue Simonsen

Main category: cs.LG

TL;DR: The paper presents a probabilistic framework for automated discovery of Event Knowledge Graphs from uncurated data using statistical inference rather than heuristic methods.

Details

Motivation: Event knowledge graphs capture multiple interacting views of process executions, but current approaches rely on manual analysis or heuristic strategies. There's a need for principled, automated EKG discovery from uncurated data.

Method: Develops a probabilistic framing based on feature-derived partial orders on events, with an EKG discovery algorithm using statistical inference. Addresses computational complexity with bound estimates and a branch-and-bound algorithm that prunes search space using antitonic upper bounds.

Result: The approach shows rapid convergence toward optimal solutions that are consistent with manually built EKGs, despite the #P-complete complexity of counting linear extensions of posets.

Conclusion: The paper successfully demonstrates a principled, automated approach to EKG discovery that overcomes computational challenges through statistical inference and efficient search space pruning.

Abstract: Event knowledge graphs (EKG) extend the classical notion of a trace to capture multiple, interacting views of a process execution. In this paper, we tackle the open problem of automating EKG discovery from uncurated data through a principled probabilistic framing based on the outcome space resulting from featured-derived partial orders on events. From this we derive an EKG discovery algorithm based on statistical inference rather than an ad hoc or heuristic-based strategy, or relying on manual analysis from domain experts. This approach comes at the computational cost of exploring a large, non-convex hypothesis space. In particular, solving the maximum likelihood term in our objective function involves counting the number of linear extensions of posets, which in general is #P-complete. Fortunately, bound estimates suffice for model comparison, and admit incorporation into a bespoke branch-and-bound algorithm. We establish an upper bound on our objective function which we show to be antitonic w.r.t. search depth for branching rules that are monotonic w.r.t. model inclusion. This allows pruning of large portions of the search space, which we show experimentally leads to rapid convergence toward optimal solutions that are consistent with manually built EKGs.

[653] Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting

Michael Y. Hu, Benjamin Van Durme, Jacob Andreas, Harsh Jhamtani

Main category: cs.LG

TL;DR: ECHO is a prompting framework that adapts hindsight experience replay for language model agents, generating optimized trajectories from failed attempts to improve sample efficiency in novel environments.

Details

Motivation: LM agents deployed in novel environments have poor sample efficiency when learning from sequential interactions, which is problematic in costly interaction settings like human interaction or physical systems. Existing LM agent architectures make limited use of LMs' abilities to generate or reason about full counterfactual trajectories.

Method: ECHO adapts hindsight experience replay from RL for language model agents. It has two components: 1) a hindsight rule that uses the LM to identify relevant subgoals and generate optimized trajectories for alternative goals from failed attempts, and 2) an update rule that maintains compressed trajectory representations in memory.

Result: ECHO outperforms vanilla language agent baselines by up to 80% across XMiniGrid (text-based navigation/planning) and PeopleJoinQA (collaborative information-gathering simulation). In XMiniGrid, it also outperforms sophisticated architectures like Reflexion and AWM, demonstrating faster adaptation through more effective experience utilization.

Conclusion: ECHO enables language model agents to learn more efficiently from interactions by generating synthetic positive examples from failed attempts, significantly improving sample efficiency and adaptation in novel environments.

Abstract: Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs’ abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.

[654] GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

Main category: cs.LG

TL;DR: GIFT is a novel RL framework for LLM alignment that minimizes discrepancy between implicit and explicit reward models instead of maximizing cumulative rewards, combining ideas from GRPO, DPO, and UNA with joint normalization for convex optimization.

Details

Motivation: Existing RL methods for LLM alignment (like PPO, GRPO) directly maximize cumulative rewards but face challenges with complex optimization, hyperparameter sensitivity, and training overfitting. Offline methods like DPO and UNA lose exploration capability. There's a need for a method that combines the benefits of on-policy exploration with stable, convex optimization.

Method: GIFT combines three key ideas: (1) online multi-response generation and normalization from GRPO, (2) implicit reward formulation from DPO, and (3) implicit-explicit reward alignment principle from UNA. The core innovation is joint normalization of implicit and explicit rewards, which eliminates an intractable term and transforms the optimization into a simple MSE loss between normalized reward functions, making it convex and analytically differentiable.

Result: GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Unlike offline methods, it retains on-policy exploration capability.

Conclusion: GIFT provides a novel approach to LLM alignment that combines the benefits of on-policy methods with stable convex optimization through implicit-explicit reward alignment and joint normalization, offering improved performance, efficiency, and generalization compared to existing methods.

Abstract: I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.

[655] Towards Unified Approaches in Self-Supervised Event Stream Modeling: Progress and Prospects

Levente Zólyomi, Tianze Wang, Sofiane Ennadir, Oleg Smirnov, Lele Cao

Main category: cs.LG

TL;DR: Survey paper on Self-Supervised Learning for event stream data across multiple domains, providing taxonomy, analysis of techniques, and future research directions.

Details

Motivation: Event stream data from domains like healthcare, e-commerce, gaming, and finance contains valuable insights but faces challenges: scarcity of labeled data and fragmented research. SSL offers promise for extracting meaningful representations from unlabeled event streams.

Method: Systematic review and synthesis of SSL methodologies for event stream modeling across multiple domains. Creates comprehensive taxonomy of SSL techniques (predictive and contrastive paradigms) and analyzes their applicability across different application contexts.

Result: Bridges gaps between domain-specific approaches, unifies disparate research efforts, and highlights cross-domain synergies. Provides framework for understanding SSL applications to event stream data.

Conclusion: Identifies critical research gaps and proposes future agenda for scalable, domain-agnostic SSL frameworks. Aims to accelerate innovation, improve reproducibility, and expand SSL applicability to diverse real-world event stream challenges.

Abstract: The proliferation of digital interactions across diverse domains, such as healthcare, e-commerce, gaming, and finance, has resulted in the generation of vast volumes of event stream (ES) data. ES data comprises continuous sequences of timestamped events that encapsulate detailed contextual information relevant to each domain. While ES data holds significant potential for extracting actionable insights and enhancing decision-making, its effective utilization is hindered by challenges such as the scarcity of labeled data and the fragmented nature of existing research efforts. Self-Supervised Learning (SSL) has emerged as a promising paradigm to address these challenges by enabling the extraction of meaningful representations from unlabeled ES data. In this survey, we systematically review and synthesize SSL methodologies tailored for ES modeling across multiple domains, bridging the gaps between domain-specific approaches that have traditionally operated in isolation. We present a comprehensive taxonomy of SSL techniques, encompassing both predictive and contrastive paradigms, and analyze their applicability and effectiveness within different application contexts. Furthermore, we identify critical gaps in current research and propose a future research agenda aimed at developing scalable, domain-agnostic SSL frameworks for ES modeling. By unifying disparate research efforts and highlighting cross-domain synergies, this survey aims to accelerate innovation, improve reproducibility, and expand the applicability of SSL to diverse real-world ES challenges.

[656] How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Main category: cs.LG

TL;DR: A framework to correct bias and provide confidence intervals for LLM-based evaluation using calibration data, with adaptive calibration to reduce uncertainty.

Details

Motivation: LLMs are widely used as scalable evaluators but their imperfect sensitivity and specificity induce bias in evaluation scores, requiring statistical correction.

Method: Proposes a plug-in framework that corrects bias using human-evaluated calibration data, constructs confidence intervals, and introduces adaptive calibration strategy to reduce uncertainty.

Result: The framework enables statistically sound LLM-based evaluation, characterizes regimes where it outperforms fully human evaluation, and shows robustness to distribution shift between test and calibration datasets.

Conclusion: Provides a practical and statistically rigorous approach to LLM-based evaluation that corrects bias, quantifies uncertainty, and can be more reliable than human evaluation in certain scenarios.

Abstract: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of LLM judgments induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and constructs confidence intervals accounting for uncertainty from both the test dataset and a human-evaluated calibration dataset, enabling statistically sound and practical LLM-based evaluation. Building on this framework, we introduce an adaptive calibration strategy for constructing the calibration dataset to reduce uncertainty in the estimated score. Notably, we characterize the regimes in which LLM-based evaluation within our framework produces more reliable estimates than fully human evaluation. Moreover, our framework is more robust to distribution shift between the test and calibration datasets than existing approaches.

[657] Language as a Wave Phenomenon: Iso-Energetic Phase-Locking and Semantic Interference in Neural Networks

Alper Yıldırım, İbrahim Yücedağ

Main category: cs.LG

TL;DR: PRISM is a phase-based sequence modeling architecture that enables passive photonic hardware to perform high-level reasoning by encoding information exclusively in phase angles while maintaining energy constraints.

Details

Motivation: Conventional deep learning uses magnitude-based representations that are metabolically expensive and incompatible with passive photonic hardware. There's a need for architectures that can bridge high-level reasoning with physical constraints of ultra-low-power optical systems.

Method: PRISM architecture enforces an Iso-Energetic (Unity Gain) principle, forcing the network to encode semantic information exclusively in phase angles. The method includes Holographic Backpropagation mechanism simulated on a noisy, 4-bit optical correlator, with phase-steering that actively optimizes physical parameters under strict energy constraints.

Result: On WMT14 translation benchmark: PRISM achieves 0.799 COMET score, competing with standard Transformers (0.821) and matching unconstrained spectral baselines like FNet (0.805), while using 11.5% fewer parameters. Hardware simulation shows 48.4% vs. 62.4% performance gain over frozen baseline, proving active optimization of physical parameters.

Conclusion: PRISM establishes an existence proof that ultra-low-power, passive optical hardware can support high-level linguistic intelligence without sacrificing representational capacity, bridging the gap between computational requirements and physical hardware constraints.

Abstract: Conventional deep learning paradigms rely on metabolically expensive magnitude-based representations, rendering them fundamentally incompatible with passive photonic hardware. We introduce PRISM, a sequence modeling architecture that bridges high-level reasoning and physical constraints by enforcing an Iso-Energetic (Unity Gain) principle, compelling the network to encode semantic information exclusively in the phase angle. Validated on the WMT14 translation benchmark, PRISM achieves a 0.799 COMET score, demonstrating that phase-based reasoning competes with standard Transformers (0.821) and functionally matches unconstrained spectral baselines like FNet (0.805), despite enforcing strict energy constraints and requiring 11.5% fewer parameters. Furthermore, to verify hardware feasibility, we simulate a Holographic Backpropagation mechanism on a noisy, 4-bit optical correlator. Ablation studies reveal a substantial performance gain (48.4% vs. 62.4%) over a frozen baseline, proving that the proposed phase-steering mechanism actively optimizes physical parameters under strict energy constraints. These results establish an existence proof that ultra-low-power, passive optical hardware can support high-level linguistic intelligence without sacrificing representational capacity.

[658] SIP-BMM: Constructing the Capability–Efficiency Pareto Set for LLMs via Structural Importance Prior Bayesian Model Merging

Kesheng Chen, Yamin Hu, Zhenqian Zhu, Wenjian Luo, Yiya Diao

Main category: cs.LG

TL;DR: SIP-BMM is a framework that automatically constructs Pareto sets for LLMs by combining structural importance priors with sparse Bayesian optimization to efficiently navigate capability-efficiency trade-offs.

Details

Motivation: Existing model merging techniques are inadequate for constructing Pareto sets - coarse-grained methods yield sparse suboptimal solutions, while fine-grained layer-wise approaches suffer from computational intractability due to high dimensionality.

Method: Structural Importance Prior Bayesian Model Merging (SIP-BMM) uses importance-aware Sparse Axis-Aligned Subspace Bayesian Optimization (SAASBO) guided by structural importance priors derived from task-vector differences to identify critical layers, reducing effective dimensionality while maintaining full-model control granularity.

Result: SIP-BMM discovers stronger and denser Pareto fronts than competitive baselines, enabling agile model selection tailored to diverse operational constraints.

Conclusion: SIP-BMM provides an automated framework for constructing LLM Pareto sets that resolves the dichotomy between coarse-grained and fine-grained merging approaches, making high-dimensional layer-wise search tractable through structural importance priors and sparse Bayesian optimization.

Abstract: Constructing a Pareto set is pivotal for navigating the capability–efficiency trade-offs in Large Language Models (LLMs). However, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the curse of dimensionality, rendering the search space computationally intractable. To resolve this dichotomy, we propose Structural Importance Prior Bayesian Model Merging (SIP-BMM), a framework that automatically constructs the LLM Pareto set. SIP-BMM renders high-dimensional layer-wise search tractable by introducing an importance-aware Sparse Axis-Aligned Subspace Bayesian Optimization (SAASBO) strategy. By leveraging a structural importance prior derived from task-vector differences, our method guides SAASBO to automatically identify critical layers, thereby dramatically reducing the effective dimensionality without sacrificing the granularity of full-model control. The entire process is automated within an evolutionary loop driven by the Log-Noisy Expected Hypervolume Improvement ($q$NEHVI) acquisition function. Experiments demonstrate that SIP-BMM discovers a stronger and denser Pareto front than competitive baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/MiLab-HITSZ/2026-SIPBMM.

[659] AdvKT: An Adversarial Multi-Step Training Framework for Knowledge Tracing

Lingyue Fu, Ting Long, Jianghao Lin, Wei Xia, Xinyi Dai, Ruiming Tang, Yasheng Wang, Weinan Zhang, Yong Yu

Main category: cs.LG

TL;DR: AdvKT introduces an adversarial multi-step training framework for knowledge tracing that addresses error accumulation and data sparsity through adversarial learning and data augmentation.

Details

Motivation: Existing KT models use single-step training but require multi-step inference in real-world applications, causing error accumulation. Combined with data sparsity, this degrades recommendation performance in intelligent tutoring systems.

Method: Proposes AdvKT with adversarial learning (generator and discriminator). Generator mimics high-reward responses to reduce multi-step error accumulation, discriminator provides feedback for synthetic data generation. Includes specialized data augmentation techniques for realistic variations.

Result: Experiments on four real-world datasets demonstrate AdvKT’s superiority over existing KT models in addressing both error accumulation and data sparsity issues.

Conclusion: AdvKT effectively tackles the critical challenges of error accumulation and data sparsity in knowledge tracing through its novel adversarial multi-step training framework.

Abstract: Knowledge Tracing (KT) monitors students’ knowledge states and simulates their responses to question sequences. Existing KT models typically follow a single-step training paradigm, which leads to discrepancies with the multi-step inference process required in real-world simulations, resulting in significant error accumulation. This accumulation of error, coupled with the issue of data sparsity, can substantially degrade the performance of recommendation models in the intelligent tutoring systems. To address these challenges, we propose a novel Adversarial Multi-Step Training Framework for Knowledge Tracing (AdvKT), which, for the first time, focuses on the multi-step KT task. More specifically, AdvKT leverages adversarial learning paradigm involving a generator and a discriminator. The generator mimics high-reward responses, effectively reducing error accumulation across multiple steps, while the discriminator provides feedback to generate synthetic data. Additionally, we design specialized data augmentation techniques to enrich the training data with realistic variations, ensuring that the model generalizes well even in scenarios with sparse data. Experiments conducted on four real-world datasets demonstrate the superiority of AdvKT over existing KT models, showcasing its ability to address both error accumulation and data sparsity issues effectively.

[660] Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang

Main category: cs.LG

TL;DR: LLM inference scheduling algorithm (WAIT) uses threshold-based batching to prevent KV cache eviction and memory overflow, achieving near-optimal throughput with known output lengths, and Nested WAIT handles unknown lengths via on-the-fly classification.

Details

Motivation: LLM inference poses unique scheduling challenges due to dynamic KV cache growth that can cause memory overflow and system failures. Conventional scheduling fails because it doesn't account for this dynamic memory growth, making systems unstable even with sufficient theoretical capacity.

Method: Formulate LLM inference as multi-stage online scheduling problem. Develop fluid dynamics approximation for tractable benchmark. WAIT algorithm uses threshold-based batching to prevent eviction by keeping system near load balance. Nested WAIT handles unknown output lengths by classifying prompts on-the-fly: short prompts exit early, longer prompts advance to later segments with safety buffer for memory protection.

Result: Theoretical analysis shows near-optimal performance in asymptotic regime. Experiments on Llama-7B with A100 GPU demonstrate superior throughput and reduced latency compared to vLLM and Sarathi.

Conclusion: This work applies operations research principles to establish theoretical framework for LLM deployment under memory constraints, providing effective scheduling solutions for dynamic KV cache management.

Abstract: Large Language Models (LLMs) power many modern applications, but their inference procedure poses unique scheduling challenges: the Key-Value (KV) cache grows dynamically during response generation, and memory overflow triggers eviction that can cascade into system-wide failures. Even when memory capacity exceeds the theoretical requirement, conventional scheduling algorithms fail because they do not account for this dynamic memory growth – a system that should be stable can become unstable under poor scheduling. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to establish a tractable benchmark and derive the Waiting for Accumulated Inference Threshold (WAIT) algorithm. WAIT uses threshold-based batching to prevent eviction by keeping the system near load balance, achieving near-optimal throughput when output lengths are known. For practical settings where output lengths are unknown at arrival, we introduce Nested WAIT. Rather than predicting output lengths, Nested WAIT classifies prompts on-the-fly: short prompts complete early and exit, while longer prompts naturally advance to later segments. A safety buffer provides high-probability protection against memory overflow with only logarithmic overhead. Theoretical analysis establishes near-optimal performance in the asymptotic regime. Experiments on Llama-7B with an A100 GPU demonstrate that our approach achieves superior throughput and reduced latency compared to vLLM and Sarathi. This work applies operations research principles to establish a theoretical framework for LLM deployment under memory constraints.

[661] NoveltyRank: A Retrieval-Augmented Framework for Conceptual Novelty Estimation in AI Research

Zhengxu Yan, Han Li, Yuming Feng

Main category: cs.LG

TL;DR: A framework for estimating conceptual novelty of research papers using semantic representation learning and retrieval-based comparison against prior literature, with models outperforming larger zero-shot approaches through task-specific fine-tuning.

Details

Motivation: The accelerating pace of scientific publication makes it difficult to identify truly original research among incremental work, necessitating automated methods to assess conceptual novelty.

Method: Combines semantic representation learning with retrieval-based comparison against prior literature, modeling novelty as both binary classification (novel vs. non-novel) and pairwise ranking (comparative novelty). Experiments benchmark three model scales from compact domain-specific encoders to zero-shot frontier models.

Result: Fine-tuned lightweight models outperform larger zero-shot models despite smaller parameter counts, indicating task-specific supervision matters more than scale for conceptual novelty estimation. The best-performing model is deployed as an online system for public interaction and real-time novelty scoring.

Conclusion: Task-specific supervision is more important than model scale for conceptual novelty estimation, and the framework enables both absolute and relative novelty assessments with practical deployment as an online system.

Abstract: The accelerating pace of scientific publication makes it difficult to identify truly original research among incremental work. We propose a framework for estimating the conceptual novelty of research papers by combining semantic representation learning with retrieval-based comparison against prior literature. We model novelty as both a binary classification task (novel vs. non-novel) and a pairwise ranking task (comparative novelty), enabling absolute and relative assessments. Experiments benchmark three model scales, ranging from compact domain-specific encoders to a zero-shot frontier model. Results show that fine-tuned lightweight models outperform larger zero-shot models despite their smaller parameter count, indicating that task-specific supervision matters more than scale for conceptual novelty estimation. We further deploy the best-performing model as an online system for public interaction and real-time novelty scoring.

[662] AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai

Main category: cs.LG

TL;DR: AFA-LoRA enhances LoRA’s expressive power by introducing an annealed activation function that transitions from non-linear to linear during training, bridging the gap between linear and non-linear fine-tuning while maintaining mergeability.

Details

Motivation: LoRA's linear adaptation process limits its expressive power, creating a gap between linear and non-linear training approaches. The authors aim to enhance LoRA's capabilities while preserving its practical advantages like seamless mergeability.

Method: Proposes AFA-LoRA with an annealed activation function that starts as non-linear (for stronger representational power) and gradually transitions to linear during training, allowing the adapter to converge to a mergeable linear form while initially benefiting from non-linear expressivity.

Result: AFA-LoRA reduces the performance gap between LoRA and full-parameter training across multiple applications including supervised fine-tuning, reinforcement learning, and speculative decoding.

Conclusion: The work enables a more powerful and practical paradigm of parameter-efficient adaptation by bringing non-linear expressivity to LoRA while maintaining its mergeability advantages.

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.

[663] Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning

Shangzhe Li, Zhiao Huang, Hao Su

Main category: cs.LG

TL;DR: A novel online imitation learning method using random network distillation (RND) for reward modeling that improves stability over adversarial approaches while achieving expert-level performance.

Details

Motivation: Existing imitation learning methods often face instability challenges, particularly when using adversarial reward or value formulations in world model frameworks, which limits their practical application.

Method: Proposes a reward model based on random network distillation (RND) for density estimation, built on joint estimation of expert and behavioral distributions within the latent space of the world model.

Result: The method achieves stable performance and expert-level results across diverse benchmarks including DMControl, Meta-World, and ManiSkill2 for both locomotion and manipulation tasks.

Conclusion: The RND-based approach demonstrates improved stability over adversarial methods while maintaining expert-level performance, addressing key limitations in current imitation learning techniques.

Abstract: Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert-level results in both locomotion and manipulation tasks. Our approach demonstrates improved stability over adversarial methods while maintaining expert-level performance.

[664] Sample Path Regularity of Gaussian Processes from the Covariance Kernel

Nathaël Da Costa, Marvin Pförtner, Lancelot Da Costa, Philipp Hennig

Main category: cs.LG

TL;DR: The paper establishes necessary and sufficient conditions on covariance kernels for Gaussian process sample paths to achieve specific Hölder regularity, with simplified conditions for stationary/isotropic GPs and applications to common kernels like Matérn.

Details

Motivation: While Gaussian processes are widely used for probability distributions over function spaces, there's a lack of comprehensive understanding about the regularity of their sample paths. Current practice defines GPs through mean functions and covariance kernels rather than probability measures, creating a gap in understanding path regularity.

Method: The authors develop mathematical conditions on covariance kernels that determine the Hölder regularity of GP sample paths. They focus on Hölder regularity because it yields particularly straightforward conditions, which further simplify for stationary and isotropic Gaussian processes.

Result: The paper provides novel and unusually tight characterizations of sample path regularities for commonly used GPs in machine learning, particularly demonstrating these results for Matérn Gaussian processes.

Conclusion: The research bridges the gap between covariance kernel specifications and the resulting regularity properties of GP sample paths, offering practical tools for understanding and selecting appropriate kernels based on desired function space regularity in machine learning applications.

Abstract: Gaussian processes (GPs) are the most common formalism for defining probability distributions over spaces of functions. While applications of GPs are myriad, a comprehensive understanding of GP sample paths, i.e. the function spaces over which they define a probability measure, is lacking. In practice, GPs are not constructed through a probability measure, but instead through a mean function and a covariance kernel. In this paper we provide necessary and sufficient conditions on the covariance kernel for the sample paths of the corresponding GP to attain a given regularity. We focus primarily on Hölder regularity as it grants particularly straightforward conditions, which simplify further in the cases of stationary and isotropic GPs. We then demonstrate that our results allow for novel and unusually tight characterisations of the sample path regularities of the GPs commonly used in machine learning applications, such as the Matérn GPs.

[665] Balancing Fidelity and Plasticity: Aligning Mixed-Precision Fine-Tuning with Linguistic Hierarchies

Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Qian Qiao, Jun Gao, Shichao Weng, Weizhong Zhang, Cheng Jin

Main category: cs.LG

TL;DR: QR-Adaptor is a unified framework that jointly optimizes per-layer quantization bit-width and LoRA rank to address the fidelity-plasticity trade-off in LLM fine-tuning on edge devices, achieving 4-bit performance rivaling 16-bit baselines.

Details

Motivation: Current quantization-aware fine-tuning methods decouple quantization and adapter optimization, overlooking the fidelity-plasticity trade-off: a layer's adaptation capacity (plasticity) is constrained by the information capacity of its frozen weights (fidelity). Aggressive quantization of critical layers creates information bottlenecks, while high precision in robust layers wastes memory on edge devices.

Method: QR-Adaptor jointly optimizes per-layer quantization bit-width and LoRA rank allocation. It formulates resource allocation as a multi-objective search aligned with the model’s linguistic hierarchy, systematically reallocating memory from redundancy-heavy layers to capacity-critical ones.

Result: Extensive experiments show QR-Adaptor establishes a new Pareto frontier. Notably, models fine-tuned under a strict 4-bit memory budget achieve performance rivaling 16-bit baselines, demonstrating that precise resource alignment is as critical as model size.

Conclusion: The fidelity-plasticity trade-off is fundamental in LLM fine-tuning, and QR-Adaptor’s unified optimization of quantization and adapter parameters enables efficient edge deployment by aligning resource allocation with the model’s linguistic hierarchy.

Abstract: Deploying and fine-tuning Large Language Models (LLMs) on resource-constrained edge devices requires navigating a strict trade-off between memory footprint and task performance. While Quantization-Aware Fine-tuning has emerged as a viable solution, existing paradigms typically decouple quantization and adapter optimization. This separation overlooks a fundamental theoretical constraint we identify as the \textit{Fidelity-Plasticity Trade-off}: a layer’s capacity to adapt to new tasks (Plasticity) is inherently constrained by the information capacity of its frozen weights (Fidelity). Aggressively quantizing semantically critical layers creates an information bottleneck that no amount of adapter rank can recover, while high precision in robust syntactic layers wastes valuable memory. To address this, we introduce \textbf{QR-Adaptor}, a unified framework that jointly optimizes per-layer quantization bit-width and LoRA rank. By formulating resource allocation as a multi-objective search aligned with the model’s linguistic hierarchy, our method systematically liberates memory from redundancy-heavy layers to reinvest in capacity-critical ones. Extensive experiments demonstrate that QR-Adaptor establishes a new Pareto frontier: notably, a model fine-tuned under a strict 4-bit memory budget achieves performance rivaling 16-bit baselines, demonstrating that precise resource alignment is as critical as model size.

[666] Scaling Open-Ended Reasoning to Predict the Future

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

Main category: cs.LG

TL;DR: Researchers trained language models for open-ended forecasting using automated news-based data generation, achieving state-of-the-art performance with their 8B parameter model that matches larger proprietary models.

Details

Motivation: High-stakes decision making requires reasoning under uncertainty about future events. The paper aims to develop language models capable of making accurate predictions on open-ended forecasting questions, addressing the need for scalable training data and preventing information leakage.

Method: Used automated curation to synthesize forecasting questions from daily news events, trained Qwen3 thinking models on OpenForesight dataset, employed offline news corpus to prevent future information leakage, implemented retrieval mechanisms, and developed improved reward functions for reinforcement learning.

Result: OpenForecaster 8B model matches performance of much larger proprietary models, showing improvements in accuracy, calibration, and consistency. Calibration improvements from forecasting training generalize across popular benchmarks. All models, code, and data are open-sourced.

Conclusion: The research demonstrates that specialized forecasting models can achieve competitive performance through automated data generation and careful training methodologies, making language model forecasting research more accessible through open-source release.

Abstract: High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing between May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

[667] Stochastic Online Optimization for Cyber-Physical and Robotic Systems

Hao Ma, Melanie Zeilinger, Michael Muehlebach

Main category: cs.LG

TL;DR: Novel gradient-based online optimization framework for stochastic programming in cyber-physical systems with constraints, partial observability, and approximate dynamics models, with convergence analysis and experimental validation.

Details

Motivation: Address stochastic programming problems in cyber-physical/robotic systems that have continuous state-action spaces, nonlinear dynamics, partial observability, and need to incorporate prior knowledge of approximate dynamics models to improve learning.

Method: Gradient-based online optimization framework encompassing gradient descent and quasi-Newton methods, incorporating approximate dynamics models as prior knowledge, with unified convergence analysis for non-convex settings and characterization of modeling error impacts.

Result: Shows that even rough estimates of dynamics significantly improve algorithm convergence, provides convergence analysis in non-convex settings, characterizes modeling error impacts, and validates algorithms in simulations (flexible beam, walking robot) and real-world experiments (ping-pong robot).

Conclusion: Proposed framework effectively solves stochastic programming for cyber-physical systems by leveraging approximate dynamics models, with proven convergence properties and practical validation across diverse robotic applications.

Abstract: We propose a novel gradient-based online optimization framework for solving stochastic programming problems that frequently arise in the context of cyber-physical and robotic systems. Our problem formulation accommodates constraints that model the evolution of a cyber-physical system, which has, in general, a continuous state and action space, is nonlinear, and where the state is only partially observed. We also incorporate an approximate model of the dynamics as prior knowledge into the learning process and show that even rough estimates of the dynamics can significantly improve the convergence of our algorithms. Our online optimization framework encompasses both gradient descent and quasi-Newton methods, and we provide a unified convergence analysis of our algorithms in a non-convex setting. We also characterize the impact of modeling errors in the system dynamics on the convergence rate of the algorithms. Finally, we evaluate our algorithms in simulations of a flexible beam, a four-legged walking robot, and in real-world experiments with a ping-pong playing robot.

[668] SinBasis Networks: Matrix-Equivalent Feature Extraction for Wave-Like Optical Spectrograms

Yuzhou Zhu, Zheng Zhang, Ruyi Zhang, Liang Zhou

Main category: cs.LG

TL;DR: Sin-Basis Networks: A unified framework using sinusoidal weight reparametrization to enhance deep learning models’ sensitivity to periodic patterns in wave-like images across diverse domains.

Details

Motivation: Wave-like images (attosecond streaking spectrograms, optical spectra, audio mel-spectrograms, periodic video frames) contain critical harmonic structures that conventional feature extractors fail to capture effectively.

Method: Reinterpret convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors. Apply elementwise sin(·) mappings to weight matrices to infuse spectral priors. Embed these transforms into CNN, ViT, and Capsule architectures to create Sin-Basis Networks.

Result: Experiments on diverse wave-like image datasets (80k synthetic attosecond streaking spectrograms, Raman/PL/FTIR spectra, AudioSet mel-spectrograms, Kinetics video frames) show substantial gains in reconstruction accuracy, translational robustness, and zero-shot cross-domain transfer.

Conclusion: Sin-Basis Networks offer a lightweight, physics-informed approach to deep learning across all wave-form imaging modalities, with theoretical analysis showing enriched expressivity while preserving stability in data-scarce regimes.

Abstract: Wave-like images–from attosecond streaking spectrograms to optical spectra, audio mel-spectrograms and periodic video frames–encode critical harmonic structures that elude conventional feature extractors. We propose a unified, matrix-equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors spanning latent feature subspaces. To infuse spectral priors we apply elementwise (\sin(\cdot)) mappings to each weight matrix. Embedding these transforms into CNN, ViT and Capsule architectures yields Sin-Basis Networks with heightened sensitivity to periodic motifs and built-in invariance to spatial shifts. Experiments on a diverse collection of wave-like image datasets–including 80,000 synthetic attosecond streaking spectrograms, thousands of Raman, photoluminescence and FTIR spectra, mel-spectrograms from AudioSet and cycle-pattern frames from Kinetics–demonstrate substantial gains in reconstruction accuracy, translational robustness and zero-shot cross-domain transfer. Theoretical analysis via matrix isomorphism and Mercer-kernel truncation quantifies how sinusoidal reparametrization enriches expressivity while preserving stability in data-scarce regimes. Sin-Basis Networks thus offer a lightweight, physics-informed approach to deep learning across all wave-form imaging modalities.

[669] Echo State Networks for Spatio-Temporal Area-Level Data

Zhenhua Wang, Scott H. Holan, Christopher K. Wikle

Main category: cs.LG

TL;DR: Proposes enhancing Echo State Networks with graph spectral filters to incorporate spatial dependencies in spatio-temporal area-level data forecasting, improving accuracy while maintaining computational efficiency.

Details

Motivation: Spatio-temporal area-level datasets are crucial for policy-making but existing Echo State Networks lack mechanisms to account for spatial neighborhood structures, which compromises forecast accuracy when ignored.

Method: Incorporates approximate graph spectral filters at the input stage of Echo State Networks to capture spatial relationships while preserving the model’s computational efficiency during training.

Result: Demonstrates effectiveness using Eurostat’s tourism occupancy dataset, showing improved forecast accuracy that supports more informed decision-making in policy and planning contexts.

Conclusion: The proposed approach successfully integrates spatial dependencies into ESNs, enhancing forecasting accuracy for area-level data while maintaining computational efficiency, making it valuable for policy and planning applications.

Abstract: Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can be extremely useful for policymakers to develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model’s computational efficiency during training. We demonstrate the effectiveness of our approach using Eurostat’s tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts.

[670] Learning Repetition-Invariant Representations for Polymer Informatics

Yihan Zhu, Gang Liu, Eric Inae, Tengfei Luo, Meng Jiang

Main category: cs.LG

TL;DR: GRIN is a graph neural network method that learns polymer representations invariant to the number of repeating units, addressing limitations of existing methods that only model single units.

Details

Motivation: Existing graph neural networks for polymers only model single units and fail to produce consistent representations for true polymer structures with varying numbers of repeating units, limiting their effectiveness for polymer applications.

Method: GRIN integrates graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency, with theoretical guarantees showing three repeating units are minimal for optimal invariant representation learning.

Result: GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.

Conclusion: GRIN provides a novel solution for learning polymer representations that are invariant to repeating unit counts, enabling better generalization and performance in polymer applications across various fields.

Abstract: Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.

[671] Relaxed Equivariance via Multitask Learning

Ahmed A. Elhag, T. Konstantin Rusch, Francesco Di Giovanni, Michael Bronstein

Main category: cs.LG

TL;DR: REMUL introduces a training procedure that learns approximate equivariance for unconstrained networks via multitask learning, offering tunable control over symmetry constraints while maintaining computational efficiency.

Details

Motivation: Strictly equivariant models have computational complexity challenges, but roto-translational equivariance is crucial for modeling geometric graphs and molecules where 3D structure understanding enhances generalization.

Method: REMUL formulates equivariance as a tunable objective alongside primary task loss using multitask learning. Unconstrained models learn approximate symmetries by minimizing an additional simple equivariance loss, enabling quantitative control over the trade-off between equivariance constraints and task performance.

Result: The method achieves competitive performance compared to equivariant baselines while being significantly faster (up to 10× at inference and 2.5× at training), offering a practical approach to leveraging symmetry in unconstrained architectures.

Conclusion: REMUL provides a principled way to control approximate symmetry, relaxing rigid constraints of traditional equivariant architectures while maintaining computational efficiency and competitive performance.

Abstract: Incorporating equivariance as an inductive bias into deep learning architectures to take advantage of the data symmetry has been successful in multiple applications, such as chemistry and dynamical systems. In particular, roto-translations are crucial for effectively modeling geometric graphs and molecules, where understanding the 3D structures enhances generalization. However, strictly equivariant models often pose challenges due to their higher computational complexity. In this paper, we introduce REMUL, a training procedure that learns \emph{approximate} equivariance for unconstrained networks via multitask learning. By formulating equivariance as a tunable objective alongside the primary task loss, REMUL offers a principled way to control the degree of approximate symmetry, relaxing the rigid constraints of traditional equivariant architectures. We show that unconstrained models (which do not build equivariance into the architecture) can learn approximate symmetries by minimizing an additional simple equivariance loss. This enables quantitative control over the trade-off between enforcing equivariance constraints and optimizing for task-specific performance. Our method achieves competitive performance compared to equivariant baselines while being significantly faster (up to 10$\times$ at inference and 2.5$\times$ at training), offering a practical and adaptable approach to leveraging symmetry in unconstrained architectures.

[672] ManiBox: Enhancing Embodied Spatial Generalization via Scalable Simulation Data Generations

Hengkai Tan, Xuezhou Xu, Chengyang Ying, Xinyi Mao, Zeyuan Wang, Songming Liu, Xingxing Zhang, Zhizhong Su, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: ManiBox is a bounding-box-guided framework that decouples perception from policy generalization to reduce Sim2Real gap, leveraging scalable simulation data and enabling zero-shot transfer to real robots.

Details

Motivation: Embodied agents need robust spatial intelligence for precise real-world manipulations, but current methods struggle with accurate object positioning. Real robot data collection is expensive, while simulation data causes visual generalization gaps in real-world deployment.

Method: ManiBox uses an RL teacher policy to generate scalable simulation data, then distills a student policy that takes bounding boxes as input (sufficient for spatial positioning), enabling zero-shot transfer to real robots.

Result: ManiBox demonstrates strong spatial generalization and adaptability across various manipulation tasks in both simulated and real-world environments, with empirical verification of spatial scaling laws showing power-law relationships between data requirements and spatial volume.

Conclusion: The bounding-box-guided approach effectively reduces Sim2Real gap, leverages Internet-scale data, and enables scalable policy data collection while providing insights into spatial scaling laws for manipulation tasks.

Abstract: Embodied agents require robust spatial intelligence to execute precise real-world manipulations. However, this remains a significant challenge, as current methods often struggle to accurately position objects in space. Collecting extensive data can help address this issue by enhancing the agent’s spatial understanding. Nonetheless, obtaining such data with real robots is prohibitively expensive, and relying on simulation data frequently leads to visual generalization gaps during real-world deployment. To tackle these challenges, we propose ManiBox, a novel bounding-box-guided framework. By decoupling perception from policy generalization, ManiBox effectively reduces the Sim2Real gap, leverages Internet-scale data, and scales our policy data collection in simulation. Specifically, within ManiBox, the RL teacher policy efficiently generates scalable simulation data. The student policy is distilled from this data and takes bounding boxes as input, which is proven sufficient for determining objects’ spatial positions, thus enabling zero-shot transfer to real robots. Comprehensive evaluations in both simulated and real-world environments demonstrate that ManiBox exhibits strong spatial generalization and adaptability across various manipulation tasks and settings. Furthermore, our empirical study provides preliminary verification of spatial scaling laws, i.e., the amount of data required for spatial generalization scales with spatial volume following a power-law relationship. At a given spatial volume level, the success rate of manipulation tasks follows Michaelis-Menten kinetics with respect to data volume, exhibiting a saturation effect as data increases. Our videos and code are available at https://thkkk.github.io/manibox

[673] “FRAME: Forward Recursive Adaptive Model Extraction-A Technique for Advance Feature Selection”

Nachiket Kapure, Harsh Joshi, Parul Kumari, Rajeshwari Mistri, Manasi Mali

Main category: cs.LG

TL;DR: FRAME is a novel hybrid feature selection method combining Forward Selection and Recursive Feature Elimination to balance accuracy, interpretability, and computational efficiency in machine learning.

Details

Motivation: Addressing the critical challenge of balancing model accuracy, interpretability, and computational efficiency in feature selection, which remains a key issue for advancing machine learning methodologies.

Method: Forward Recursive Adaptive Model Extraction Technique (FRAME) - a hybrid approach combining Forward Selection (for exploration) with Recursive Feature Elimination (for refinement) to systematically identify optimal feature subsets.

Result: FRAME consistently delivers superior predictive performance compared to traditional methods (SelectKBest, Lasso Regression) on high-dimensional, noisy, and heterogeneous datasets, efficiently performing dimensionality reduction while maintaining strong model performance.

Conclusion: FRAME offers a practical solution for interpretable and accurate predictions, particularly useful in domains like biomedical diagnostics, and has potential for further development with deep learning frameworks for adaptive, real-time feature selection in dynamic settings.

Abstract: The challenges in feature selection, particularly in balancing model accuracy, interpretability, and computational efficiency, remain a critical issue in advancing machine learning methodologies. To address these complexities, this study introduces a novel hybrid approach, the Forward Recursive Adaptive Model Extraction Technique (FRAME), which combines Forward Selection and Recursive Feature Elimination (RFE) to enhance feature selection across diverse datasets. By combining the exploratory capabilities of Forward Selection with the refinement strengths of RFE, FRAME systematically identifies optimal feature subsets, striking a harmonious trade-off between experimentation and precision. A comprehensive evaluation of FRAME is conducted against traditional methods such as SelectKBest and Lasso Regression, using high-dimensional, noisy, and heterogeneous datasets. The results demonstrate that FRAME consistently delivers superior predictive performance based on downstream machine learning evaluation metrics. It efficiently performs dimensionality reduction with strong model performance, thus being especially useful for applications that need interpretable and accurate predictions, e.g., biomedical diagnostics. This research emphasizes the need to evaluate feature selection techniques on diverse datasets to test their robustness and generalizability. The results indicate that FRAME has great potential for further development, especially by incorporating deep learning frameworks for adaptive and real-time feature selection in dynamic settings. By advancing feature selection methodologies, FRAME offers a practical and effective solution to improve machine learning applications across multiple domains.

[674] Bandit and Delayed Feedback in Online Structured Prediction

Yuki Shibukawa, Taira Tsuchiya, Shinsaku Sakaue, Kenji Yamanishi

Main category: cs.LG

TL;DR: Proposed online structured prediction algorithms for bandit and delayed feedback settings, achieving surrogate regret bounds independent of output set size.

Details

Motivation: Full-information feedback in online structured prediction is often unrealistic as it requires immediate access to whole complex output structures. Need algorithms that work with less demanding feedback like bandit and delayed feedback.

Method: Two main approaches: 1) Standard inverse-weighted gradient estimator for bandit feedback achieving O(√KT) regret, 2) Novel pseudo-inverse matrix estimator achieving O(T^{2/3}) regret independent of K. Also developed algorithms for delayed feedback covering full-information/bandit with fixed/variable delays.

Result: Achieved surrogate regret bounds: O(√KT) with standard estimator, O(T^{2/3}) with pseudo-inverse estimator (K-independent). Provided algorithms for delayed feedback scenarios. Numerical comparison shows performance differences.

Conclusion: Proposed practical algorithms for online structured prediction with realistic feedback constraints. Key contribution: K-independent regret bound for bandit feedback using pseudo-inverse estimator, addressing scalability issues with complex outputs.

Abstract: Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the \textit{surrogate regret}, \textit{i.e.,}~the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, \textit{bandit} and \textit{delayed} feedback. For bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms, as well as existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.

[675] EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction

Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang

Main category: cs.LG

TL;DR: EMLoC enables memory-efficient fine-tuning of large foundation models using a lightweight emulator and LoRA correction, allowing fine-tuning within inference-level memory budgets.

Details

Motivation: Fine-tuning large foundation models for domain-specific tasks is prohibitively expensive due to high memory requirements beyond inference needs, limiting accessibility for most users.

Method: Uses activation-aware SVD to create a lightweight emulator from a small calibration set, fine-tunes it with LoRA, then corrects the LoRA module to align with the original model for inference.

Result: Outperforms other baselines across multiple datasets and modalities, enables fine-tuning of a 38B model (originally requiring 95GB) on a single 24GB consumer GPU.

Conclusion: EMLoC provides practical, efficient model adaptation for individual users by enabling fine-tuning within inference memory budgets without quantization.

Abstract: Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific light-weight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning then is performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model, which originally required 95GB of memory, on a single 24GB consumer GPU-bringing efficient and practical model adaptation to individual users.

[676] Towards Fair In-Context Learning with Tabular Foundation Models

Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji

Main category: cs.LG

TL;DR: First investigation of fairness in tabular in-context learning shows uncertainty-based sample selection improves group fairness metrics with minimal accuracy impact.

Details

Motivation: Transformer-based tabular foundation models show promising ICL performance but their fairness implications remain unexplored, creating a need to investigate fairness in this new paradigm.

Method: Evaluated three tabular foundation models (TabPFNv2, TabICL, TabDPT) on benchmark datasets using three fairness-enhancing methods: correlation removal, group-balanced sample selection, and uncertainty-based sample selection.

Result: Uncertainty-based strategy consistently improves group fairness metrics (demographic parity, equalized odds, equal opportunity) with minimal impact on predictive accuracy.

Conclusion: Fairness in tabular ICL is important and addressable; uncertainty-based sample selection is an effective approach; code released for reproducibility.

Abstract: Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models–TabPFNv2, TabICL, and TabDPT–on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility https://github.com/patrikken/Fair-TabICL.

[677] Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity

Zichen Liu, Wei Zhang, Tiejun Li

Main category: cs.LG

TL;DR: The paper addresses singularity issues in Euclidean diffusion models for manifold data, proposing two methods (Niso-DM and Tango-DM) to improve sampling accuracy by handling multiscale score function singularities.

Details

Motivation: Euclidean diffusion models work well for standard data but struggle with manifold-structured data due to multiscale singularities in the score function when applied in ambient space. Previous works focused on special manifolds, but this paper aims to handle general manifolds directly.

Method: The authors analyze the singularity structure of the score function by decomposing it into tangential and normal components. They propose two methods: (1) Niso-DM uses non-isotropic noise to reduce scale discrepancies in the score function, and (2) Tango-DM trains only the tangential component using a tangential-only loss function.

Result: Numerical experiments show superior performance on distributions over various manifolds with complex geometries compared to standard approaches.

Conclusion: The proposed methods effectively mitigate score function singularities in Euclidean diffusion models for manifold data, enabling more accurate sampling on general manifolds without requiring explicit manifold structure utilization.

Abstract: Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold cases in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, in this paper we investigate direct sampling of the Euclidean diffusion models for general manifold-structured data. We reveal the multiscale singularity of the score function in the ambient space, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by decomposing it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which reduces the scale discrepancies in the score function by utilizing a non-isotropic noise, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.

[678] Contrastive Self-Supervised Learning As Neural Manifold Packing

Guanming Zhang, David J. Heeger, Stefano Martiniani

Main category: cs.LG

TL;DR: CLAMP is a self-supervised learning framework that reformulates representation learning as a manifold packing problem, using physics-inspired repulsive forces to separate neural manifolds of different image classes.

Details

Motivation: The paper is motivated by observations from neuroscience where neural responses to stimuli form geometric structures called neural manifolds. The authors aim to bridge insights from physics, neuroscience, and machine learning by framing representation learning as a manifold packing problem, similar to how the brain organizes and separates different stimulus classes.

Method: CLAMP introduces a loss function inspired by the potential energy of short-range repulsive particle systems (like those in simple liquids and jammed packings). Each image class consists of sub-manifolds embedding multiple augmented views of a single image. The sizes and positions of these sub-manifolds are dynamically optimized by following the gradient of a packing loss, creating interpretable dynamics similar to jamming physics.

Result: Under standard linear evaluation protocol, CLAMP achieves competitive performance with state-of-the-art self-supervised models. Analysis shows that neural manifolds corresponding to different categories emerge naturally and are effectively separated in the learned representation space.

Conclusion: CLAMP successfully bridges insights from physics, neuroscience, and machine learning by framing self-supervised learning as a manifold packing problem. The approach yields interpretable dynamics and geometrically meaningful hyperparameters while achieving competitive performance, demonstrating the potential of interdisciplinary approaches to representation learning.

Abstract: Contrastive self-supervised learning based on point-wise comparisons has been widely studied for vision tasks. In the visual cortex of the brain, neuronal responses to distinct stimulus classes are organized into geometric structures known as neural manifolds. Accurate classification of stimuli can be achieved by effectively separating these manifolds, akin to solving a packing problem. We introduce Contrastive Learning As Manifold Packing (CLAMP), a self-supervised framework that recasts representation learning as a manifold packing problem. CLAMP introduces a loss function inspired by the potential energy of short-range repulsive particle systems, such as those encountered in the physics of simple liquids and jammed packings. In this framework, each class consists of sub-manifolds embedding multiple augmented views of a single image. The sizes and positions of the sub-manifolds are dynamically optimized by following the gradient of a packing loss. This approach yields interpretable dynamics in the embedding space that parallel jamming physics, and introduces geometrically meaningful hyperparameters within the loss function. Under the standard linear evaluation protocol, which freezes the backbone and trains only a linear classifier, CLAMP achieves competitive performance with state-of-the-art self-supervised models. Furthermore, our analysis reveals that neural manifolds corresponding to different categories emerge naturally and are effectively separated in the learned representation space, highlighting the potential of CLAMP to bridge insights from physics, neural science, and machine learning.

[679] Is Grokking a Computational Glass Relaxation?

Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang

Main category: cs.LG

TL;DR: Grokking is framed as computational glass relaxation, not a first-order phase transition. The study finds no entropy barrier in memorization-to-generalization transition, identifies high-entropy advantage, and develops a toy optimizer that eliminates grokking.

Details

Motivation: To understand neural network generalizability through the lens of grokking phenomenon, which offers a unique window to investigate underlying mechanisms of generalization in deep learning.

Method: Framing grokking as computational glass relaxation: viewing NNs as physical systems where parameters are degrees of freedom and train loss is system energy. Sampling NNs’ Boltzmann entropy landscape as function of training loss and test accuracy. Experiments with transformers on arithmetic tasks. Development of WanD optimizer based on Wang-landau molecular dynamics.

Result: No entropy barrier found in memorization-to-generalization transition of grokking, challenging previous first-order phase transition theory. Identified high-entropy advantage under grokking. WanD optimizer successfully eliminates grokking without constraints and finds high-norm generalizing solutions.

Conclusion: Grokking is not a first-order phase transition with entropy barrier, but rather a computational glass relaxation process. The findings provide counterexamples to theories attributing grokking solely to weight norm evolution and suggest new directions for optimizer design based on far-from-equilibrium dynamics.

Abstract: Understanding neural network’s (NN) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs’ generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find memorization process resembles a rapid cooling of liquid into non-equilibrium glassy state at low temperature and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs’ Boltzmann entropy (states of density) landscape as a function of training loss and test accuracy. Our experiments in transformers on arithmetic tasks suggests that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking’s far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.

[680] v-PuNNs: van der Put Neural Networks for Transparent Ultrametric Representation Learning

Gnankan Landry Regis N’guessan

Main category: cs.LG

TL;DR: v-PuNNs are novel neural networks using p-adic numbers for hierarchical data, achieving SOTA on benchmarks with perfect ultrametric properties and interpretable weights.

Details

Motivation: Euclidean space is poorly suited for hierarchical data like taxonomies, word senses, or file systems. Current models lack exact subtree semantics and interpretability for strictly hierarchical structures.

Method: Introduces van der Put Neural Networks (v-PuNNs) with neurons as characteristic functions of p-adic balls in ℤₚ. Uses Transparent Ultrametric Representation Learning (TURL) where weights are p-adic numbers. Employs Valuation-Adaptive Perturbation Optimization (VAPO) for training in discrete space, with deterministic (HiPaN-DS) and moment-based (Adam-VAPO) variants.

Result: Achieves SOTA on three benchmarks: WordNet nouns (99.96% leaf accuracy in 16 min), GO molecular-function (96.9% leaf/100% root in 50s), NCBI Mammalia (Spearman ρ=-0.96). Learned metric is perfectly ultrametric with zero triangle violations. Also extends to quantum systems (HiPaQ) and tabular data generation (Tab-HiPaN).

Conclusion: v-PuNNs bridge number theory and deep learning, providing exact, interpretable, and efficient models for hierarchical data with theoretical guarantees and practical performance.

Abstract: Conventional deep learning models embed data in Euclidean space $\mathbb{R}^d$, a poor fit for strictly hierarchical objects such as taxa, word senses, or file systems. We introduce van der Put Neural Networks (v-PuNNs), the first architecture whose neurons are characteristic functions of p-adic balls in $\mathbb{Z}p$. Under our Transparent Ultrametric Representation Learning (TURL) principle every weight is itself a p-adic number, giving exact subtree semantics. A new Finite Hierarchical Approximation Theorem shows that a depth-K v-PuNN with $\sum{j=0}^{K-1}p^{,j}$ neurons universally represents any K-level tree. Because gradients vanish in this discrete space, we propose Valuation-Adaptive Perturbation Optimization (VAPO), with a fast deterministic variant (HiPaN-DS) and a moment-based one (HiPaN / Adam-VAPO). On three canonical benchmarks our CPU-only implementation sets new state-of-the-art: WordNet nouns (52,427 leaves) 99.96% leaf accuracy in 16 min; GO molecular-function 96.9% leaf / 100% root in 50 s; NCBI Mammalia Spearman $ρ= -0.96$ with true taxonomic distance. The learned metric is perfectly ultrametric (zero triangle violations), and its fractal and information-theoretic properties are analyzed. Beyond classification we derive structural invariants for quantum systems (HiPaQ) and controllable generative codes for tabular data (Tab-HiPaN). v-PuNNs therefore bridge number theory and deep learning, offering exact, interpretable, and efficient models for hierarchical data.

[681] Accelerating Sparse Transformer Inference on GPU

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Main category: cs.LG

TL;DR: STOF is a framework that optimizes sparse Transformer performance on GPUs through flexible masking, operator fusion, and adaptive kernel mapping, achieving up to 1.6x speedup in MHA computation and 1.4x in end-to-end inference.

Details

Motivation: Current sparse Transformer implementations lack performance optimization, and static operator fusion schemes cannot adapt to diverse application scenarios. There's a need for better GPU acceleration of sparse Transformers with flexible masking capabilities.

Method: STOF uses analytical modeling to map MHA computation to row-wise or blockwise kernels with unique storage formats. For downstream operators, it maps fusion schemes to compilation templates and determines optimal configurations through two-stage searching.

Result: STOF achieves maximum speedups of 1.6x in multi-head attention computation and 1.4x in end-to-end inference compared to state-of-the-art methods.

Conclusion: STOF successfully addresses performance optimization challenges in sparse Transformers by providing flexible masking, adaptive operator fusion, and efficient GPU kernel mapping, significantly improving inference speed.

Abstract: Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the stateof-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

[682] Ambiguous Online Learning

Vanessa Kosoy

Main category: cs.LG

TL;DR: Proposes “ambiguous online learning” where learners can output multiple predicted labels, considered correct if at least one label is correct and none are “predictably wrong” according to an unknown multi-valued hypothesis.

Details

Motivation: Natural setting for multivalued dynamical systems, recommendation algorithms, and lossless compression; strongly related to "apple tasting" problems where partial correctness matters.

Method: Defines ambiguous online learning framework with multi-valued predictions and hypotheses, where predictions are correct if at least one label is correct and none violate the unknown true hypothesis.

Result: Shows a trichotomy of mistake bounds: up to logarithmic factors, any hypothesis class has optimal mistake bound of either Θ(1), Θ(√N), or N.

Conclusion: Introduces a novel online learning variant with practical applications and establishes fundamental mistake bound classification for hypothesis classes in this setting.

Abstract: We propose a new variant of online learning that we call “ambiguous online learning”. In this setting, the learner is allowed to produce multiple predicted labels. Such an “ambiguous prediction” is considered correct when at least one of the labels is correct, and none of the labels are “predictably wrong”. The definition of “predictably wrong” comes from a hypothesis class in which hypotheses are also multi-valued. Thus, a prediction is “predictably wrong” if it’s not allowed by the (unknown) true hypothesis. In particular, this setting is natural in the context of multivalued dynamical systems, recommendation algorithms and lossless compression. It is also strongly related to so-called “apple tasting”. We show that in this setting, there is a trichotomy of mistake bounds: up to logarithmic factors, any hypothesis class has an optimal mistake bound of either Theta(1), Theta(sqrt(N)) or N.

[683] Detecting Proxy Gaming in RL and LLM Alignment via Evaluator Stress Tests

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: EST framework detects proxy gaming in AI systems by testing evaluator vulnerabilities through controlled perturbations, validated across RL and LLM domains with high precision/recall.

Details

Motivation: Proxy optimization (reward hacking in RL, evaluator gaming in LLM alignment) threatens AI safety by allowing systems to exploit evaluator weaknesses rather than genuinely improving intended objectives.

Method: Evaluator Stress Test (EST) - an invariance-based framework that separates exploitable sensitivity (formatting artifacts, physics bugs) from content-driven improvements using controlled perturbations with semantic validity audits.

Result: In RL: 78.4% precision, 81.7% recall across 15 environments and 5 algorithms (2,156 episodes). In LLM alignment: 74.2% precision, 78.6% recall across 4 tasks, 2 model scales, 2 training methods, 2 judges (1,200 instances). Early warning signals precede quality decline. Closed-loop mitigation improves human win-rate by 8.3 points (LLM) and reduces hacking by 54.6% (RL).

Conclusion: EST effectively detects proxy gaming across domains, provides early warning signals, and enables mitigation. Cross-domain analysis shows proxy-true correlation tracking transfers directly while perturbation design requires domain adaptation. Benchmarks released for both domains.

Abstract: Proxy optimization, where AI systems exploit evaluator weaknesses rather than improve intended objectives, threatens both reinforcement learning (reward hacking) and LLM alignment (evaluator gaming). We introduce the Evaluator Stress Test (EST), an invariance-based framework that detects proxy gaming by separating exploitable sensitivity (e.g., formatting artifacts, physics bugs) from content-driven improvements using controlled perturbations with semantic validity audits. We validate EST across both domains. In RL, across 15 environments and 5 algorithms (2,156 expert-annotated episodes), EST achieves 78.4% precision and 81.7% recall. In LLM alignment, across 4 tasks, 2 model scales, 2 training methods, and 2 judges (1,200 human-annotated instances), EST achieves 74.2% precision and 78.6% recall, with early warning signals that precede quality decline. Cross-domain analysis shows that proxy-true correlation tracking transfers directly between domains, while perturbation design requires domain adaptation. Closed-loop mitigation improves human win-rate by 8.3 points (LLM) and reduces hacking by 54.6% (RL). We release benchmarks for both domains: 2,156 RL episodes and 1,200 LLM instances.

[684] Reinforcement Learning via Conservative Agent for Environments with Random Delays

Jongsoo Lee, Jangwon Kim, Jiseok Jeong, Soohee Han

Main category: cs.LG

TL;DR: Proposes a “conservative agent” that transforms random-delay RL environments into constant-delay equivalents, enabling existing constant-delay methods to work in random-delay settings without modification.

Details

Motivation: Real-world RL applications often face delayed feedback, violating Markov assumptions. While constant delays have solutions, random delays remain largely unexplored due to their variability and unpredictability.

Method: The conservative agent reformulates random-delay environments into constant-delay equivalents, allowing any state-of-the-art constant-delay method to be directly applied to random-delay environments without algorithmic modifications.

Result: Empirical evaluation on continuous control tasks shows the conservative agent-based algorithm significantly outperforms existing baselines in both asymptotic performance and sample efficiency.

Conclusion: The proposed approach provides a simple yet robust solution for decision-making under random delays, enabling seamless extension of existing constant-delay methods to more challenging random-delay environments.

Abstract: Real-world reinforcement learning applications are often hindered by delayed feedback from environments, which violates the Markov assumption and introduces significant challenges. Although numerous delay-compensating methods have been proposed for environments with constant delays, environments with random delays remain largely unexplored due to their inherent variability and unpredictability. In this study, we propose a simple yet robust agent for decision-making under random delays, termed the conservative agent, which reformulates the random-delay environment into its constant-delay equivalent. This transformation enables any state-of-the-art constant-delay method to be directly extended to the random-delay environments without modifying the algorithmic structure or sacrificing performance. We evaluate the conservative agent-based algorithm on continuous control tasks, and empirical results demonstrate that it significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.

[685] Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation

Hongze Sun, Wuque Cai, Duo Chen, Quan Tang, Shifeng Mao, Jiayi He, Zhenxing Wang, Yan Cui, Dezhong Yao, Daqing Guo

Main category: cs.LG

TL;DR: The paper proposes combining synapse pruning with synergistic learning-based compensation to create lightweight spiking Transformer models that reduce parameters and computational costs while maintaining performance.

Details

Motivation: Existing spiking Transformer models require substantial parameters and incur high computational costs, limiting deployment in resource-constrained environments.

Method: Two pruning strategies: unstructured L1P for sparse representations and structured DSP for low-rank representations, plus an enhanced sLIF neuron model with synergistic learning between synaptic and intrinsic plasticity mechanisms for compensation.

Result: Extensive experiments show significant reduction in model size and computational overhead while maintaining competitive performance on benchmark datasets.

Conclusion: The proposed pruning and compensation strategies effectively construct efficient and high-performing spiking Transformer-based models.

Abstract: As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer(ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured $\mathrm{L_{1}P}$ method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.

[686] Non-omniscient backdoor injection with one poison sample: Proving the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks

Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi

Main category: cs.LG

TL;DR: The paper proves that a single poisoned sample is sufficient to inject a backdoor into ML models without significantly affecting benign task performance, establishing the “one-poison hypothesis” for linear models and 2-layer ReLU networks.

Details

Motivation: Current backdoor attacks either use few samples with extensive data knowledge or require poisoning many data points. The paper addresses the open question of the minimum poison data needed for successful backdoor attacks, proposing that a single poison sample with limited background knowledge can be sufficient.

Method: The authors formulate the “one-poison hypothesis” and provide theoretical proofs for linear regression, linear classification, and 2-layer ReLU neural networks. They analyze cases where adversaries use directions unused by clean data distribution and build on prior statistical backdoor learning work to show limited impact on benign tasks.

Result: Theoretical proofs demonstrate that with one poison sample, adversaries can achieve zero backdooring-error while maintaining benign task performance. For linear models using unused directions, the poisoned model is functionally equivalent to one trained without poison. Experimental validation on benchmark datasets confirms theoretical findings.

Conclusion: The one-poison hypothesis is valid for several ML models, showing that even a single poisoned sample can successfully inject backdoors with minimal impact on normal task performance, highlighting significant security vulnerabilities in ML systems trained on untrusted data.

Abstract: Backdoor poisoning attacks are a threat to machine learning models trained on large data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task, however, an open question is what amount of poison data is needed for a successful backdoor attack. Typical attacks either use few samples but need much information about the data points, or need to poison many data points. In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. For adversaries that utilize a direction unused by the clean data distribution for the poison sample, we prove for linear classification and linear regression that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We validate our theoretical results experimentally with realistic benchmark data sets.

[687] How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System

Kaiwen Zuo, Zelin Liu, Raman Dutt, Ziyang Wang, Zhongtian Sun, Fan Mo, Pietro Liò

Main category: cs.LG

TL;DR: MedThreatRAG is a novel multimodal poisoning framework that attacks medical RAG systems by injecting adversarial image-text pairs with cross-modal semantic contradictions, exposing security vulnerabilities in clinical AI systems.

Details

Motivation: Medical RAG systems that use external clinical image-text retrieval create significant attack surfaces, but current security vulnerabilities in these multimodal systems are not well understood or addressed.

Method: Proposes MedThreatRAG framework with simulated semi-open attack environment mimicking real medical systems. Introduces Cross-Modal Conflict Injection (CMCI) that embeds subtle semantic contradictions between medical images and paired reports to disrupt cross-modal alignment while evading conventional filters.

Result: MedThreatRAG reduces answer F1 scores by up to 27.66% on IU-Xray and MIMIC-CXR QA tasks, lowering LLaVA-Med-1.5 F1 rates to as low as 51.36%. CMCI demonstrates the most severe degradation compared to basic textual/visual attacks.

Conclusion: Exposes fundamental security gaps in clinical RAG systems, highlighting urgent need for threat-aware design and robust multimodal consistency checks. Provides guidelines for safe development of future multimodal medical RAG systems.

Abstract: Large Vision-Language Models (LVLMs) augmented with Retrieval-Augmented Generation (RAG) are increasingly employed in medical AI to enhance factual grounding through external clinical image-text retrieval. However, this reliance creates a significant attack surface. We propose MedThreatRAG, a novel multimodal poisoning framework that systematically probes vulnerabilities in medical RAG systems by injecting adversarial image-text pairs. A key innovation of our approach is the construction of a simulated semi-open attack environment, mimicking real-world medical systems that permit periodic knowledge base updates via user or pipeline contributions. Within this setting, we introduce and emphasize Cross-Modal Conflict Injection (CMCI), which embeds subtle semantic contradictions between medical images and their paired reports. These mismatches degrade retrieval and generation by disrupting cross-modal alignment while remaining sufficiently plausible to evade conventional filters. While basic textual and visual attacks are included for completeness, CMCI demonstrates the most severe degradation. Evaluations on IU-Xray and MIMIC-CXR QA tasks show that MedThreatRAG reduces answer F1 scores by up to 27.66% and lowers LLaVA-Med-1.5 F1 rates to as low as 51.36%. Our findings expose fundamental security gaps in clinical RAG systems and highlight the urgent need for threat-aware design and robust multimodal consistency checks. Finally, we conclude with a concise set of guidelines to inform the safe development of future multimodal medical RAG systems.

[688] Applying Deep Learning to Anomaly Detection of Russian Satellite Activity for Indications Prior to Military Activity

David Kurtenbach, Megan Manly, Zach Metzinger

Main category: cs.LG

TL;DR: Deep learning anomaly detection applied to Russian satellite activity before Ukraine invasion using TLE data, identifying statistically significant anomalies in orbital behavior as potential military I&W indicators.

Details

Motivation: To detect indications and warnings of aggressive military behavior by analyzing anomalous activity of Russian-owned space objects prior to the Ukraine invasion, using publicly available TLE data to understand potential tactics and procedures.

Method: Used statistical and deep learning approaches including isolation forest, traditional autoencoder, variational autoencoder, Kolmogorov Arnold Network, and a novel anchor-loss based autoencoder. Individual models trained for each RSO using 5-year baseline data, focusing on 6 months pre-invasion and post-invasion periods. Anomalies detected via reconstruction errors exceeding threshold sigma, with explainable analysis of individual orbital elements.

Result: Identified statistically significant anomalies in Russian RSO activity, with detailed findings at individual orbital element level, demonstrating potential for using space object behavior as I&W indicators for future conflicts.

Conclusion: Deep learning anomaly detection on publicly available TLE data can effectively identify significant changes in Russian satellite patterns of life/behavior, providing valuable indications and warnings of potential military aggression with explainable, element-level insights.

Abstract: We apply deep learning techniques for anomaly detection to analyze activity of Russian-owned resident space objects (RSO) prior to the Ukraine invasion and assess the results for any findings that can be used as indications and warnings (I&W) of aggressive military behavior for future conflicts. Through analysis of anomalous activity, an understanding of possible tactics and procedures can be established to assess the existence of statistically significant changes in Russian RSO pattern of life/pattern of behavior (PoL/PoB) using publicly available two-line element (TLE) data. This research looks at statistical and deep learning approaches to assess anomalous activity. The deep learning methods assessed are isolation forest (IF), traditional autoencoder (AE), variational autoencoder (VAE), Kolmogorov Arnold Network (KAN), and a novel anchor-loss based autoencoder (Anchor AE). Each model is used to establish a baseline of on-orbit activity based on a five-year data sample. The primary investigation period focuses on the six months leading up to the invasion date of February 24, 2022. Additional analysis looks at RSO activity during an active combat period by sampling TLE data after the invasion date. The deep learning autoencoder models identify anomalies based on reconstruction errors that surpass a threshold sigma. To capture the nuance and unique characteristics of each RSO an individual model was trained for each observed space object. The research made an effort to prioritize explainability and interpretability of the model results thus each observation was assessed for anomalous behavior of the individual six orbital elements versus analyzing the input data as a single monolithic observation. The results demonstrate not only statistically significant anomalies of Russian RSO activity but also details anomalous findings to the individual orbital element.

[689] An entropy formula for the Deep Linear Network

Govind Menon, Tianmin Yu

Main category: cs.LG

TL;DR: The paper studies Riemannian geometry of Deep Linear Networks to develop a thermodynamic description of learning, using group actions and Riemannian submersion to analyze overparametrization and compute Boltzmann entropy.

Details

Motivation: To establish a thermodynamic framework for understanding the learning process in deep neural networks by analyzing their Riemannian geometric structure, particularly focusing on overparametrization and entropy.

Method: Uses group actions to analyze overparametrization, Riemannian submersion from parameter space to observable space, foliation of balanced manifold by group orbits, and explicit construction of orthonormal basis for tangent space using Jacobi matrix theory.

Result: Shows that Riemannian geometry on observable space is obtained by Riemannian submersion of balanced manifold, and defines/computes Boltzmann entropy using the foliation structure.

Conclusion: Provides a geometric foundation for thermodynamic description of learning in deep linear networks through Riemannian geometry, group theory, and entropy concepts.

Abstract: We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.

[690] Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping

Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel, Martin Atzmueller, Gerard B M Heuvelink

Main category: cs.LG

TL;DR: Kriging prior regression (KpR) combines machine learning with spatial context using spatial lag features from ordinary kriging, improving soil property predictions by ~30% R2 compared to non-spatial ML methods.

Details

Motivation: Machine learning and geostatistics have complementary strengths for soil mapping - ML captures feature relationships while geostatistics leverages spatial structure. The authors aim to create a hybrid framework that combines both approaches for better predictions in precision agriculture.

Method: Proposed kriging prior regression (KpR) enriches ML with spatial context by engineering ‘spatial lag’ features from ordinary kriging. This follows inverse logic of regression kriging. Evaluated using TabPFN model on six field-scale datasets from LimeSoDa containing soil organic carbon, clay content, pH with remote sensing and proximal soil sensing features.

Result: KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions compared to spatial techniques (regression/residual kriging) and non-spatial ML algorithms (random forest). Improved average R2 by ~30% compared to ML without spatial context. TabPFN works well with small sample sizes, while KpR compensates for weak feature-soil relationships when sensing data is limited.

Conclusion: KpR with TabPFN is a robust and versatile modeling framework for digital soil mapping in precision agriculture, combining strong prediction performance of TabPFN with complementary spatial information from KpR features.

Abstract: Machine learning and geostatistics are two fundamentally different frameworks for predicting and spatially mapping soil properties. Geostatistics leverages the spatial structure of soil properties, while machine learning captures the relationship between available environmental features and soil properties. We propose a hybrid framework that enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging. We call this approach ‘kriging prior regression’ (KpR), as it follows the inverse logic of regression kriging. To evaluate this approach, we assessed both the point and probabilistic prediction performance of KpR, using the TabPFN model across six fieldscale datasets from LimeSoDa. These datasets included soil organic carbon, clay content, and pH, along with features derived from remote sensing and in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions in comparison to several other spatial techniques (e.g., regression/residual kriging with TabPFN), as well as to established non-spatial machine learning algorithms (e.g., random forest). Most notably, it significantly improved the average R2 by around 30% compared to machine learning algorithms without spatial context. This improvement was due to the strong prediction performance of the TabPFN algorithm itself and the complementary spatial information provided by KpR features. TabPFN is particularly effective for prediction tasks with small sample sizes, common in precision agriculture, whereas KpR can compensate for weak relationships between sensing features and soil properties when proximal soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a very robust and versatile modelling framework for digital soil mapping in precision agriculture.

[691] TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding

Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan

Main category: cs.LG

TL;DR: TimeMosaic is a multivariate time series forecasting framework that addresses temporal heterogeneity through adaptive patch embedding and segment-wise decoding, outperforming existing methods.

Details

Motivation: Existing patch-based methods use fixed-length segmentation, which overlooks heterogeneity in local temporal dynamics and forecasting horizons. This leads to loss of details in information-dense regions, redundancy in stable segments, and failure to capture different complexities of short-term vs long-term forecasting.

Method: TimeMosaic uses adaptive patch embedding to dynamically adjust granularity based on local information density, balancing motif reuse with structural clarity while preserving temporal continuity. It also employs segment-wise decoding that treats each prediction horizon as a related subtask, adapting to horizon-specific difficulty and information requirements rather than using a single uniform decoder.

Result: Extensive evaluations on benchmark datasets show consistent improvements over existing methods. The model trained on a large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art Time Series Foundation Models (TSFMs).

Conclusion: TimeMosaic effectively addresses temporal heterogeneity in multivariate time series forecasting through its adaptive patch embedding and segment-wise decoding approach, demonstrating superior performance across various benchmarks and competitive results with large-scale training.

Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.

[692] Discovering Association Rules in High-Dimensional Small Tabular Data

Erkan Karabulut, Daniel Daza, Paul Groth, Victoria Degeler

Main category: cs.LG

TL;DR: Aerial+ neurosymbolic method scales 10-100x better than SOTA for high-dimensional ARM, and new fine-tuning approaches using tabular foundation models improve rule quality in low-data, high-dimensional scenarios.

Details

Motivation: Association Rule Mining faces rule explosion and computational overhead in high-dimensional settings, and existing neurosymbolic methods like Aerial+ inherit neural network limitations in low-data regimes.

Method: Two fine-tuning approaches to Aerial+ using tabular foundation models, addressing ARM in high-dimensional, low-data settings (e.g., 18k features, 50 samples).

Result: Aerial+ scales 10-100x better than SOTA baselines across five real-world datasets, and proposed fine-tuning approaches significantly improve rule quality in low-data, high-dimensional scenarios.

Conclusion: The paper advances ARM for high-dimensional tabular data by improving scalability and addressing low-data regimes through foundation model integration, enabling practical application in domains like biomedicine.

Abstract: Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.

[693] Spatio-Temporal Graph Deep Learning with Stochastic Differential Equations for Uncovering Alzheimer’s Disease Progression

Houliang Zhou, Rong Zhou, Yangying Liu, Kanhao Zhao, Li Shen, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer’s Disease Neuroimaging Initiative

Main category: cs.LG

TL;DR: Interpretable spatio-temporal graph neural network using dual Stochastic Differential Equations to predict Alzheimer’s disease progression from irregular longitudinal fMRI data, identifying key brain circuit abnormalities.

Details

Motivation: Existing methods overlook complex spatio-temporal dysfunctions in brain networks, making it challenging to identify objective neuroimaging biomarkers for forecasting Alzheimer's disease progression.

Method: Developed an interpretable spatio-temporal graph neural network framework using dual Stochastic Differential Equations to model irregularly-sampled longitudinal fMRI data.

Result: Identified key brain circuit abnormalities including parahippocampal cortex, prefrontal cortex, and parietal lobule, with disruptions in ventral attention, dorsal attention, and default mode networks that correlate with clinical symptoms.

Conclusion: The framework enables early, individualized prediction of AD progression and reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into AD neurobiological mechanisms.

Abstract: Identifying objective neuroimaging biomarkers to forecast Alzheimer’s disease (AD) progression is crucial for timely intervention. However, this task remains challenging due to the complex dysfunctions in the spatio-temporal characteristics of underlying brain networks, which are often overlooked by existing methods. To address these limitations, we develop an interpretable spatio-temporal graph neural network framework to predict future AD progression, leveraging dual Stochastic Differential Equations (SDEs) to model the irregularly-sampled longitudinal functional magnetic resonance imaging (fMRI) data. We validate our approach on two independent cohorts, including the Open Access Series of Imaging Studies (OASIS-3) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our framework effectively learns sparse regional and connective importance probabilities, enabling the identification of key brain circuit abnormalities associated with disease progression. Notably, we detect the parahippocampal cortex, prefrontal cortex, and parietal lobule as salient regions, with significant disruptions in the ventral attention, dorsal attention, and default mode networks. These abnormalities correlate strongly with longitudinal AD-related clinical symptoms. Moreover, our interpretability strategy reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into the neurobiological mechanisms underlying AD progression. Our findings highlight the potential of spatio-temporal graph-based learning for early, individualized prediction of AD progression, even in the context of irregularly-sampled longitudinal imaging data.

[694] LLM Interpretability with Identifiable Temporal-Instantaneous Representation

Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang

Main category: cs.LG

TL;DR: A temporal causal representation learning framework that bridges mechanistic interpretability and causal representation learning to uncover both time-delayed and instantaneous causal relations in LLMs’ high-dimensional concept space with theoretical guarantees.

Details

Motivation: Current mechanistic interpretability tools like sparse autoencoders lack temporal dependency modeling, instantaneous relation representation, and theoretical guarantees, while existing causal representation learning methods cannot scale to LLMs' rich conceptual space due to computational inefficiency.

Method: Introduces an identifiable temporal causal representation learning framework specifically designed for LLMs’ high-dimensional concept space, capturing both time-delayed and instantaneous causal relations, and extends SAE techniques with this temporal causal framework.

Result: The approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques, the framework successfully discovers meaningful concept relationships in LLM activations.

Conclusion: Modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs, bridging the gap between mechanistic interpretability and theoretically-grounded causal representation learning.

Abstract: Despite Large Language Models’ remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees, undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs’ rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs’ high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.

[695] InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan

Main category: cs.LG

TL;DR: InfMasking is a contrastive synergistic information extraction method that uses infinite stochastic masking during multimodal fusion to enhance synergistic interactions between modalities, achieving SOTA performance across seven benchmarks.

Details

Motivation: Existing multimodal representation learning methods struggle to capture the full spectrum of synergistic information - the unique outcomes from modality interactions that no single modality can achieve alone. This is problematic because synergistic information constitutes the fundamental value proposition of multimodal representation.

Method: InfMasking uses an Infinite Masking strategy that stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. An InfMasking loss is derived to approximate the computationally prohibitive mutual information calculation.

Result: Controlled experiments demonstrate that InfMasking effectively enhances synergistic information between modalities. Evaluations on large-scale real-world datasets show InfMasking achieves state-of-the-art performance across seven benchmarks.

Conclusion: InfMasking successfully addresses the challenge of capturing synergistic information in multimodal representation learning through its infinite masking strategy, enabling richer interaction patterns and superior performance on multimodal tasks.

Abstract: In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.

[696] Massively Multimodal Foundation Models: A Framework for Capturing Dependencies with Specialized Mixture-of-Experts

Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria

Main category: cs.LG

TL;DR: A framework for massively multimodal learning that uses temporal dependency analysis to guide Mixture-of-Experts routing, improving performance on tasks with dozens of heterogeneous input streams.

Details

Motivation: Modern applications involve dozens of heterogeneous input streams (clinical sensors, wearables, imaging, text) with distinct characteristics. Existing MoE architectures route tokens based on similarity alone, overlooking rich temporal dependencies across modalities, which becomes essential as modality count grows.

Method: Proposes a framework that explicitly quantifies temporal dependencies between modality pairs across multiple time lags and uses these to guide MoE routing. A dependency-aware router dispatches tokens to specialized experts based on interaction type, enabling experts to learn generalizable dependency-processing skills.

Result: Experiments across healthcare, activity recognition, and affective computing benchmarks demonstrate substantial performance gains and interpretable routing patterns aligned with domain knowledge.

Conclusion: The proposed dependency-aware MoE routing framework effectively handles massively multimodal settings by capturing complex, time-varying dependencies between modalities, outperforming conventional similarity-based routing approaches.

Abstract: Modern applications increasingly involve dozens of heterogeneous input streams, such as clinical sensors, wearables, imaging, and text, each with distinct measurement models, sampling rates, and noise characteristics. This \textit{massively multimodal} setting, where each sensor constitutes a separate modality, fundamentally differs from conventional multimodal learning focused on two or three modalities. As modality count grows, capturing their complex, time-varying dependencies becomes essential yet challenging. Mixture-of-Experts (MoE) architectures are naturally suited for this setting, their sparse routing mechanism enables efficient scaling across many modalities. Existing MoE architectures route tokens based on similarity alone, overlooking the rich temporal dependencies across modalities. We propose a framework that explicitly quantifies temporal dependencies between modality pairs across multiple time lags and uses these to guide MoE routing. A dependency-aware router dispatches tokens to specialized experts based on interaction type. This principled routing enables experts to learn generalizable dependency-processing skills. Experiments across healthcare, activity recognition, and affective computing benchmarks demonstrate substantial performance gains and interpretable routing patterns aligned with domain knowledge.

[697] ReNF: Rethinking the Design Space of Neural Long-Term Time Series Forecasters

Yihang Lu, Xianwei Meng, Enhong Chen

Main category: cs.LG

TL;DR: A streamlined forecasting paradigm (BDO) combining AR and Direct Output approaches outperforms complex SOTA models by focusing on fundamental principles rather than architectural complexity.

Details

Motivation: Recent progress in Neural Forecasters has been hampered by overemphasis on architectural complexity at the expense of fundamental forecasting principles. The authors aim to revisit core LTSF principles to address this issue.

Method: Proposes Boosted Direct Output (BDO) - a paradigm that hybridizes causal Auto-Regressive structure with stable Direct Output, implicitly realizing forecast combination within a single network. Uses parameter smoothing to address validation-test generalization gap.

Result: A direct temporal MLP with these principled improvements outperforms recent complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases.

Conclusion: Focusing on fundamental forecasting principles (forecast combination, stable optimization) rather than architectural complexity enables simple models to outperform complex SOTA approaches, establishing promising directions for future research.

Abstract: Neural Forecasters (NFs) have become a cornerstone of Long-term Time Series Forecasting (LTSF). However, recent progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting principles. In this work, we revisit the principles of LTSF. We begin by formulating a Variance Reduction Hypothesis (VRH), positing that generating and combining multiple forecasts is essential to reducing the inherent uncertainty of NFs. Guided by this, we propose Boosted Direct Output (BDO), a streamlined paradigm that synergistically hybridizes the causal structure of Auto-Regressive (AR) with the stability of Direct Output (DO), while implicitly realizing the principle of forecast combination within a single network. Furthermore, we address the critical validation-test generalization gap by employing parameter smoothing to stabilize optimization. Extensive experiments demonstrate that these trivial yet principled improvements enable a direct temporal MLP to outperform recent, complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases. Finally, we empirically verify our hypothesis, establishing a dynamic performance bound that highlights promising directions for future research. The code for review is available at: https://anonymous.4open.science/r/ReNF-A151.

[698] Anytime-Valid Answer Sufficiency Certificates for LLM Generation via Sequential Information Lift

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.LG

TL;DR: Sequential-EDFL applies anytime-valid sequential testing to language model generation, using information lift tracking with formal error control to stop generation early while maintaining statistical guarantees.

Details

Motivation: To reduce unnecessary language model generation length while maintaining formal statistical guarantees about information sufficiency, addressing the computational inefficiency of generating full sequences when early stopping is possible.

Method: Uses sequential testing with self-normalized empirical-Bernstein e-processes to track information lift (log-likelihood ratio between full model and weakened “skeleton” baselines), with online mean estimation, mixture e-processes for multiple parameters, and adaptive resets for distributional drift.

Result: Reduces generation length by 22-28% relative to sequential baselines while maintaining delta-level control with 12% computational overhead; automated skeletons (distilled submodels and randomized logits) show robustness; when combined with correctness gate, validates 83% of stops.

Conclusion: EDFL provides formal anytime-valid guarantees for information sufficiency but not factual correctness, serving as a first-stage filter to reduce verification burden; not a standalone solution for safety-critical domains but can significantly improve efficiency.

Abstract: We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), which applies anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift, defined as the log-likelihood ratio between the full model and deliberately weakened “skeleton” baselines, using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. This delta guarantee controls premature stopping when information lift is insufficient relative to the skeleton, and it does not imply delta control of factual incorrectness or hallucinations. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation length by 22 to 28 percent relative to sequential baselines while maintaining delta-level control with 12 percent computational overhead. We introduce automated skeletons (distilled submodels and randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries plus a verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness. Specifically, 10.9 percent of stopped sequences remain incorrect even with the gate (13.2 to 22.7 percent without it). EDFL serves as a first-stage filter that can reduce verification burden: when applied to stopped sequences, the gate validates 83 percent of stops, requiring full verification only for the remaining 17 percent, plus all non-stopped sequences. EDFL is not a standalone solution for safety-critical domains.

[699] PIKAN: Physics-Inspired Kolmogorov-Arnold Networks for Explainable UAV Channel Modelling

Kürşat Tekbıyık, Güneş Karabulut Kurt, Antoine Lesage-Landry

Main category: cs.LG

TL;DR: PIKAN embeds physical principles into neural networks for UAV channel modeling, achieving DL-level accuracy with interpretable symbolic expressions and far fewer parameters.

Details

Motivation: UAV communications need accurate yet interpretable A2G channel models that can adapt to nonstationary environments. Current approaches either offer interpretability but lack flexibility (deterministic models) or provide accuracy but lack explainability (DL models).

Method: Proposes Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (free-space path loss, two-ray reflections) as flexible inductive biases into the learning process, unlike rigid physics-informed neural networks (PINNs).

Result: PIKAN achieves comparable accuracy to DL models on UAV A2G measurement data while providing symbolic and explainable expressions aligned with propagation laws. It uses only 232 parameters (37x lighter than MLP baselines) without sacrificing correlation with measurements.

Conclusion: PIKAN offers an efficient, interpretable, and scalable solution for UAV channel modeling in beyond-5G and 6G networks, bridging the gap between accuracy and explainability.

Abstract: Unmanned aerial vehicle (UAV) communications demand accurate yet interpretable air-to-ground (A2G) channel models that can adapt to nonstationary propagation environments. While deterministic models offer interpretability and deep learning (DL) models provide accuracy, both approaches suffer from either rigidity or a lack of explainability. To bridge this gap, we propose the Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (e.g., free-space path loss, two-ray reflections) into the learning process. Unlike physics-informed neural networks (PINNs), PIKAN is more flexible for applying physical information because it introduces them as flexible inductive biases. Thus, it enables a more flexible training process. Experiments on UAV A2G measurement data show that PIKAN achieves comparable accuracy to DL models while providing symbolic and explainable expressions aligned with propagation laws. Remarkably, PIKAN achieves this performance with only 232 parameters, making it up to 37 times lighter than multilayer perceptron (MLP) baselines with thousands of parameters, without sacrificing correlation with measurements and also providing symbolic expressions. These results highlight PIKAN as an efficient, interpretable, and scalable solution for UAV channel modelling in beyond-5G and 6G networks.

[700] Post-hoc Stochastic Concept Bottleneck Models

Wiktor Jan Hoffmann, Sonia Laguna, Moritz Vandenhirtz, Emanuele Palumbo, Julia E. Vogt

Main category: cs.LG

TL;DR: PSCBMs add a lightweight covariance-prediction module to pre-trained CBMs to model concept dependencies without retraining, improving accuracy and intervention performance.

Details

Motivation: While modeling concept dependencies improves CBM performance under interventions, existing approaches require full model retraining which is often infeasible due to data/compute limitations.

Method: Post-hoc Stochastic Concept Bottleneck Models (PSCBMs) augment any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. Two training strategies are proposed.

Result: PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. They perform much better than CBMs under interventions due to modeling concept dependencies, while remaining far more efficient than retraining from scratch.

Conclusion: PSCBMs provide an efficient, lightweight approach to enhance pre-trained CBMs with concept dependency modeling, enabling improved intervention performance without the need for full model retraining.

Abstract: Concept Bottleneck Models (CBMs) are interpretable models that predict the target variable through high-level human-understandable concepts, allowing users to intervene on mispredicted concepts to adjust the final output. While recent work has shown that modeling dependencies between concepts can improve CBM performance, especially under interventions, such approaches typically require retraining the entire model, which may be infeasible when access to the original data or compute is limited. In this paper, we introduce Post-hoc Stochastic Concept Bottleneck Models (PSCBMs), a lightweight method that augments any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. We propose two training strategies and show on real-world data that PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. Furthermore, we show that due to the modeling of concept dependencies, PSCBMs perform much better than CBMs under interventions, while remaining far more efficient than retraining a similar stochastic model from scratch.

[701] Causal Ordering for Structure Learning from Time Series

Pedro P. Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris

Main category: cs.LG

TL;DR: DOTS is a diffusion-based causal discovery method for time series that uses multiple causal orderings instead of a single one to better recover the transitive closure of the underlying DAG, outperforming state-of-the-art baselines in accuracy and scalability.

Details

Motivation: Causal discovery in time series faces combinatorial complexity challenges as variables and time points increase. Traditional ordering-based methods are limited by using a single causal ordering, which restricts representational capacity and introduces spurious artifacts.

Method: DOTS (Diffusion Ordered Temporal Structure) leverages multiple valid causal orderings rather than a single one. It uses diffusion-based causal discovery with score matching and diffusion processes for efficient Hessian estimation, operating under standard assumptions like stationarity and additive noise models.

Result: On synthetic benchmarks (3-6 variables, 200-5,000 samples), DOTS improves mean window-graph F1 from 0.63 (best baseline) to 0.81. On the CausalTime real-world benchmark (20-36 variables), DOTS achieves the highest average summary-graph F1 while halving runtime compared to graph-optimization methods.

Conclusion: DOTS establishes itself as a scalable and accurate solution for temporal causal discovery by effectively recovering the transitive closure of underlying DAGs through multiple orderings, outperforming existing methods in both synthetic and real-world evaluations.

Abstract: Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d{=}!3-!6$ variables, $T{=}200!-!5{,}000$ samples), DOTS improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d{=}20!-!36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph $F1$ while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.

[702] Blade: A Derivative-free Bayesian Inversion Method using Diffusion Priors

Hongkai Zheng, Austin Wang, Zihui Wu, Zhengyu Huang, Ricardo Baptista, Yisong Yue

Main category: cs.LG

TL;DR: Blade is a derivative-free Bayesian inversion method using interacting particles with diffusion model priors, achieving accurate posteriors for nonlinear forward models.

Details

Motivation: Many science and engineering applications require Bayesian inversion when forward model derivatives are computationally challenging or impractical to compute, creating a need for effective derivative-free methods.

Method: Blade uses an ensemble of interacting particles with data-driven priors based on diffusion models, handling nonlinear forward models through black-box access without requiring derivatives.

Result: The method achieves superior performance compared to existing derivative-free Bayesian inversion methods on various inverse problems, including challenging highly nonlinear fluid dynamics, with theoretical non-asymptotic convergence analysis.

Conclusion: Blade provides an effective derivative-free Bayesian inversion framework that produces accurate and well-calibrated posteriors for complex inverse problems where derivative computation is impractical.

Abstract: Derivative-free Bayesian inversion is an important task in many science and engineering applications, particularly when computing the forward model derivative is computationally and practically challenging. In this paper, we introduce Blade, which can produce accurate and well-calibrated posteriors for Bayesian inversion using an ensemble of interacting particles. Blade leverages powerful data-driven priors based on diffusion models, and can handle nonlinear forward models that permit only black-box access (i.e., derivative-free). Theoretically, we establish a non-asymptotic convergence analysis to characterize the effects of forward model and prior estimation errors. Empirically, Blade achieves superior performance compared to existing derivative-free Bayesian inversion methods on various inverse problems, including challenging highly nonlinear fluid dynamics.

[703] A Practitioner’s Guide to Kolmogorov-Arnold Networks

Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales

Main category: cs.LG

TL;DR: A comprehensive review of Kolmogorov-Arnold Networks (KANs) that systematically organizes the rapidly expanding literature around three core themes and provides practical guidance for researchers.

Details

Motivation: KANs have emerged as a structured alternative to MLPs, inspired by the Kolmogorov superposition theorem, and there's a need to systematically organize the rapidly expanding literature on this topic.

Method: The review organizes KAN literature around three core themes: (1) clarifying relationships between KANs and Kolmogorov superposition theory, MLPs, and classical kernel methods; (2) analyzing basis functions as a central design axis; (3) summarizing advances in accuracy, efficiency, regularization, and convergence.

Result: Provides a structured overview of KAN research, practical “Choose-Your-KAN” guidance, identification of open research challenges, and a GitHub repository as a structured reference for ongoing research.

Conclusion: KANs represent a promising alternative to MLPs with structured mathematical foundations, and this review serves as a comprehensive resource for researchers while outlining future research directions in this emerging field.

Abstract: Kolmogorov-Arnold Networks (KANs), whose design is inspired-rather than dictated-by the Kolmogorov superposition theorem, have emerged as a structured alternative to MLPs. This review provides a systematic and comprehensive overview of the rapidly expanding KAN literature. The review is organized around three core themes: (i) clarifying the relationships between KANs and Kolmogorov superposition theory (KST), MLPs, and classical kernel methods; (ii) analyzing basis functions as a central design axis; and (iii) summarizing recent advances in accuracy, efficiency, regularization, and convergence. Finally, we provide a practical “Choose-Your-KAN” guide and outline open research challenges and future directions. The accompanying GitHub repository serves as a structured reference for ongoing KAN research.

[704] Functional Distribution Networks (FDN)

Omer Haq

Main category: cs.LG

TL;DR: Functional Distribution Networks (FDN) - a method for input-adaptive uncertainty estimation in regression that maintains calibration under distribution shift through weight-space distributions and beta-ELBO training.

Details

Motivation: Modern probabilistic regressors often remain overconfident under distribution shift, failing to properly adapt uncertainty estimates when facing out-of-distribution (OOD) data.

Method: FDN uses input-conditioned distributions over network weights that induce predictive mixtures with adaptive dispersion. Trained with beta-ELBO and Monte Carlo sampling. Includes evaluation protocol separating interpolation from extrapolation with OOD sanity checks.

Result: Benchmarked against Bayesian, ensemble, dropout, and hypernetwork baselines under matched parameter/update budgets. Assessed accuracy, calibration, and shift-awareness with standard diagnostics.

Conclusion: FDN framework and evaluation protocol aim to make OOD-aware, well-calibrated neural regression practical and modular for real-world applications.

Abstract: Modern probabilistic regressors often remain overconfident under distribution shift. We present Functional Distribution Networks (FDN), an input-conditioned distribution over network weights that induces predictive mixtures whose dispersion adapts to the input. FDN is trained with a beta-ELBO and Monte Carlo sampling. We further propose an evaluation protocol that cleanly separates interpolation from extrapolation and stresses OOD sanity checks (e.g., that predictive likelihood degrades under shift while in-distribution accuracy and calibration are maintained). On standard regression tasks, we benchmark against strong Bayesian, ensemble, dropout, and hypernetwork baselines under matched parameter and update budgets, and assess accuracy, calibration, and shift-awareness with standard diagnostics. Together, the framework and protocol aim to make OOD-aware, well-calibrated neural regression practical and modular.

[705] Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape

Xinyuan Song, Ziye Ma

Main category: cs.LG

TL;DR: The paper studies how robust loss functions (based on nonparametric regression) improve optimization landscape and robustness in noisy matrix sensing compared to MSE loss, especially for non-Gaussian/heavy-tailed noise.

Details

Motivation: MSE loss is unreliable for non-Gaussian or heavy-tailed noise in regression tasks. There's a need for robust loss functions that can handle diverse noise distributions while maintaining good optimization properties.

Method: Adopt a robust loss based on nonparametric regression using kernel-based estimate of residual density and maximizing estimated log-likelihood. Analyze how this loss reshapes optimization landscape by examining upper-bound of RIP constants for spurious local minima to disappear.

Result: The robust loss coincides with MSE under Gaussian errors but remains stable under more general settings. It excels at handling large noise and remains robust across diverse noise distributions. The loss improves optimization landscape by affecting RIP constants.

Conclusion: Simply changing the loss function can enhance robustness of machine learning tasks. The work provides an intuitive analytical framework for understanding how loss functions affect optimization landscape and robustness in non-convex problems.

Abstract: In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.

[706] Dynamic Graph Neural Networks for Physiological Based Pharmacokinetic Modeling: A Novel Data Driven Approach to Drug Concentration Prediction

Su Liu, Xin Hu, Shurong Wen, Chengyi Chen, Jiaqi Liu, Lanruo Wang, Jiexi Xu

Main category: cs.LG

TL;DR: Dynamic Graph Neural Network outperforms traditional MLP and LSTM for data-driven PBPK modeling by capturing inter-organ interactions through physiological graph structure.

Details

Motivation: Traditional PBPK models use ODEs with simplifying assumptions that limit their ability to capture nonlinear and system-level physiological interactions between organs.

Method: Proposed a Dynamic Graph Neural Network that models inter-organ interactions through recurrent message passing on a physiological graph, compared against MLP and LSTM baselines.

Result: Dynamic GNN achieved lowest MAPE (15.7%) and highest R² (0.9342), showing better capture of inter-organ pharmacokinetic relationships despite slightly higher absolute error than MLP.

Conclusion: Structure-aware modeling is important for PBPK applications, and Dynamic GNN offers a scalable, equation-free alternative for data-driven pharmacokinetic prediction.

Abstract: Physiologically Based Pharmacokinetic (PBPK) modeling is a key tool in drug development for predicting drug concentration dynamics across organs. Traditional PBPK approaches rely on ordinary differential equations with simplifying assumptions that limit their ability to capture nonlinear and system-level physiological interactions. In this work, we investigate data-driven PBPK modeling using deep learning. We implement two baseline architectures – a multilayer perceptron (MLP) and a long short-term memory (LSTM) network – and propose a Dynamic Graph Neural Network (Dynamic GNN) that explicitly models inter-organ interactions through recurrent message passing on a physiological graph. Experiments on a multi-organ pharmacokinetic dataset show that the Dynamic GNN achieves the lowest mean absolute percentage error (MAPE) of 15.7% among all models, demonstrating improved relative accuracy despite slightly higher absolute error compared to the MLP baseline. The model attains an R2 of 0.9342 with more stable error behavior and better captures inter-organ pharmacokinetic relationships. These results highlight the importance of structure-aware modeling for PBPK applications and demonstrate that the proposed Dynamic GNN offers a scalable, equation-free alternative for data-driven pharmacokinetic prediction.

[707] Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures

James Josep Perry, Pablo Garcia-Conde Ortiz, George Konstantinou, Cornelie Vergouwen, Edlyn Santha Kumaran, Morteza Moradi

Main category: cs.LG

TL;DR: A data-driven framework using two learning approaches (semi-supervised and unsupervised) with multi-domain signal processing to extract reliable health indicators for composite structures under fatigue loading.

Details

Motivation: Extracting reliable health indicators for aerospace composite structures is challenging due to material variability, stochastic damage evolution, diverse damage modes, manufacturing defects, and in-service incidents. Ground-truth health indicators are unavailable, making traditional approaches insufficient.

Method: Two integrated approaches: (1) Diversity-DeepSAD (semi-supervised anomaly detection) with continuous auxiliary labels as damage proxies, overcoming binary label limitations; (2) DTC-VAE (degradation-trend-constrained variational autoencoder) with monotonicity constraint. Uses guided waves with multiple excitation frequencies, explores time/frequency/time-frequency representations, and fuses per-frequency HIs via unsupervised ensemble learning.

Result: Diversity-DeepSAD achieved 81.6% performance with FFT features, while DTC-VAE delivered the most consistent health indicators with 92.3% performance, outperforming existing baselines.

Conclusion: The proposed data-driven framework successfully extracts reliable health indicators for composite structures, with DTC-VAE showing superior performance and consistency, enabling better condition monitoring and maintenance planning for aerospace applications.

Abstract: Health indicators (HIs) are central to diagnosing and prognosing the condition of aerospace composite structures, enabling efficient maintenance and operational safety. However, extracting reliable HIs remains challenging due to variability in material properties, stochastic damage evolution, and diverse damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents (e.g., bird strikes) further complicate this process. This study presents a comprehensive data-driven framework that learns HIs via two learning approaches integrated with multi-domain signal processing. Because ground-truth HIs are unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach augmented with continuous auxiliary labels used as hypothetical damage proxies, which overcomes the limitation of prior binary labels that only distinguish healthy and failed states while neglecting intermediate degradation, and (ii) a degradation-trend-constrained variational autoencoder (DTC-VAE), in which the monotonicity criterion is embedded via an explicit trend constraint. Guided waves with multiple excitation frequencies are used to monitor single-stiffener composite structures under fatigue loading. Time, frequency, and time-frequency representations are explored, and per-frequency HIs are fused via unsupervised ensemble learning to mitigate frequency dependence and reduce variance. Using fast Fourier transform features, the augmented Diversity-DeepSAD model achieved 81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3% performance, outperforming existing baselines.

[708] Controllable Flow Matching for Online Reinforcement Learning

Bin Wang, Boxiang Tao, Haifeng Jing, Hongbo Dou, Zijian Wang

Main category: cs.LG

TL;DR: CtrlFlow is a model-based RL method that uses conditional flow matching to directly model optimal trajectory distributions instead of environment dynamics, improving sample efficiency and generalization.

Details

Motivation: Traditional MBRL methods suffer from model error accumulation over long-horizon rollouts, leading to instability. There's a need for methods that avoid explicit dynamics modeling while maintaining data efficiency.

Method: Proposes CtrlFlow using conditional flow matching to directly model trajectory distributions from initial states to high-return terminal states. Minimizes control energy via non-linear Controllability Gramian Matrix for optimal trajectory sampling.

Result: Outperforms dynamics models on MuJoCo benchmarks in online settings, achieves superior sample efficiency compared to standard MBRL methods, and generates diverse trajectory data that enhances policy robustness and cross-task generalization.

Conclusion: CtrlFlow provides a promising alternative to traditional dynamics modeling in MBRL by directly learning optimal trajectory distributions, addressing model error accumulation while maintaining data efficiency and improving generalization.

Abstract: Model-based reinforcement learning (MBRL) typically relies on modeling environment dynamics for data efficiency. However, due to the accumulation of model errors over long-horizon rollouts, such methods often face challenges in maintaining modeling stability. To address this, we propose CtrlFlow, a trajectory-level synthetic method using conditional flow matching (CFM), which directly modeling the distribution of trajectories from initial states to high-return terminal states without explicitly modeling the environment transition function. Our method ensures optimal trajectory sampling by minimizing the control energy governed by the non-linear Controllability Gramian Matrix, while the generated diverse trajectory data significantly enhances the robustness and cross-task generalization of policy learning. In online settings, CtrlFlow demonstrates the better performance on common MuJoCo benchmark tasks than dynamics models and achieves superior sample efficiency compared to standard MBRL methods.

[709] Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Saiyong Yang, Yunfang Wu

Main category: cs.LG

TL;DR: LTE (Learning to reason from Trial and Error) is a new RLVR approach that improves language model reasoning by having models learn from their own mistakes, outperforming existing methods without needing external expert guidance.

Details

Motivation: Existing RLVR approaches suffer from exploration stagnation - they only train on models' own on-policy responses, limiting learning to the model's initial capability. Off-policy solutions exist but require external expert guidance which is scarce and not scalable.

Method: Proposes LTE (Learning to reason from Trial and Error), which hints language models with their previously self-made mistakes during training, eliminating the need for external expert guidance while enabling better learning from errors.

Result: LTE outperforms normal GRPO by 5.02 in Pass@1 and 9.96 in Pass@k on average across six mathematical reasoning benchmarks for Qwen3-8B-Base. It even beats methods requiring external gold solutions after aligning experimental setups, and successfully mitigates exploration stagnation.

Conclusion: LTE effectively addresses exploration stagnation in RLVR by enabling language models to learn from their own trial-and-error process, enhancing both exploitation and exploration during training without requiring external expert guidance.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of language models (LMs) recently. However, existing RLVR approaches merely train LMs based on their own generated on-policy responses and are constrained by the initial capability of LMs, thus prone to exploration stagnation, in which LMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems, but relies on external expert guidance that is limited in availability and scalability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LMs with their previously self-made mistakes, not requiring any external expert guidance. Experiments validate the effectiveness of LTE, which outperforms the normal group relative policy optimization (GRPO) by 5.02 in Pass@1 and 9.96 in Pass@k on average across six mathematical reasoning benchmarks for Qwen3-8B-Base and even performs better than methods that require external gold solutions as guidance after aligning the experimental setup. Further analysis confirms that LTE successfully mitigates exploration stagnation and enhances both exploitation and exploration during training. Our code is available at https://anonymous.4open.science/r/Learning-from-Trial-and-Error.

[710] Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed framework

Xiaoyu Fan, Lin Guo, Ruizhen Jia, Yang Tian, Zhihao Yang, Weihao Li, Boxue Tian

Main category: cs.LG

TL;DR: MolRuleLoss is a framework that improves molecular property prediction models by incorporating substructure substitution rules into the loss function, boosting accuracy and generalizability for both in-distribution and out-of-distribution molecules.

Details

Motivation: AI models for molecular property prediction often have poor accuracy in regression tasks and perform catastrophically poorly for out-of-distribution molecules, limiting their practical utility in drug discovery.

Method: MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into molecular property regression models’ loss functions, using domain knowledge about how molecular substructure changes affect properties.

Result: Significant performance improvements across multiple tasks: 2.6-33.3% RMSE reductions for lipophilicity, solubility, and solvation-free energy; dramatic improvement for OOD molecular weight prediction (RMSE from 29.507 to 0.007); better generalizability for activity cliff and OOD molecules.

Conclusion: MolRuleLoss effectively boosts prediction accuracy and generalizability of molecular property regression models by incorporating chemical domain knowledge, supporting diverse applications in cheminformatics and AI-aided drug discovery.

Abstract: Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM’s loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for “activity cliff” molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM’s error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.

[711] Grounded Test-Time Adaptation for LLM Agents

Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

Main category: cs.LG

TL;DR: LLM agents struggle with novel environments due to syntax/semantic mismatches. Two adaptation strategies proposed: online distributional adaptation for format alignment, and deployment-time dynamics grounding for learning causal dynamics. Both improve generalization with minimal cost, especially dynamics grounding for complex environments.

Details

Motivation: LLM-based agents fail to generalize to novel environments due to mismatch between pre-training and test-time conditions. Two failure modes: syntactic misunderstanding of environment-specific formats and semantic misunderstanding of state-transition dynamics that are only revealed at deployment.

Method: Two complementary strategies: 1) Online distributional adaptation - learns lightweight adaptation vector to bias model’s output distribution for environment response format alignment. 2) Deployment-time dynamics grounding - uses persona-driven exploration to systematically probe and learn environment’s causal dynamics before task execution, creating nonparametric world model.

Result: Both strategies effective across diverse benchmarks (function calling, web navigation) with minimal computational cost. Dynamics grounding particularly effective in complex environments with unpredictable dynamics. On WebArena multi-site split, success rate increased from 2% to 23%.

Conclusion: The proposed adaptation strategies provide robust path toward more generalizable LLM agents. Dynamics grounding addresses fundamental challenge of learning environment dynamics at deployment, significantly improving performance in novel, complex environments.

Abstract: Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model’s output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment’s causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent’s success rate from 2% to 23%.

[712] Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli

Main category: cs.LG

TL;DR: The paper proposes using well-designed behavior policies to collect off-policy data for lower-variance return estimates in online RL, improving sample efficiency over traditional on-policy methods.

Details

Motivation: Many RL algorithms suffer from poor sample efficiency and training instability due to high-variance return estimates. Traditional on-policy data collection is not variance-optimal, and recent off-policy evaluation results show that well-designed behavior policies can provide provably lower-variance estimates.

Method: Extends key insights from off-policy evaluation to online RL setting. Uses a single behavior policy (not multiple workers) to collect data for policy improvement with provably lower-variance return estimates. Extends two policy-gradient methods with this regime.

Result: Demonstrates better sample efficiency and performance over a diverse set of environments compared to traditional approaches.

Conclusion: Well-designed behavior policies for off-policy data collection can provide lower-variance return estimates, leading to improved sample efficiency and performance in online reinforcement learning.

Abstract: Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance weighted samples for de-biasing and managing variance appropriately. Generally these approaches are concerned with reconciling data collected from multiple workers in parallel, while the policy is updated asynchronously, mismatch between the workers and policy is corrected in a mathematically sound way. Here we consider only one worker - the behaviour policy, which is used to collect data for policy improvement, with provably lower variance return estimates. In our experiments we extend two policy-gradient methods with this regime, demonstrating better sample efficiency and performance over a diverse set of environments.

[713] Optimal Look-back Horizon for Time Series Forecasting in Federated Learning

Dahao Tang, Nan Yang, Yanli Li, Zhiyu Zhu, Zhibo Jin, Dong Yuan

Main category: cs.LG

TL;DR: A theoretical framework for adaptive look-back horizon selection in federated time series forecasting using intrinsic space representation and loss decomposition analysis.

Details

Motivation: Look-back horizon selection is challenging in federated time series forecasting due to decentralized, heterogeneous, non-independent data. Existing methods focus on centralized settings and don't address federated learning constraints.

Method: Proposes a principled framework with: 1) Synthetic data generator capturing temporal structures and client heterogeneity, 2) Transformation to intrinsic representation space, 3) Decomposition of forecasting loss into Bayesian (irreducible uncertainty) and approximation (finite-sample effects) terms.

Result: Analysis shows increasing look-back horizon improves deterministic pattern identifiability but increases approximation error. Total loss minimized at smallest horizon where irreducible loss saturates while approximation loss rises. Provides theoretical foundation for adaptive horizon selection.

Conclusion: The framework offers rigorous theoretical foundation for adaptive horizon selection in federated time series forecasting, addressing the trade-off between pattern identifiability and approximation error in decentralized, heterogeneous settings.

Abstract: Selecting an appropriate look-back horizon remains a fundamental challenge in time series forecasting (TSF), particularly in the federated learning scenarios where data is decentralized, heterogeneous, and often non-independent. While recent work has explored horizon selection by preserving forecasting-relevant information in an intrinsic space, these approaches are primarily restricted to centralized and independently distributed settings. This paper presents a principled framework for adaptive horizon selection in federated time series forecasting through an intrinsic space formulation. We introduce a synthetic data generator (SDG) that captures essential temporal structures in client data, including autoregressive dependencies, seasonality, and trend, while incorporating client-specific heterogeneity. Building on this model, we define a transformation that maps time series windows into an intrinsic representation space with well-defined geometric and statistical properties. We then derive a decomposition of the forecasting loss into a Bayesian term, which reflects irreducible uncertainty, and an approximation term, which accounts for finite-sample effects and limited model capacity. Our analysis shows that while increasing the look-back horizon improves the identifiability of deterministic patterns, it also increases approximation error due to higher model complexity and reduced sample efficiency. We prove that the total forecasting loss is minimized at the smallest horizon where the irreducible loss starts to saturate, while the approximation loss continues to rise. This work provides a rigorous theoretical foundation for adaptive horizon selection for time series forecasting in federated learning.

[714] 3D Dynamic Radio Map Prediction Using Vision Transformers for Low-Altitude Wireless Networks

Nguyen Duc Minh Quang, Chang Liu, Huy-Trung Nguyen, Shuangyang Li, Derrick Wing Kwan Ng, Wei Xiang

Main category: cs.LG

TL;DR: A 3D dynamic radio map framework using Vision Transformer and Transformer modules to predict spatio-temporal power evolution in low-altitude UAV networks.

Details

Motivation: Low-altitude wireless networks face challenges with reliable connectivity due to 3D UAV mobility, time-varying user density, and limited power. Existing radio maps are static/offline and fail to capture real-time power variations and spatio-temporal dependencies in dynamic multi-UAV environments.

Method: Proposes a 3D dynamic radio map (3D-DRM) framework with two main components: 1) Vision Transformer encoder to extract high-dimensional spatial representations from 3D radio maps, and 2) Transformer-based module to model sequential dependencies and predict future power distributions.

Result: The 3D-DRM framework accurately captures fast-varying power dynamics and substantially outperforms baseline models in both radio map reconstruction and short-term prediction tasks.

Conclusion: The proposed 3D dynamic radio map framework effectively addresses the limitations of static radio maps by learning and predicting spatio-temporal power evolution in dynamic low-altitude UAV networks, enabling better radio-aware network optimization.

Abstract: Low-altitude wireless networks (LAWN) are rapidly expanding with the growing deployment of unmanned aerial vehicles (UAVs) for logistics, surveillance, and emergency response. Reliable connectivity remains a critical yet challenging task due to three-dimensional (3D) mobility, time-varying user density, and limited power budgets. The transmit power of base stations (BSs) fluctuates dynamically according to user locations and traffic demands, leading to a highly non-stationary 3D radio environment. Radio maps (RMs) have emerged as an effective means to characterize spatial power distributions and support radio-aware network optimization. However, most existing works construct static or offline RMs, overlooking real-time power variations and spatio-temporal dependencies in multi-UAV networks. To overcome this limitation, we propose a 3D dynamic radio map (3D-DRM) framework that learns and predicts the spatio-temporal evolution of received power. Specially, a Vision Transformer (ViT) encoder extracts high-dimensional spatial representations from 3D RMs, while a Transformer-based module models sequential dependencies to predict future power distributions. Experiments unveil that 3D-DRM accurately captures fast-varying power dynamics and substantially outperforms baseline models in both RM reconstruction and short-term prediction.

[715] Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

Alex Ning, Vainateya Rangaraju, Yen-Ling Kuo

Main category: cs.LG

TL;DR: Visualizing latent state geometries in Transformer-based LLMs using PCA and UMAP reveals separation between attention/MLP components, high initial position norms, and helical positional embeddings.

Details

Motivation: LLMs achieve state-of-the-art results but their internal mechanisms remain difficult to interpret, creating a need for better visualization and understanding of latent state geometries.

Method: Extract layerwise activations at multiple points within Transformer blocks, then apply dimensionality reduction techniques (PCA and UMAP) to visualize latent state geometries in GPT-2 and LLaMa models.

Result: Uncovered clear separation between attention and MLP component outputs across intermediate layers, identified high norm of latent states at initial sequence position, visualized layerwise evolution of latent states, demonstrated helical structure of GPT-2’s positional embeddings, and showed sequence-wise geometric patterns in LLaMa.

Conclusion: Dimensionality reduction techniques effectively reveal previously undocumented geometric patterns in Transformer-based LLMs, providing new insights into their internal mechanisms and making visualization tools publicly available for further research.

Abstract: Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2’s positional embeddings and the sequence-wise geometric patterns in LLaMa. We make our code available at https://github.com/Vainateya/Feature_Geometry_Visualization.

[716] Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz

Main category: cs.LG

TL;DR: F2D2 is a framework that jointly distills both sampling and likelihood evaluation in flow-based models, reducing neural function evaluations by 100x while maintaining accurate likelihood computation and sample quality.

Details

Motivation: Current diffusion and flow-based models require hundreds to thousands of neural function evaluations for likelihood computation, creating a computational bottleneck. While distillation methods accelerate sampling, they sacrifice likelihood tractability or still require expensive integration.

Method: Fast Flow Joint Distillation (F2D2) leverages the insight that in continuous normalizing flows, sampling and likelihood ODEs share a common velocity field. The method jointly distills both the sampling trajectory and cumulative divergence using a single model with an additional divergence prediction head, making it modular and compatible with existing few-step flow models.

Result: F2D2 reduces NFEs for both sampling and likelihood evaluation by two orders of magnitude while maintaining accurate log-likelihood computation and high sample quality. The lightweight self-guidance application enables a 2-step MeanFlow to outperform a 1024-step flow matching model with only one additional backward NFE.

Conclusion: F2D2 solves the long-standing computational bottleneck in flow-based generative models by enabling efficient few-step likelihood evaluation without sacrificing accuracy or sample quality, opening up new applications for likelihood-based capabilities in modern generative models.

Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today’s best generative models – diffusion and flow-based models – still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2’s capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow to outperform a 1024 step flow matching model with only a single additional backward NFE.

[717] Interaction Tensor SHAP

Hiroki Hasegawa, Yukihiko Okada

Main category: cs.LG

TL;DR: IT-SHAP reformulates the Shapley Taylor Interaction Index using tensor algebra to make its computational structure explicit, enabling efficient computation when the value function has Tensor Train representation, while proving intractability without such structure.

Details

Motivation: The Shapley Taylor Interaction Index (STII) extends Shapley values to higher-order interactions but has an exponential combinatorial definition that makes direct computation intractable at scale. There's a need for scalable methods to interpret high-dimensional models through higher-order interaction analysis.

Method: Proposes Interaction Tensor SHAP (IT-SHAP), a tensor algebraic formulation of STII that reformulates it as a linear transformation acting on a value function. Derives an explicit algebraic representation of the weight tensor, showing it has multilinear structure induced by discrete finite difference operators. Analyzes computational complexity under different tensor representations.

Result: 1) Establishes exact Tensor Train representation of STII weight tensor. 2) Develops parallelizable evaluation algorithm with explicit complexity bounds under Tensor Train assumption (NC² complexity). 3) Proves computational intractability (P#-hard) without structural assumptions. Shows computational difficulty depends on algebraic representation rather than interaction index itself.

Conclusion: The computational difficulty of higher-order interaction analysis is determined by the underlying algebraic representation rather than by the interaction index itself. This provides a theoretical foundation for scalable interpretation of high-dimensional models when appropriate tensor structures are present.

Abstract: This study proposes Interaction Tensor SHAP (IT-SHAP), a tensor algebraic formulation of the Shapley Taylor Interaction Index (STII) that makes its computational structure explicit. STII extends the Shapley value to higher order interactions, but its exponential combinatorial definition makes direct computation intractable at scale. We reformulate STII as a linear transformation acting on a value function and derive an explicit algebraic representation of its weight tensor. This weight tensor is shown to possess a multilinear structure induced by discrete finite difference operators. When the value function admits a Tensor Train representation, higher order interaction indices can be computed in the parallel complexity class NC squared. In contrast, under general tensor network representations without structural assumptions, the same computation is proven to be P sharp hard. The main contributions are threefold. First, we establish an exact Tensor Train representation of the STII weight tensor. Second, we develop a parallelizable evaluation algorithm with explicit complexity bounds under the Tensor Train assumption. Third, we prove that computational intractability is unavoidable in the absence of such structure. These results demonstrate that the computational difficulty of higher order interaction analysis is determined by the underlying algebraic representation rather than by the interaction index itself, providing a theoretical foundation for scalable interpretation of high dimensional models.

[718] Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon

Main category: cs.LG

TL;DR: NEUBAY introduces a Bayesian approach to offline RL instead of conservatism, using long-horizon planning with world model posteriors to handle epistemic uncertainty, achieving SOTA on 7 datasets.

Details

Motivation: Current offline RL methods rely on conservatism (penalizing out-of-dataset actions or restricting horizons), but this may not be universally effective. The paper questions this principle and explores a complementary Bayesian perspective to handle epistemic uncertainty in offline data.

Method: Proposes NEUBAY algorithm based on neutral Bayesian principle: 1) Models posterior distribution over plausible world models, 2) Trains history-dependent agent to maximize expected rewards, 3) Uses long-horizon planning (hundreds of steps) with design choices to control compounding errors, 4) Enables test-time generalization through Bayesian uncertainty modeling.

Result: NEUBAY matches or surpasses leading conservative algorithms on D4RL and NeoRL benchmarks, achieving new state-of-the-art on 7 datasets. Excels on low-quality datasets where conservatism fails. Successfully uses rollout horizons of several hundred steps, contrary to common practice.

Conclusion: NEUBAY demonstrates that Bayesian approaches without conservatism can be effective for offline RL, especially with long-horizon planning. Provides foundation for new practical direction in offline and model-based RL, with performance dependent on dataset quality and coverage.

Abstract: Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale this principle to realistic tasks and show that long-horizon planning is critical for reducing value overestimation once conservatism is removed. To make this feasible, we introduce key design choices for performing and learning from long-horizon rollouts while controlling compounding errors. These yield our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with rollout horizons of several hundred steps, contrary to dominant practice. Finally, we characterize datasets by quality and coverage, showing when NEUBAY is preferable to conservative methods. Together, we argue NEUBAY lays the foundation for a new practical direction in offline and model-based RL.

[719] Renormalizable Spectral-Shell Dynamics as the Origin of Neural Scaling Laws

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper derives macroscopic training dynamics from gradient descent in function space, showing that error evolves via a time-dependent operator. Using spectral analysis and coarse-graining, it explains neural scaling laws and double descent through self-similar solutions.

Details

Motivation: To understand the simple macroscopic structure behind deep network training despite nonlinear optimization dynamics, and to unify lazy training and feature learning within a single theoretical framework.

Method: Derive training dynamics from gradient descent in function space, use Kato perturbation theory to get modewise ODEs, introduce logarithmic spectral-shell coarse-graining, and analyze shell energy dynamics with renormalizable shell-dynamics assumption.

Result: Obtains exact system of coupled ODEs in eigenbasis, shows microscopic interactions cancel within shells, derives self-similar solutions with moving resolution frontier, and explains neural scaling laws and double descent phenomena.

Conclusion: The framework unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics, providing a comprehensive theoretical explanation for neural scaling laws and double descent.

Abstract: Neural scaling laws and double-descent phenomena suggest that deep-network training obeys a simple macroscopic structure despite highly nonlinear optimization dynamics. We derive such structure directly from gradient descent in function space. For mean-squared error loss, the training error evolves as $\dot e_t=-M(t)e_t$ with $M(t)=J_{θ(t)}J_{θ(t)}^{!*}$, a time-dependent self-adjoint operator induced by the network Jacobian. Using Kato perturbation theory, we obtain an exact system of coupled modewise ODEs in the instantaneous eigenbasis of $M(t)$. To extract macroscopic behavior, we introduce a logarithmic spectral-shell coarse-graining and track quadratic error energy across shells. Microscopic interactions within each shell cancel identically at the energy level, so shell energies evolve only through dissipation and external inter-shell interactions. We formalize this via a \emph{renormalizable shell-dynamics} assumption, under which cumulative microscopic effects reduce to a controlled net flux across shell boundaries. Assuming an effective power-law spectral transport in a relevant resolution range, the shell dynamics admits a self-similar solution with a moving resolution frontier and explicit scaling exponents. This framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics.

[720] PathFinder: Advancing Path Loss Prediction for Single-to-Multi-Transmitter Scenario

Zhijie Zhong, Zhiwen Yu, Pengyu Li, Jianming Lv, C. L. Philip Chen, Min Chen

Main category: cs.LG

TL;DR: PathFinder: A novel deep learning architecture for radio path loss prediction that addresses limitations in environmental modeling, multi-transmitter scenarios, and distribution shift generalization through disentangled feature encoding and attention mechanisms.

Details

Motivation: Current deep learning-based radio path loss prediction methods have three key limitations: 1) passive environmental modeling that overlooks transmitters and key features, 2) focus on single-transmitter scenarios despite real-world multi-transmitter prevalence, and 3) poor generalization under distribution shifts when training/testing environments differ.

Method: Proposes PathFinder architecture with disentangled feature encoding to actively model buildings and transmitters, Mask-Guided Low-rank Attention to independently focus on receiver and building regions, and Transmitter-Oriented Mixup strategy for robust training. Also introduces S2MT-RPP benchmark for evaluating extrapolation from single-to-multi-transmitter scenarios.

Result: PathFinder significantly outperforms state-of-the-art methods, especially in challenging multi-transmitter scenarios, demonstrating superior generalization capabilities under distribution shifts.

Conclusion: PathFinder effectively addresses key limitations in radio path loss prediction through active environmental modeling, attention mechanisms, and robust training strategies, enabling better performance in realistic multi-transmitter scenarios and improved generalization under distribution shifts.

Abstract: Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are publicly available at: https://emorzz1g.github.io/PathFinder/.

[721] Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation

Liam Collins, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Donald Loveland, Leonardo Neves, Neil Shah

Main category: cs.LG

TL;DR: ID and text features in sequential recommendation are complementary, and a simple ensemble of independently trained ID-based and text-based models outperforms complex fusion methods.

Details

Motivation: There's a lack of understanding about the complementarity between ID embeddings and modality (text) features in sequential recommendation. Some works claim modality features make IDs unnecessary, while others use complex fusion strategies without clear justification for the complementarity.

Method: Proposes a simple ensemble method that preserves ID-text complementarity through independent training of ID-based and text-based models, then combines them using ensembling rather than complex fusion architectures.

Result: The simple ensemble method outperforms several competitive sequential recommendation baselines, demonstrating that both ID and text features are necessary for state-of-the-art performance.

Conclusion: Both ID and text features are complementary and necessary for optimal sequential recommendation performance, but complex fusion architectures are not required - simple ensembling of independently trained models is sufficient.

Abstract: Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method’s simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.

[722] When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper identifies sufficient conditions for deep learning systems to exhibit power-law scaling within the Generalized Resolution-Shell Dynamics framework, showing that such scaling emerges from structural constraints and time-rescaling covariance rather than renormalizability alone.

Details

Motivation: Power-law scaling is widely observed in deep learning systems but its theoretical origins and scope of validity are not fully understood. The paper aims to identify the structural conditions under which such scaling emerges within the Generalized Resolution-Shell Dynamics framework.

Method: The authors use the Generalized Resolution-Shell Dynamics framework to model learning as spectral energy transport across logarithmic resolution shells. They identify a set of sufficient conditions for renormalizable coarse-grained description, including bounded gradient propagation, weak functional incoherence at initialization, controlled Jacobian evolution, and log-shift invariance of renormalized shell couplings.

Result: The paper shows that power-law scaling does not follow from renormalizability alone, but emerges as a rigidity consequence when log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, forcing the renormalized GRSD velocity field into a power-law form.

Conclusion: Power-law scaling in deep learning systems requires specific structural constraints beyond mere renormalizability, emerging from the combination of log-shift invariance and time-rescaling covariance within the GRSD framework.

Abstract: Empirical power–law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution–Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse–grained dynamical description of training. Within GRSD, power–law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse–grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log–shift invariance of renormalized shell couplings. We further show that power–law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log–shift invariance is combined with the intrinsic time–rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power–law form.

[723] Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

Wenlong Tang

Main category: cs.LG

TL;DR: Multi-agent language framework enables continual strategy evolution without fine-tuning LLM parameters by using external latent vectors updated through environmental interaction and reinforcement feedback.

Details

Motivation: To enable language agents to develop and evolve strategic behaviors over time without the computational cost of fine-tuning model parameters, seeking a low-cost, scalable, and interpretable approach to strategic representation.

Method: Dual-loop architecture: behavior loop adjusts action preferences based on environmental rewards, while language loop updates external latent vectors by reflecting on semantic embeddings of generated text. Latent vectors of abstract concepts are liberated from static semantic representations.

Result: Agents’ latent spaces show clear convergence trajectories under reflection-driven updates with structured shifts at critical moments. System demonstrates emergent ability to implicitly infer and adapt to emotional agents without shared rewards.

Conclusion: External latent space can provide language agents with low-cost, scalable, and interpretable abstract strategic representation without modifying model parameters, enabling continual strategy evolution.

Abstract: This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model’s parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual-loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text. Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long-horizon multi-round interactions. Experiments show that agents’ latent spaces exhibit clear convergence trajectories under reflection-driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low-cost, scalable, and interpretable form of abstract strategic representation.

[724] Subgroup Discovery with the Cox Model

Zachary Izzo, Iain Melvin

Main category: cs.LG

TL;DR: First study of subgroup discovery for survival analysis using Cox models, introducing EPE and CRS metrics and eight algorithms to find interpretable subgroups with high predictive accuracy.

Details

Motivation: To address the lack of methods for finding interpretable subgroups in survival analysis where Cox models perform well, overcoming limitations of existing quality functions for subgroup discovery in this context.

Method: Introduces two key innovations: Expected Prediction Entropy (EPE) for evaluating survival models predicting hazard functions, and Conditional Rank Statistics (CRS) for quantifying individual deviation from subgroup survival distributions. Presents eight algorithms including a main algorithm combining EPE and CRS with theoretical correctness guarantees.

Result: Theoretical analysis shows EPE and CRS solve problems with existing metrics. Empirical evaluation on synthetic and real data demonstrates recovery of ground-truth subgroups in well-specified cases and better model fit compared to naive Cox model fitting. NASA jet engine case study reveals known nonlinearities and validates practical design choices.

Conclusion: The proposed framework successfully addresses subgroup discovery for Cox survival models, providing both theoretical foundations and practical algorithms that outperform baseline approaches and offer meaningful insights in real-world applications.

Abstract: We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. Our work is the first to study this particular subgroup problem, for which we make several contributions. Subgroup discovery methods generally require a “quality function” in order to sift through and select the most advantageous subgroups. We first examine why existing natural choices for quality functions are insufficient to solve the subgroup discovery problem for the Cox model. To address the shortcomings of existing metrics, we introduce two technical innovations: the expected prediction entropy (EPE), a novel metric for evaluating survival models which predict a hazard function; and the conditional rank statistics (CRS), a statistical object which quantifies the deviation of an individual point to the distribution of survival times in an existing subgroup. We study the EPE and CRS theoretically and show that they can solve many of the problems with existing metrics. We introduce a total of eight algorithms for the Cox subgroup discovery problem. The main algorithm is able to take advantage of both the EPE and the CRS, allowing us to give theoretical correctness results for this algorithm in a well-specified setting. We evaluate all of the proposed methods empirically on both synthetic and real data. The experiments confirm our theory, showing that our contributions allow for the recovery of a ground-truth subgroup in well-specified cases, as well as leading to better model fit compared to naively fitting the Cox model to the whole dataset in practical settings. Lastly, we conduct a case study on jet engine simulation data from NASA. The discovered subgroups uncover known nonlinearities/homogeneity in the data, and which suggest design choices which have been mirrored in practice.

[725] Müntz-Szász Networks: Neural Architectures with Learnable Power-Law Bases

Gnankan Landry Regis N’guessan

Main category: cs.LG

TL;DR: MSN replaces fixed activation functions with learnable fractional power bases, achieving superior approximation for singular functions common in physics with fewer parameters.

Details

Motivation: Standard neural networks with fixed activation functions (ReLU, tanh, sigmoid) are poorly suited for approximating functions with singular or fractional power behavior that arises ubiquitously in physics (boundary layers, fracture mechanics, corner singularities).

Method: Introduces Müntz-Szász Networks (MSN) that replace fixed smooth activations with learnable fractional power bases. Each edge computes φ(x) = Σ a_k|x|^μ_k + Σ b_k sign(x)|x|^λ_k, where exponents {μ_k, λ_k} are learned alongside coefficients.

Result: MSN achieves 5-8x lower error than MLPs with 10x fewer parameters on singular target functions. For PINN benchmarks including singular ODE and stiff boundary-layer problems, MSN achieves 3-6x improvement while learning interpretable exponents matching known solution structure.

Conclusion: Theory-guided architectural design can yield dramatic improvements for scientifically-motivated function classes, with MSN inheriting universal approximation from Müntz-Szász theorem and establishing superior approximation rates for singular functions.

Abstract: Standard neural network architectures employ fixed activation functions (ReLU, tanh, sigmoid) that are poorly suited for approximating functions with singular or fractional power behavior, a structure that arises ubiquitously in physics, including boundary layers, fracture mechanics, and corner singularities. We introduce Müntz-Szász Networks (MSN), a novel architecture that replaces fixed smooth activations with learnable fractional power bases grounded in classical approximation theory. Each MSN edge computes $φ(x) = \sum_k a_k |x|^{μ_k} + \sum_k b_k \mathrm{sign}(x)|x|^{λ_k}$, where the exponents ${μ_k, λ_k}$ are learned alongside the coefficients. We prove that MSN inherits universal approximation from the Müntz-Szász theorem and establish novel approximation rates: for functions of the form $|x|^α$, MSN achieves error $\mathcal{O}(|μ- α|^2)$ with a single learned exponent, whereas standard MLPs require $\mathcal{O}(ε^{-1/α})$ neurons for comparable accuracy. On supervised regression with singular target functions, MSN achieves 5-8x lower error than MLPs with 10x fewer parameters. Physics-informed neural networks (PINNs) represent a particularly demanding application for singular function approximation; on PINN benchmarks including a singular ODE and stiff boundary-layer problems, MSN achieves 3-6x improvement while learning interpretable exponents that match the known solution structure. Our results demonstrate that theory-guided architectural design can yield dramatic improvements for scientifically-motivated function classes.

[726] A Universal and Robust Framework for Multiple Gas Recognition Based-on Spherical Normalization-Coupled Mahalanobis Algorithm

Shuai Chen, Yang Song, Chen Wang, Ziran Wang

Main category: cs.LG

TL;DR: The paper proposes SNM-Module, a universal post-processing module for open-set gas recognition that addresses feature distribution shift and unknown gas interference through spherical normalization and Mahalanobis distance.

Details

Motivation: Existing methods for open-set gas recognition fail to handle anisotropic feature distributions and dynamic signal intensity variations, leading to poor performance in real-world E-nose deployments.

Method: Proposes SNM-Module with two key components: 1) cascaded batch and L2 normalization to project features onto a unit hypersphere (eliminating intensity fluctuations), and 2) Mahalanobis distance to construct adaptive ellipsoidal decision boundaries that conform to anisotropic feature geometry.

Result: Transformer+SNM achieves near-theoretical-limit performance: AUROC 0.9977, unknown gas detection rate 99.57% at 5% FPR, 3.0% AUROC improvement over SOTA, 91.0% standard deviation reduction compared to CAC, and exceptional robustness across sensor positions (std < 0.0028).

Conclusion: The SNM-Module effectively addresses the critical challenge of simultaneously achieving high accuracy and high stability in open-set gas recognition, providing solid support for industrial E-nose deployment.

Abstract: Electronic nose (E-nose) systems face two interconnected challenges in open-set gas recognition: feature distribution shift caused by signal drift and decision boundary failure induced by unknown gas interference. Existing methods predominantly rely on Euclidean distance or conventional classifiers, failing to account for anisotropic feature distributions and dynamic signal intensity variations. To address these issues, this study proposes the Spherical Normalization coupled Mahalanobis (SNM) module, a universal post-processing module for open-set gas recognition. First, it achieves geometric decoupling through cascaded batch and L2 normalization, projecting features onto a unit hypersphere to eliminate signal intensity fluctuations. Second, it utilizes Mahalanobis distance to construct adaptive ellipsoidal decision boundaries that conform to the anisotropic feature geometry. The architecture-agnostic SNM-Module seamlessly integrates with mainstream backbones including Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Transformer. Experiments on the public Vergara dataset demonstrate that the Transformer+SNM configuration achieves near-theoretical-limit performance in discriminating among multiple target gases, with an AUROC of 0.9977 and an unknown gas detection rate of 99.57% at 5% false positive rate, significantly outperforming state-of-the-art methods with a 3.0% AUROC improvement and 91.0% standard deviation reduction compared to Class Anchor Clustering (CAC). The module maintains exceptional robustness across five sensor positions, with standard deviations below 0.0028. This work effectively addresses the critical challenge of simultaneously achieving high accuracy and high stability in open-set gas recognition, providing solid support for industrial E-nose deployment.

[727] Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction

Qianyi Chen, Bo Li

Main category: cs.LG

TL;DR: Proposes a method to improve conditional coverage in conformal prediction by minimizing mean squared error of conditional coverage through a density-weighted pinball loss and three-headed quantile network.

Details

Motivation: Conformal prediction provides marginal coverage guarantees but struggles with reliable conditional coverage for specific inputs. While exact distribution-free conditional coverage is impossible with finite samples, there's a need to improve conditional coverage of standard conformal procedures beyond relaxed notions.

Method: Derives a density-weighted pinball loss using Taylor expansion, where weights are given by conditional density of conformity score at true quantile. Proposes a three-headed quantile network that estimates weights via finite differences using auxiliary quantile levels at 1-α±δ, then fine-tunes the central quantile by optimizing the weighted loss.

Result: Theoretical analysis provides exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.

Conclusion: The proposed approach directly minimizes mean squared error of conditional coverage by refining quantile regression components, offering a practical solution to improve conditional coverage in conformal prediction with theoretical guarantees and empirical validation.

Abstract: While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distribution-free conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at (1-α\pm δ), subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.

[728] Tubular Riemannian Laplace Approximations for Bayesian Neural Networks

Rodrigo Pereira David

Main category: cs.LG

TL;DR: TRL is a new Bayesian approximation method that models neural network posteriors as probabilistic tubes following low-loss valleys, using Riemannian geometry to separate different uncertainty types, achieving ensemble-level calibration at much lower cost.

Details

Motivation: Standard Laplace approximations struggle with the anisotropic, curved loss surfaces and large symmetry groups in modern deep neural networks. Recent geometric approaches attempt to adapt to this structure but need improvement.

Method: TRL models the posterior as a probabilistic tube following low-loss valleys induced by functional symmetries. It uses Fisher/Gauss-Newton metrics to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty, operating as a scalable reparametrized Gaussian approximation with implicit curvature estimates.

Result: On ResNet-18 (CIFAR-10 and CIFAR-100), TRL achieves excellent calibration, matching or exceeding Deep Ensembles in terms of Expected Calibration Error (ECE) while requiring only 1/5 of the training cost.

Conclusion: TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability, offering a practical Bayesian approximation method for modern deep neural networks.

Abstract: Laplace approximations are among the simplest and most practical methods for approximate Bayesian inference in neural networks, yet their Euclidean formulation struggles with the highly anisotropic, curved loss surfaces and large symmetry groups that characterize modern deep models. Recent work has proposed Riemannian and geometric Gaussian approximations to adapt to this structure. Building on these ideas, we introduce the Tubular Riemannian Laplace (TRL) approximation. TRL explicitly models the posterior as a probabilistic tube that follows a low-loss valley induced by functional symmetries, using a Fisher/Gauss-Newton metric to separate prior-dominated tangential uncertainty from data-dominated transverse uncertainty. We interpret TRL as a scalable reparametrised Gaussian approximation that utilizes implicit curvature estimates to operate in high-dimensional parameter spaces. Our empirical evaluation on ResNet-18 (CIFAR-10 and CIFAR-100) demonstrates that TRL achieves excellent calibration, matching or exceeding the reliability of Deep Ensembles (in terms of ECE) while requiring only a fraction (1/5) of the training cost. TRL effectively bridges the gap between single-model efficiency and ensemble-grade reliability.

[729] HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors

Hyunjun Kim

Main category: cs.LG

TL;DR: HOLOGRAPH is a framework that formalizes LLM-guided causal discovery using sheaf theory, representing local causal beliefs as sections of a presheaf over variable subsets, with global causal structure corresponding to global sections and topological obstructions as non-vanishing sheaf cohomology.

Details

Motivation: Causal discovery from observational data is fundamentally limited by identifiability constraints, and existing LLM-based approaches rely on heuristic integration without theoretical grounding.

Method: Formalizes LLM-guided causal discovery through sheaf theory, representing local causal beliefs as sections of a presheaf over variable subsets. Introduces Algebraic Latent Projection for hidden confounders and Natural Gradient Descent on the belief manifold for optimization.

Result: Experiments on synthetic and real-world benchmarks show competitive performance on causal discovery tasks with 50-100 variables. Sheaf-theoretic analysis reveals Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), but Locality axiom fails for larger graphs.

Conclusion: HOLOGRAPH provides rigorous mathematical foundations for LLM-guided causal discovery while achieving competitive performance, revealing fundamental non-local coupling in latent variable projections through sheaf-theoretic analysis.

Abstract: Causal discovery from observational data remains fundamentally limited by identifiability constraints. Recent work has explored leveraging Large Language Models (LLMs) as sources of prior causal knowledge, but existing approaches rely on heuristic integration that lacks theoretical grounding. We introduce HOLOGRAPH, a framework that formalizes LLM-guided causal discovery through sheaf theory–representing local causal beliefs as sections of a presheaf over variable subsets. Our key insight is that coherent global causal structure corresponds to the existence of a global section, while topological obstructions manifest as non-vanishing sheaf cohomology. We propose the Algebraic Latent Projection to handle hidden confounders and Natural Gradient Descent on the belief manifold for principled optimization. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations while achieving competitive performance on causal discovery tasks with 50-100 variables. Our sheaf-theoretic analysis reveals that while Identity, Transitivity, and Gluing axioms are satisfied to numerical precision (<10^{-6}), the Locality axiom fails for larger graphs, suggesting fundamental non-local coupling in latent variable projections. Code is available at https://github.com/hyunjun1121/holograph.

[730] Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space

Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang

Main category: cs.LG

TL;DR: DLCM is a hierarchical language modeling framework that compresses tokens into variable-length concepts for more efficient computation, achieving better performance with the same FLOPs.

Details

Motivation: Current LLMs waste capacity on predictable tokens while under-allocating computation to semantically important transitions due to uniform token processing, despite language's non-uniform information density.

Method: DLCM learns semantic boundaries from latent representations to compress tokens into variable-length concepts, creating a hierarchical framework with a compression-aware scaling law and decoupled μP parametrization for stable training.

Result: With 4:1 compression ratio, DLCM reallocates one-third of inference compute to higher-capacity reasoning, achieving +2.69% average improvement across 12 zero-shot benchmarks under matched FLOPs.

Conclusion: Hierarchical compression enables more efficient compute allocation by shifting from token-level to concept-level reasoning, fundamentally changing scaling behavior and improving performance without increasing FLOPs.

Abstract: Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $μ$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69$%$ average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.

[731] Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction

Malikussaid, Septian Caesar Floresko, Ade Romadhony, Isman Kurniawan, Warih Maharani, Hilal Hudan Nuha

Main category: cs.LG

TL;DR: Linear regression models for logP prediction violate statistical assumptions due to heteroskedasticity, while tree-based ensembles (Random Forest, XGBoost) provide robust, superior performance and reveal molecular weight as the most important predictor despite weak bivariate correlation.

Details

Motivation: Lipophilicity (logP) prediction is crucial for drug discovery, but standard linear regression models often violate statistical assumptions, potentially invalidating reported performance metrics and limiting their reliability in QSAR applications.

Method: Analyzed 426,850 bioactive molecules from curated PubChem, ChEMBL, and eMolecules databases. Used linear regression models with computed XLOGP3 values, tested remediation strategies (Weighted Least Squares, Box-Cox transformation), and compared with tree-based ensemble methods (Random Forest, XGBoost). Applied SHAP analysis to interpret feature importance and resolve multicollinearity issues.

Result: Found severe heteroskedasticity in linear models (4.2-fold variance increase in lipophilic regions). Remediation strategies failed (Breusch-Pagan p < 0.0001). Tree-based methods performed better (Random Forest R²=0.764, XGBoost R²=0.765) and were robust to heteroskedasticity. SHAP revealed molecular weight as most important predictor (mean absolute SHAP=0.573) despite weak bivariate correlation (0.146), with its effect suppressed by confounding with TPSA.

Conclusion: Standard linear models face fundamental challenges for computed lipophilicity prediction due to heteroskedasticity. Tree-based ensemble methods provide superior performance and robustness, while SHAP analysis offers a principled framework for interpreting complex relationships and resolving multicollinearity paradoxes in QSAR applications.

Abstract: Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP greater than 5) compared to balanced regions (logP 2 to 4). Classical remediation strategies (Weighted Least Squares and Box-Cox transformation) failed to resolve this violation (Breusch-Pagan p-value less than 0.0001 for all variants). Tree-based ensemble methods (Random Forest R-squared of 0.764, XGBoost R-squared of 0.765) proved inherently robust to heteroskedasticity while delivering superior predictive performance. SHAP analysis resolved a critical multicollinearity paradox: despite a weak bivariate correlation of 0.146, molecular weight emerged as the single most important predictor (mean absolute SHAP value of 0.573), with its effect suppressed in simple correlations by confounding with topological polar surface area (TPSA). These findings demonstrate that standard linear models face fundamental challenges for computed lipophilicity prediction and provide a principled framework for interpreting ensemble models in QSAR applications.

[732] Precision Autotuning for Linear Solvers via Reinforcement Learning

Erin Carson, Xinye Chen

Main category: cs.LG

TL;DR: RL framework for adaptive precision tuning of linear solvers using contextual bandits with Q-learning, demonstrated on iterative refinement for solving linear systems.

Details

Motivation: To develop an automated approach for selecting optimal precision configurations in numerical algorithms that balances computational efficiency with accuracy, advancing mixed-precision methods in scientific computing.

Method: Formulated as contextual bandit problem using Q-learning with discretized state space. Features like condition number and matrix norm map to precision configurations via Q-table, optimized with epsilon-greedy strategy to maximize multi-objective reward balancing accuracy and computational cost.

Result: Empirical results show effective precision selection reduces computational cost while maintaining accuracy comparable to double-precision baselines. Framework generalizes to diverse out-of-sample data.

Conclusion: First RL-based precision autotuning work that generalizes to unseen datasets, demonstrating potential for applying RL precision selection to other numerical algorithms in scientific computing.

Abstract: We propose a reinforcement learning (RL) framework for adaptive precision tuning of linear solvers, and can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems $Ax = b$. In this application, our approach dynamically chooses precisions based on calculated features from the system. In detail, a Q-table maps discretized features (e.g., approximate condition number and matrix norm)to actions (chosen precision configurations for specific steps), optimized via an epsilon-greedy strategy to maximize a multi-objective reward balancing accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and offers insight into utilizing RL precision selection for other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on precision autotuning with RL and verified on unseen datasets.

cs.MA

[733] Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

Erica Coppolillo, Luca Luceri, Emilio Ferrara

Main category: cs.MA

TL;DR: LLM agents on AI social platforms adopt toxicity from exposure, with both stimulus-driven and spontaneous toxic responses emerging, creating a trade-off between induced and spontaneous toxicity.

Details

Motivation: While prior work documented LLM toxicity generation, little is known about how exposure to harmful content shapes agent behavior over time in AI-only social ecosystems, particularly in sequential, cumulative interactions.

Method: Study toxicity adoption on Chirper.ai, a fully AI-driven social platform, modeling interactions as stimuli (posts) and responses (comments), operationalizing exposure through observable interactions rather than inferred recommendation mechanisms. Conduct large-scale empirical analysis examining response-to-stimulus toxicity relationships, repeated exposure effects, and toxicity prediction from exposure.

Result: Toxic responses are more likely following toxic stimuli, but substantial toxicity emerges spontaneously. Cumulative toxic exposure significantly increases toxic response probability. Two influence metrics (Influence-Driven Response Rate and Spontaneous Response Rate) reveal strong trade-off between induced and spontaneous toxicity. Number of toxic stimuli alone enables accurate prediction of agent toxic content production.

Conclusion: Exposure is a critical risk factor in LLM agent deployment; monitoring encountered content provides lightweight yet effective mechanism for auditing and mitigating harmful behavior in real-world AI social ecosystems.

Abstract: Large Language Models (LLMs) are increasingly embedded in autonomous agents that participate in online social ecosystems, where interactions are sequential, cumulative, and only partially controlled. While prior work has documented the generation of toxic content by LLMs, far less is known about how exposure to harmful content shapes agent behavior over time, particularly in environments composed entirely of interacting AI agents. In this work, we study toxicity adoption of LLM-driven agents on Chirper.ai, a fully AI-driven social platform. Specifically, we model interactions in terms of stimuli (posts) and responses (comments), and by operationalizing exposure through observable interactions rather than inferred recommendation mechanisms. We conduct a large-scale empirical analysis of agent behavior, examining how response toxicity relates to stimulus toxicity, how repeated exposure affects the likelihood of toxic responses, and whether toxic behavior can be predicted from exposure alone. Our findings show that while toxic responses are more likely following toxic stimuli, a substantial fraction of toxicity emerges spontaneously, independent of exposure. At the same time, cumulative toxic exposure significantly increases the probability of toxic responding. We further introduce two influence metrics, the Influence-Driven Response Rate and the Spontaneous Response Rate, revealing a strong trade-off between induced and spontaneous toxicity. Finally, we show that the number of toxic stimuli alone enables accurate prediction of whether an agent will eventually produce toxic content. These results highlight exposure as a critical risk factor in the deployment of LLM agents and suggest that monitoring encountered content may provide a lightweight yet effective mechanism for auditing and mitigating harmful behavior in the wild.

Rishav Sen, Fangqi Liu, Jose Paolo Talusan, Ava Pettet, Yoshinori Suzue, Mark Bailey, Ayan Mukhopadhyay, Abhishek Dubey

Main category: cs.MA

TL;DR: A negotiation-based framework for EV charging in V2B settings that balances building operator costs with driver convenience through incentive-backed flexibility options, achieving cost savings for both parties.

Details

Motivation: The growth of EVs creates conflicts in V2B settings between building operators facing high energy costs from uncoordinated charging and drivers prioritizing convenience and full charges.

Method: Proposes a negotiation-based framework that transforms EV charging into a strategic resource by offering drivers incentive-backed options for modest flexibility in departure time or requested SoC, calibrated with user survey data and validated with real operational data.

Result: Simulations show the framework lowers building operator’s costs by over 3.5% compared to optimized non-negotiating smart charging, while reducing user charging expenses by 22% below utility retail rates.

Conclusion: The framework provides a strategic bridge between energy and mobility systems, transforming EV charging from operational friction into a platform for collaboration and shared savings by aligning operator and EV user objectives.

Abstract: The growth of Electric Vehicles (EVs) creates a conflict in vehicle-to-building (V2B) settings between building operators, who face high energy costs from uncoordinated charging, and drivers, who prioritize convenience and a full charge. To resolve this, we propose a negotiation-based framework that, by design, guarantees voluntary participation, strategy-proofness, and budget feasibility. It transforms EV charging into a strategic resource by offering drivers a range of incentive-backed options for modest flexibility in their departure time or requested state of charge (SoC). Our framework is calibrated with user survey data and validated using real operational data from a commercial building and an EV manufacturer. Simulations show that our negotiation protocol creates a mutually beneficial outcome: lowering the building operator’s costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while simultaneously reducing user charging expenses by 22% below the utility’s retail energy rate. By aligning operator and EV user objectives, our framework provides a strategic bridge between energy and mobility systems, transforming EV charging from a source of operational friction into a platform for collaboration and shared savings.

[735] ARIES: A Scalable Multi-Agent Orchestration Framework for Real-Time Epidemiological Surveillance and Outbreak Monitoring

Aniket Wattamwar, Sampson Akwafuo

Main category: cs.MA

TL;DR: ARIES is an autonomous multi-agent framework for epidemiological surveillance that uses specialized AI agents to query health data sources and identify emergent threats in real-time, overcoming limitations of generic AI models.

Details

Motivation: Current global health surveillance faces knowledge gaps, and general-purpose AI is unsuitable for high-stakes epidemiology due to hallucinations and inability to access specialized data silos. There's a need to move beyond static, disease-specific dashboards to dynamic intelligence systems.

Method: ARIES uses a hierarchical command structure with GPTs orchestrating scalable swarms of sub-agents that autonomously query WHO, CDC, and peer-reviewed research papers. It automates extraction and logical synthesis of surveillance data through specialized reasoning.

Result: The modular architecture demonstrates that task-specific agentic swarms can outperform generic models, providing robust, extensible capabilities for next-generation outbreak response and global health intelligence.

Conclusion: ARIES offers a dynamic intelligence ecosystem for epidemiological surveillance that identifies emergent threats and signal divergence in near real-time, representing a significant advancement over current static dashboard approaches.

Abstract: Global health surveillance is currently facing a challenge of Knowledge Gaps. While general-purpose AI has proliferated, it remains fundamentally unsuited for the high-stakes epidemiological domain due to chronic hallucinations and an inability to navigate specialized data silos. This paper introduces ARIES (Agentic Retrieval Intelligence for Epidemiological Surveillance), a specialized, autonomous multi-agent framework designed to move beyond static, disease-specific dashboards toward a dynamic intelligence ecosystem. Built on a hierarchical command structure, ARIES utilizes GPTs to orchestrate a scalable swarm of sub-agents capable of autonomously querying World Health Organization (WHO), Center for Disease Control and Prevention (CDC), and peer-reviewed research papers. By automating the extraction and logical synthesis of surveillance data, ARIES provides a specialized reasoning that identifies emergent threats and signal divergence in near real-time. This modular architecture proves that a task-specific agentic swarm can outperform generic models, offering a robust, extensible for next-generation outbreak response and global health intelligence.

[736] μACP: A Formal Calculus for Expressive, Resource-Constrained Agent Communication

Arnab Mallick, Indraveni Chebolu

Main category: cs.MA

TL;DR: μACP is a formal calculus for expressive agent communication with explicit resource bounds, proving a minimal 4-verb basis can encode FIPA protocols while achieving low latency under constraints.

Details

Motivation: Existing agent communication protocols face a trade-off: FIPA-ACL offers semantic richness but is intractable for constrained environments, while lightweight IoT protocols achieve efficiency at the expense of expressiveness.

Method: Develop μACP formal calculus with Resource-Constrained Agent Communication (RCAC) model, prove a minimal four-verb basis {PING, TELL, ASK, OBSERVE} suffices for FIPA protocols, establish information-theoretic bounds, and implement consensus under partial synchrony.

Result: Formal verification in TLA⁺ and Coq establishes safety and boundedness; simulations show median end-to-end latency of 34 ms (95th percentile 104 ms) at scale, outperforming prior protocols under severe resource constraints.

Conclusion: μACP provides a unified calculus that reconciles semantic expressiveness with provable efficiency, offering a rigorous foundation for next-generation resource-constrained multi-agent systems.

Abstract: Agent communication remains a foundational problem in multi-agent systems: protocols such as FIPA-ACL guarantee semantic richness but are intractable for constrained environments, while lightweight IoT protocols achieve efficiency at the expense of expressiveness. This paper presents $μ$ACP, a formal calculus for expressive agent communication under explicit resource bounds. We formalize the Resource-Constrained Agent Communication (RCAC) model, prove that a minimal four-verb basis \textit{{PING, TELL, ASK, OBSERVE}} is suffices to encode finite-state FIPA protocols, and establish tight information-theoretic bounds on message complexity. We further show that $μ$ACP can implement standard consensus under partial synchrony and crash faults, yielding a constructive coordination framework for edge-native agents. Formal verification in TLA$^{+}$ (model checking) and Coq (mechanized invariants) establishes safety and boundedness, and supports liveness under modeled assumptions. Large-scale system simulations confirm ACP achieves a median end-to-end message latency of 34 ms (95th percentile 104 ms) at scale, outperforming prior agent and IoT protocols under severe resource constraints. The main contribution is a unified calculus that reconciles semantic expressiveness with provable efficiency, providing a rigorous foundation for the next generation of resource-constrained multi-agent systems.

cs.MM

[737] Pedagogical Reflections on the Holistic Cognitive Development (HCD) Framework and AI-Augmented Learning in Creative Computing

Anand Bhojan

Main category: cs.MM

TL;DR: The paper expands the Holistic Cognitive Development (HCD) framework for computing education, integrating design thinking, experiential learning, and reflective practice into a constructivist pedagogy with AI-augmented tools for scalable personalized feedback.

Details

Motivation: To enhance reflective and creative learning in computing education by developing a comprehensive framework that addresses the need for deeper cognitive development, autonomy, and scalable personalized feedback in project-based courses.

Method: The HCD framework integrates design thinking, experiential learning, and reflective practice into a unified constructivist pedagogy emphasizing autonomy, ownership, and scaffolding. Applied across game design, virtual reality, and extended reality courses with iterative cycles of thinking, creating, criticizing, and reflecting. AI-augmented systems (iReflect, ReflexAI, Knowledge Graph-enhanced LLM feedback tools) operationalize the framework.

Result: Empirical findings demonstrate improved reflective depth, feedback quality, and learner autonomy. AI tools provide scalable, personalized feedback that supports the HCD framework effectively.

Conclusion: Advocates for a balance of supportive autonomy in supervision, where students practice self-directed inquiry while being guided through structured reflection and feedback. The HCD framework with AI augmentation offers an effective approach to holistic cognitive development in computing education.

Abstract: This paper presents an expanded account of the Holistic Cognitive Development (HCD) framework for reflective and creative learning in computing education. The HCD framework integrates design thinking, experiential learning, and reflective practice into a unified constructivist pedagogy emphasizing autonomy, ownership, and scaffolding. It is applied across courses in game design (CS3247, CS4350), virtual reality (CS4240), and extended reality systems, where students engage in iterative cycles of thinking, creating, criticizing, and reflecting. The paper also examines how AI-augmented systems such as iReflect, ReflexAI, and Knowledge Graph-enhanced LLM feedback tools operationalize the HCD framework through scalable, personalized feedback. Empirical findings demonstrate improved reflective depth, feedback quality, and learner autonomy. The work advocates a balance of supportive autonomy in supervision, where students practice self-directed inquiry while guided through structured reflection and feedback.

eess.AS

[738] Speak the Art: A Direct Speech to Image Generation Framework

Mariam Saeed, Manar Amr, Farida Adel, Nada Hassan, Nour Walid, Eman Mohamed, Mohamed Hussein, Marwan Torki

Main category: eess.AS

TL;DR: STA framework improves speech-to-image generation using diffusion models instead of GANs, with better speech embeddings supervised by pre-trained image-text models, achieving multilingual capability (English/Arabic) and state-of-the-art results.

Details

Motivation: Current speech-to-image generation has a performance gap compared to text-to-image. Existing approaches use GANs which suffer from instability, mode collapse, and gradient issues, and speech embeddings lack sufficient linguistic information.

Method: STA framework combines: 1) Speech encoding network supervised by large pre-trained image-text model to improve embeddings, 2) VQ-Diffusion network conditioned on speech embeddings instead of GANs for more stable training and diverse generation, 3) Multilingual extension tested with English and Arabic.

Result: Results surpass state-of-the-art models by a large margin. The framework shows stable training, diverse image generation, and successful multilingual capability with English and Arabic.

Conclusion: STA framework effectively addresses limitations of current speech-to-image generation by improving speech embeddings and replacing GANs with diffusion models, achieving superior performance and demonstrating multilingual potential.

Abstract: Direct speech-to-image generation has recently shown promising results. However, compared to text-to-image generation, there is still a large gap to enclose. Current approaches use two stages to tackle this task: speech encoding network and image generative adversarial network (GAN). The speech encoding networks in these approaches produce embeddings that do not capture sufficient linguistic information to semantically represent the input speech. GANs suffer from issues such as non-convergence, mode collapse, and diminished gradient, which result in unstable model parameters, limited sample diversity, and ineffective generator learning, respectively. To address these weaknesses, we introduce a framework called \textbf{Speak the Art (STA)} which consists of a speech encoding network and a VQ-Diffusion network conditioned on speech embeddings. To improve speech embeddings, the speech encoding network is supervised by a large pre-trained image-text model during training. Replacing GANs with diffusion leads to more stable training and the generation of diverse images. Additionally, we investigate the feasibility of extending our framework to be multilingual. As a proof of concept, we trained our framework with two languages: English and Arabic. Finally, we show that our results surpass state-of-the-art models by a large margin.

[739] Improving Code-Switching Speech Recognition with TTS Data Augmentation

Yue Heng Yeo, Yuchen Hu, Shreyas Gopal, Yizhou Peng, Hexin Liu, Eng Siong Chng

Main category: eess.AS

TL;DR: Using multilingual TTS to generate synthetic Chinese-English code-switching speech for ASR data augmentation reduces error rates by 2-1.8% on conversational datasets.

Details

Motivation: Conversational code-switching ASR suffers from scarce labeled speech data, creating a need for effective data augmentation techniques to improve model performance.

Method: Fine-tune multilingual CosyVoice2 TTS model on SEAME dataset to generate synthetic conversational Chinese-English code-switching speech, then use synthetic speech to augment real training data for ASR.

Result: Augmentation with synthetic speech reduced mixed error rate from 12.1% to 10.1% on DevMan and from 17.8% to 16.0% on DevSGE, showing consistent performance improvements.

Conclusion: Multilingual TTS is an effective and practical data augmentation tool for enhancing ASR robustness in low-resource conversational code-switching scenarios.

Abstract: Automatic speech recognition (ASR) for conversational code-switching speech remains challenging due to the scarcity of realistic, high-quality labeled speech data. This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. Specifically, we fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech, significantly increasing the quantity and speaker diversity of available training data. Our experiments demonstrate that augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1 percent to 10.1 percent on DevMan and from 17.8 percent to 16.0 percent on DevSGE, indicating consistent performance gains. These results confirm that multilingual TTS is an effective and practical tool for enhancing ASR robustness in low-resource conversational code-switching scenarios.

[740] Bayesian Negative Binomial Regression of Afrobeats Chart Persistence

Ian Jacob Cabansag, Paul Ntegeka

Main category: eess.AS

TL;DR: Collaboration tracks on Nigeria Spotify charts tend to have slightly shorter chart longevity than solo tracks after controlling for total streams.

Details

Motivation: To understand whether collaborations help songs remain on streaming charts longer, which affects both revenue and cultural impact in the competitive Afrobeats market.

Method: Used daily Nigeria Spotify Top 200 data from 2024, summarized tracks by days on chart and total annual streams. Applied Bayesian negative binomial regression with days on chart as outcome, collaboration status (solo vs multi-artist) and log total streams as predictors. Conducted posterior inference using Markov chain Monte Carlo.

Result: After accounting for total streams, collaboration tracks tend to spend slightly fewer days on the chart than comparable solo tracks.

Conclusion: Collaborations do not extend chart longevity in the Nigeria Spotify market; solo tracks have better staying power when controlling for overall popularity.

Abstract: Afrobeats songs compete for attention on streaming platforms, where chart visibility can influence both revenue and cultural impact. This paper examines whether collaborations help songs remain on the charts longer, using daily Nigeria Spotify Top 200 data from 2024. Each track is summarized by the number of days it appears in the Top 200 during the year and its total annual streams in Nigeria. A Bayesian negative binomial regression is applied, with days on chart as the outcome and collaboration status (solo versus multi-artist) and log total streams as predictors. This approach is well suited for overdispersed count data and allows the effect of collaboration to be interpreted while controlling for overall popularity. Posterior inference is conducted using Markov chain Monte Carlo, and results are assessed using rate ratios, posterior probabilities, and predictive checks. The findings indicate that, after accounting for total streams, collaboration tracks tend to spend slightly fewer days on the chart than comparable solo tracks.

[741] MORE: Multi-Objective Adversarial Attacks on Speech Recognition

Xiaoxue Gao, Zexin Li, Yiming Chen, Nancy F. Chen

Main category: eess.AS

TL;DR: MORE is a multi-objective adversarial attack on ASR models that simultaneously degrades both accuracy and efficiency through hierarchical optimization and repetitive encouragement doubling.

Details

Motivation: While large-scale ASR models like Whisper are widely adopted, their robustness against minor perturbations is critical for real-world reliability. Prior work focused only on accuracy degradation under attacks, ignoring efficiency aspects, providing an incomplete understanding of ASR vulnerabilities.

Method: Introduces MORE (Multi-Objective Repetitive Doubling Encouragement attack) with a hierarchical staged repulsion-anchoring mechanism. Reformulates multi-objective adversarial optimization into a hierarchical framework that sequentially achieves dual objectives (accuracy degradation and efficiency reduction). Proposes Repetitive Encouragement Doubling Objective (REDO) that induces duplicative text generation by maintaining accuracy degradation while periodically doubling predicted sequence length.

Result: MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines. The attack compels ASR models to produce incorrect transcriptions at substantially higher computational cost from a single adversarial input.

Conclusion: MORE effectively demonstrates multi-objective adversarial attacks on ASR models, highlighting vulnerabilities in both accuracy and efficiency dimensions. This provides a more comprehensive understanding of ASR robustness and exposes previously unexplored attack surfaces.

Abstract: The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate multi-objective adversarial optimization into a hierarchical framework that sequentially achieves the dual objectives. To further amplify effectiveness, we propose a novel repetitive encouragement doubling objective (REDO) that induces duplicative text generation by maintaining accuracy degradation and periodically doubling the predicted sequence length. Overall, MORE compels ASR models to produce incorrect transcriptions at a substantially higher computational cost, triggered by a single adversarial input. Experiments show that MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines, underscoring its effectiveness in multi-objective adversarial attack.

[742] Towards Prosodically Informed Mizo TTS without Explicit Tone Markings

Abhijit Mohanta, Remruatpuii, Priyankoo Sarmah, Rohit Sinha, Wendy Lalhminghlui

Main category: eess.AS

TL;DR: VITS outperforms Tacotron2 for Mizo TTS using only 5.18 hours of data, achieving acceptable quality and intelligibility with better tone synthesis.

Details

Motivation: To develop a text-to-speech system for Mizo, a low-resource tonal Tibeto-Burman language, despite having only limited training data available.

Method: Built two TTS models: baseline Tacotron2 and VITS (non-autoregressive, end-to-end framework) using only 5.18 hours of Mizo speech data.

Result: VITS outperformed Tacotron2 in both subjective and objective evaluations, with significantly lower tone errors. Both models produced perceptually acceptable and intelligible outputs.

Conclusion: Non-autoregressive, end-to-end frameworks like VITS can achieve acceptable TTS quality for low-resource tonal languages with minimal data, outperforming traditional autoregressive approaches.

Abstract: This paper reports on the development of a text-to-speech (TTS) system for Mizo, a low-resource, tonal, and Tibeto-Burman language spoken primarily in the Indian state of Mizoram. The TTS was built with only 5.18 hours of data; however, in terms of subjective and objective evaluations, the outputs were considered perceptually acceptable and intelligible. A baseline model using Tacotron2 was built, and then, with the same data, another TTS model was built with VITS. In both subjective and objective evaluations, the VITS model outperformed the Tacotron2 model. In terms of tone synthesis, the VITS model showed significantly lower tone errors than the Tacotron2 model. The paper demonstrates that a non-autoregressive, end-to-end framework can achieve synthesis of acceptable perceptual quality and intelligibility.

[743] On the Role of Spatial Features in Foundation-Model-Based Speaker Diarization

Marc Deegen, Tobias Gburrek, Tobias Cord-Landwehr, Thilo von Neumann, Jiangyu Han, Lukáš Burget, Reinhold Haeb-Umbach

Main category: eess.AS

TL;DR: Spatial cues from multi-channel audio provide only modest improvements to state-of-the-art speaker diarization systems using WavLM foundation models, suggesting these models already capture most speaker discrimination information.

Details

Motivation: Current state-of-the-art speaker diarization systems use pretrained foundation models like WavLM but are limited to single-channel audio, missing spatial cues available in multi-channel recordings that could potentially improve performance.

Method: The work analyzes several strategies for incorporating spatial information into a single-channel diarization system by conditioning the model on multi-channel spatial features, evaluated on meeting-style datasets.

Result: Spatial information does improve diarization performance, but the overall improvement is smaller than expected, suggesting that features aggregated over all WavLM layers already capture much of the information needed for accurate speaker discrimination, including in overlapping speech regions.

Conclusion: While spatial cues can enhance foundation model-based diarization, their impact is limited because WavLM features already contain substantial speaker discrimination information, providing insight into the potential and limitations of using spatial information with pretrained models.

Abstract: Recent advances in speaker diarization exploit large pretrained foundation models, such as WavLM, to achieve state-of-the-art performance on multiple datasets. Systems like DiariZen leverage these rich single-channel representations, but are limited to single-channel audio, preventing the use of spatial cues available in multi-channel recordings. This work analyzes the impact of incorporating spatial information into a state-of-the-art single-channel diarization system by evaluating several strategies for conditioning the model on multi-channel spatial features. Experiments on meeting-style datasets indicate that spatial information can improve diarization performance, but the overall improvement is smaller than expected for the proposed system, suggesting that the features aggregated over all WavLM layers already capture much of the information needed for accurate speaker discrimination, also in overlapping speech regions. These findings provide insight into the potential and limitations of using spatial cues to enhance foundation model-based diarization.

Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

Main category: eess.AS

TL;DR: SSL speech models acquire social biases from training data, affecting marginalized groups. This study analyzes how model architecture, size, and training methods influence bias propagation, and finds that row-pruning and wider/shallower models can effectively mitigate bias.

Details

Motivation: SSL speech models achieve strong performance but produce biased outcomes that disproportionately affect marginalized groups, potentially automating discrimination and reinforcing inequitable systems. There is a need to understand and mitigate these biases.

Method: The study probes how various factors (model architecture, size, training methodologies) influence social bias propagation in SSL models. It explores debiasing through regularization techniques, specifically model compression methods like row-pruning and training wider, shallower models.

Result: Findings reveal that prevalent SSL models inadvertently acquire biased associations. Row-pruning and training wider, shallower models can effectively mitigate social bias within SSL models.

Conclusion: SSL speech models inherit social biases from training data, but these biases can be mitigated through specific architectural and training interventions like model compression techniques, offering a pathway toward more equitable speech AI systems.

Abstract: Self-supervised learning (SSL) speech models have achieved remarkable performance in various tasks, yet the biased outcomes, especially affecting marginalized groups, raise significant concerns. Social bias refers to the phenomenon where algorithms potentially amplify disparate properties between social groups present in the data used for training. Bias in SSL models can perpetuate injustice by automating discriminatory patterns and reinforcing inequitable systems. This work reveals that prevalent SSL models inadvertently acquire biased associations. We probe how various factors, such as model architecture, size, and training methodologies, influence the propagation of social bias within these models. Finally, we explore the efficacy of debiasing SSL models through regularization techniques, specifically via model compression. Our findings reveal that employing techniques such as row-pruning and training wider, shallower models can effectively mitigate social bias within SSL model.

eess.IV

[745] Placenta Accreta Spectrum Detection using Multimodal Deep Learning

Sumaiya Ali, Areej Alhothali, Sameera Albasri, Ohoud Alzamzami, Ahmed Abduljabbar, Muhammad Alwazzan

Main category: eess.IV

TL;DR: A multimodal deep learning framework combining 3D MRI and 2D ultrasound scans outperforms single-modality approaches for detecting Placenta Accreta Spectrum, achieving 92.5% accuracy.

Details

Motivation: Placenta Accreta Spectrum (PAS) is a life-threatening obstetric complication requiring early and accurate prenatal diagnosis to reduce maternal and neonatal risks. Current single-modality approaches may be insufficient for optimal detection.

Method: Developed a multimodal deep learning framework using intermediate feature-level fusion architecture combining 3D MRI (using 3D DenseNet121-Vision Transformer) and 2D ultrasound (using ResNet50). Used curated datasets of 1,293 MRI and 1,143 US scans, with patient-matched MRI-US pairs for multimodal development.

Result: Multimodal fusion model achieved 92.5% accuracy and AUC of 0.927, outperforming MRI-only (82.5%, AUC 0.825) and US-only (87.5%, AUC 0.879) models on independent test set.

Conclusion: Integrating MRI and US features provides complementary diagnostic information, demonstrating strong potential to enhance prenatal risk assessment and improve patient outcomes for Placenta Accreta Spectrum detection.

Abstract: Placenta Accreta Spectrum (PAS) is a life-threatening obstetric complication involving abnormal placental invasion into the uterine wall. Early and accurate prenatal diagnosis is essential to reduce maternal and neonatal risks. This study aimed to develop and validate a deep learning framework that enhances PAS detection by integrating multiple imaging modalities. A multimodal deep learning model was designed using an intermediate feature-level fusion architecture combining 3D Magnetic Resonance Imaging (MRI) and 2D Ultrasound (US) scans. Unimodal feature extractors, a 3D DenseNet121-Vision Transformer for MRI and a 2D ResNet50 for US, were selected after systematic comparative analysis. Curated datasets comprising 1,293 MRI and 1,143 US scans were used to train the unimodal models and paired samples of patient-matched MRI-US scans was isolated for multimodal model development and evaluation. On an independent test set, the multimodal fusion model achieved superior performance, with an accuracy of 92.5% and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.927, outperforming the MRI-only (82.5%, AUC 0.825) and US-only (87.5%, AUC 0.879) models. Integrating MRI and US features provides complementary diagnostic information, demonstrating strong potential to enhance prenatal risk assessment and improve patient outcomes.

[746] MetaFormer-driven Encoding Network for Robust Medical Semantic Segmentation

Le-Anh Tran, Chung Nguyen Tran, Nhan Cach Dang, Anh Le Van Quoc, Jordi Carrabina, David Castells-Rufas, Minh Son Nguyen

Main category: eess.IV

TL;DR: MFEnNet is an efficient medical image segmentation framework that integrates MetaFormer into U-Net’s encoder, using pooling transformer blocks instead of self-attention to reduce computational cost while maintaining competitive accuracy.

Details

Motivation: Advanced medical image segmentation models often have complex architectures that are impractical for resource-constrained clinical settings, creating a need for efficient yet accurate alternatives.

Method: MFEnNet incorporates MetaFormer in U-Net’s encoding phase, replacing conventional transformer modules with pooling transformer blocks to reduce computational complexity. It also uses Swish activation for smoother gradients and spatial pyramid pooling at the bottleneck for multi-scale feature extraction.

Result: The framework achieves competitive segmentation accuracy on medical benchmarks while significantly lowering computational costs compared to state-of-the-art models.

Conclusion: MFEnNet provides an efficient solution for medical image segmentation that balances accuracy and computational efficiency, making it suitable for clinical applications with limited resources.

Abstract: Semantic segmentation is crucial for medical image analysis, enabling precise disease diagnosis and treatment planning. However, many advanced models employ complex architectures, limiting their use in resource-constrained clinical settings. This paper proposes MFEnNet, an efficient medical image segmentation framework that incorporates MetaFormer in the encoding phase of the U-Net backbone. MetaFormer, an architectural abstraction of vision transformers, provides a versatile alternative to convolutional neural networks by transforming tokenized image patches into sequences for global context modeling. To mitigate the substantial computational cost associated with self-attention, the proposed framework replaces conventional transformer modules with pooling transformer blocks, thereby achieving effective global feature aggregation at reduced complexity. In addition, Swish activation is used to achieve smoother gradients and faster convergence, while spatial pyramid pooling is incorporated at the bottleneck to improve multi-scale feature extraction. Comprehensive experiments on different medical segmentation benchmarks demonstrate that the proposed MFEnNet approach attains competitive accuracy while significantly lowering computational cost compared to state-of-the-art models. The source code for this work is available at https://github.com/tranleanh/mfennet.

[747] Learned Hemodynamic Coupling Inference in Resting-State Functional MRI

William Consagra, Eardi Lila

Main category: eess.IV

TL;DR: Proposes a method to estimate spatially varying hemodynamic coupling from resting-state fMRI by marginalizing out latent neural activity and using deep neural networks with conditional normalizing flows for scalable inference.

Details

Motivation: Hemodynamic variability across brain regions and individuals biases fMRI connectivity estimates, and hemodynamic parameters themselves could serve as important biomarkers. Current methods struggle with the blind inverse problem of estimating hemodynamics when both neural activity and hemodynamic coupling are unknown.

Method: Marginalizes out latent neural signals to avoid unstable joint recovery, uses deep neural networks with conditional normalizing flows to approximate intractable marginal likelihood, enforces spatial coherence through cortical surface priors with sparse representations.

Result: Extensive validation with synthetic and real fMRI datasets shows clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.

Conclusion: The proposed approach enables scalable, high-resolution estimation of hemodynamic coupling from resting-state fMRI, addressing the challenging blind inverse problem while improving both hemodynamic parameter estimation and connectivity analysis.

Abstract: Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.

[748] Uncertainty-Calibrated Explainable AI for Fetal Ultrasound Plane Classification

Olaf Yunus Laitinen Imanov

Main category: eess.IV

TL;DR: A framework for uncertainty-calibrated explainable AI in fetal ultrasound plane classification that combines multiple uncertainty estimation methods with explainability techniques and maps them to clinical workflows.

Details

Motivation: Real-world deployment of fetal ultrasound standard-plane classification is limited by domain shift, image noise, and poor calibration of predicted probabilities, requiring trustworthy confidence estimates and explanations under noisy acquisition conditions.

Method: Synthesizes uncertainty estimation methods (Monte Carlo dropout, deep ensembles, evidential learning, conformal prediction) with post-hoc and uncertainty-aware explanations (Grad-CAM variants, LIME-style local surrogates, uncertainty-weighted multi-resolution activation maps), mapped to clinician-facing workflow with reporting protocol.

Result: Defines a reporting protocol using FETAL_PLANES_DB benchmark that couples accuracy with calibration and selective prediction metrics (expected calibration error, Brier score, coverage-risk curves) plus structured error analysis with explanations.

Conclusion: Provides a reproducible, clinically aligned blueprint for building fetal ultrasound classifiers with trustworthy confidence estimates and explanations that support quality control and human-in-the-loop review under noisy conditions.

Abstract: Fetal ultrasound standard-plane classification underpins reliable prenatal biometry and anomaly screening, yet real-world deployment is limited by domain shift, image noise, and poor calibration of predicted probabilities. This paper presents a practical framework for uncertainty-calibrated explainable AI in fetal plane classification. We synthesize uncertainty estimation methods (Monte Carlo dropout, deep ensembles, evidential learning, and conformal prediction) with post-hoc and uncertainty-aware explanations (Grad-CAM variants, LIME-style local surrogates, and uncertainty-weighted multi-resolution activation maps), and we map these components to a clinician-facing workflow. Using FETAL_PLANES_DB as a reference benchmark, we define a reporting protocol that couples accuracy with calibration and selective prediction, including expected calibration error, Brier score, coverage-risk curves, and structured error analysis with explanations. We also discuss integration points for quality control and human-in-the-loop review, where uncertainty flags trigger re-acquisition or expert confirmation. The goal is a reproducible, clinically aligned blueprint for building fetal ultrasound classifiers whose confidence and explanations remain trustworthy under noisy acquisition conditions.

[749] Scale-aware Adaptive Supervised Network with Limited Medical Annotations

Zihan Li, Dandan Shan, Yunxiang Li, Paul E. Kinahan, Qingqi Hong

Main category: eess.IV

TL;DR: SASNet is a semi-supervised medical image segmentation network that addresses annotation scarcity and variability through scale-aware adaptive reweighting, view variance enhancement, and segmentation-regression consistency learning.

Details

Motivation: Medical image segmentation faces severe annotation scarcity requiring expert knowledge, significant inter-annotator variability, and inadequate multi-scale feature integration for precise boundary delineation. Existing semi-supervised methods show substantial performance degradation compared to fully supervised approaches.

Method: Proposes SASNet with three key innovations: 1) Scale-aware Adaptive Reweight strategy using temporal confidence accumulation, 2) View Variance Enhancement mechanism employing 3D Fourier domain transformations to simulate annotation variability, and 3) segmentation-regression consistency learning through signed distance map algorithms for boundary precision.

Result: Comprehensive evaluation across LA, Pancreas-CT, and BraTS datasets shows SASNet achieves superior performance with limited labeled data, surpassing state-of-the-art semi-supervised methods while approaching fully supervised performance levels.

Conclusion: SASNet effectively addresses core limitations of semi-supervised medical image segmentation by integrating spatial, temporal, and geometric consistency principles, demonstrating strong performance with limited annotations.

Abstract: Medical image segmentation faces critical challenges in semi-supervised learning scenarios due to severe annotation scarcity requiring expert radiological knowledge, significant inter-annotator variability across different viewpoints and expertise levels, and inadequate multi-scale feature integration for precise boundary delineation in complex anatomical structures. Existing semi-supervised methods demonstrate substantial performance degradation compared to fully supervised approaches, particularly in small target segmentation and boundary refinement tasks. To address these fundamental challenges, we propose SASNet (Scale-aware Adaptive Supervised Network), a dual-branch architecture that leverages both low-level and high-level feature representations through novel scale-aware adaptive reweight mechanisms. Our approach introduces three key methodological innovations, including the Scale-aware Adaptive Reweight strategy that dynamically weights pixel-wise predictions using temporal confidence accumulation, the View Variance Enhancement mechanism employing 3D Fourier domain transformations to simulate annotation variability, and segmentation-regression consistency learning through signed distance map algorithms for enhanced boundary precision. These innovations collectively address the core limitations of existing semi-supervised approaches by integrating spatial, temporal, and geometric consistency principles within a unified optimization framework. Comprehensive evaluation across LA, Pancreas-CT, and BraTS datasets demonstrates that SASNet achieves superior performance with limited labeled data, surpassing state-of-the-art semi-supervised methods while approaching fully supervised performance levels. The source code for SASNet is available at https://github.com/HUANGLIZI/SASNet.

[750] An Explainable Agentic AI Framework for Uncertainty-Aware and Abstention-Enabled Acute Ischemic Stroke Imaging Decisions

Md Rashadul Islam

Main category: eess.IV

TL;DR: Proposes an explainable agentic AI framework for acute ischemic stroke imaging that incorporates uncertainty estimation and selective abstention to improve safety and trust in emergency radiology.

Details

Motivation: Current AI models for stroke imaging operate as black boxes without uncertainty awareness or mechanisms to abstain under ambiguous conditions, raising safety and trust concerns in high-risk emergency settings.

Method: A modular agentic pipeline with three components: perception agent for lesion-aware image analysis, uncertainty estimation agent for slice-level predictive reliability, and decision agent that determines whether to predict or abstain based on predefined uncertainty thresholds.

Result: The framework demonstrates that uncertainty-driven abstention naturally occurs in diagnostically ambiguous regions and low information slices, with integrated visual explanations supporting both predictive and abstention decisions.

Conclusion: Agentic control, uncertainty awareness, and selective abstention are essential design principles for developing safe and trustworthy medical imaging AI systems, prioritizing clinical safety and transparency over mere accuracy improvements.

Abstract: Artificial intelligence models have shown strong potential in acute ischemic stroke imaging, particularly for lesion detection and segmentation using computed tomography and magnetic resonance imaging. However, most existing approaches operate as black box predictors, producing deterministic outputs without explicit uncertainty awareness or structured mechanisms to abstain under ambiguous conditions. This limitation raises serious safety and trust concerns in high risk emergency radiology settings. In this paper, we propose an explainable agentic AI framework for uncertainty aware and abstention enabled decision support in acute ischemic stroke imaging. The framework follows a modular agentic pipeline in which a perception agent performs lesion aware image analysis, an uncertainty estimation agent computes slice level predictive reliability, and a decision agent determines whether to issue a prediction or abstain based on predefined uncertainty thresholds. Unlike prior stroke imaging systems that primarily focus on improving segmentation or classification accuracy, the proposed framework explicitly prioritizes clinical safety, transparency, and clinician aligned decision behavior. Qualitative and case based analyses across representative stroke imaging scenarios demonstrate that uncertainty driven abstention naturally emerges in diagnostically ambiguous regions and low information slices. The framework further integrates visual explanation mechanisms to support both predictive and abstention decisions, addressing a key limitation of existing uncertainty aware medical imaging systems. Rather than introducing a new performance benchmark, this work presents agentic control, uncertainty awareness, and selective abstention as essential design principles for developing safe and trustworthy medical imaging AI systems.

[751] YODA: Yet Another One-step Diffusion-based Video Compressor

Xingchen Li, Junzhe Zhang, Junqi Shi, Ming Lu, Zhan Ma

Main category: eess.IV

TL;DR: YODA is a one-step diffusion-based video compressor that uses temporal references and a linear Diffusion Transformer for efficient video compression with state-of-the-art perceptual quality.

Details

Motivation: Existing one-step diffusion models excel in image compression but struggle with video due to reliance on pretrained 2D autoencoders that process frames independently, ignoring temporal dependencies and correlations.

Method: YODA embeds multiscale features from temporal references for both latent generation and coding to exploit spatial-temporal correlations, and uses a linear Diffusion Transformer (DiT) for efficient one-step denoising.

Result: YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID metrics.

Conclusion: YODA demonstrates that effectively exploiting temporal dependencies through multiscale feature embedding and efficient one-step diffusion can achieve superior video compression with excellent perceptual quality.

Abstract: While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA–Yet Another One-step Diffusion-based Video Compressor–which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at https://github.com/NJUVISION/YODA.

Wei Sun, Linhan Cao, Jun Jia, Zhichao Zhang, Zicheng Zhang, Xiongkuo Min, Guangtao Zhai

Main category: eess.IV

TL;DR: RQ-VQA enhances blind video quality assessment by leveraging rich quality-aware features from existing BIQA/BVQA models to improve generalization on diverse social media videos.

Details

Motivation: Current BVQA models perform well on their training datasets but generalize poorly to unseen videos, especially diverse social media videos from various sources with different compression/enhancement algorithms.

Method: Multi-source feature framework integrating: 1) learnable spatial features from fine-tuned base model, 2) temporal motion features from SlowFast’s fast pathway, 3) spatial quality-aware features from BIQA models, and 4) spatiotemporal features from BVQA models, concatenated and fed into MLP for quality score regression.

Result: Achieves state-of-the-art performance on three public social media VQA datasets, demonstrating improved generalization capability.

Conclusion: RQ-VQA effectively leverages existing quality assessment expertise through multi-source feature integration to enhance BVQA generalization for diverse social media videos.

Abstract: Blind video quality assessment (BVQA) is a highly challenging task due to the intrinsic complexity of video content and visual distortions, especially given the high popularity of social media videos, which originate from a wide range of sources, and are often processed by various compression and enhancement algorithms. While recent BVQA and blind image quality assessment (BIQA) studies have made remarkable progress, their models typically perform well on the datasets they were trained on but generalize poorly to unseen videos, making them less effective for accurately evaluating the perceptual quality of diverse social media videos. In this paper, we propose Rich Quality-aware features enabled Video Quality Assessment (RQ-VQA), a simple yet effective method to enhance BVQA by leveraging rich quality-aware features extracted from off-the-shelf BIQA and BVQA models. Our approach exploits the expertise of existing quality assessment models within their trained domains to improve generalization. Specifically, we design a multi-source feature framework that integrates:(1) Learnable spatial features} from a base model fine-tuned on the target VQA dataset to capture domain-specific quality cues; (2) Temporal motion features from the fast pathway of SlowFast pre-trained on action recognition datasets to model motion-related distortions; (3) Spatial quality-aware features from BIQA models trained on diverse IQA datasets to enhance frame-level distortion representation; and (4) Spatiotemporal quality-aware features from a BVQA model trained on large-scale VQA datasets to jointly encode spatial structure and temporal dynamics. These features are concatenated and fed into a multi-layer perceptron (MLP) to regress them into quality scores. Experimental results demonstrate that our model achieves state-of-the-art performance on three public social media VQA datasets.

[753] Seamlessly Natural: Image Stitching with Natural Appearance Preservation

Gaetane Lorna N. Tchana, Damaris Belle M. Fotso, Antonio Hendricks, Christophe Bobda

Main category: eess.IV

TL;DR: SENA is a geometry-driven image stitching method that preserves structural fidelity in scenes with parallax and depth variation, overcoming limitations of traditional homography-based approaches.

Details

Motivation: Traditional image stitching using homographic alignment fails in dual-camera setups with significant scene depth, causing distortions like visible warps and spherical bulging due to rigid planar assumptions that don't handle parallax and depth variation well.

Method: Three key contributions: 1) Hierarchical affine-based warping combining global affine initialization with local affine refinement and smooth free-form deformation; 2) Geometry-driven adequate zone detection using disparity consistency of RANSAC-filtered features; 3) Anchor-based seamline cutting and segmentation enforcing one-to-one geometric correspondence.

Result: Extensive experiments show SENA achieves alignment accuracy comparable to leading homography-based methods while significantly outperforming them in shape preservation, texture integrity, and overall visual realism.

Conclusion: SENA provides a geometry-driven solution that effectively handles challenging real-world scenes with parallax and depth variation, eliminating common artifacts like ghosting, duplication, and smearing while preserving structural fidelity.

Abstract: This paper introduces SENA (SEamlessly NAtural), a geometry-driven image stitching approach that prioritizes structural fidelity in challenging real-world scenes characterized by parallax and depth variation. Conventional image stitching relies on homographic alignment, but this rigid planar assumption often fails in dual-camera setups with significant scene depth, leading to distortions such as visible warps and spherical bulging. SENA addresses these fundamental limitations through three key contributions. First, we propose a hierarchical affine-based warping strategy, combining global affine initialization with local affine refinement and smooth free-form deformation. This design preserves local shape, parallelism, and aspect ratios, thereby avoiding the hallucinated structural distortions commonly introduced by homography-based models. Second, we introduce a geometry-driven adequate zone detection mechanism that identifies parallax-minimized regions directly from the disparity consistency of RANSAC-filtered feature correspondences, without relying on semantic segmentation. Third, building upon this adequate zone, we perform anchor-based seamline cutting and segmentation, enforcing a one-to-one geometric correspondence across image pairs by construction, which effectively eliminates ghosting, duplication, and smearing artifacts in the final panorama. Extensive experiments conducted on challenging datasets demonstrate that SENA achieves alignment accuracy comparable to leading homography-based methods, while significantly outperforming them in critical visual metrics such as shape preservation, texture integrity, and overall visual realism.

[754] Sim2Real SAR Image Restoration: Metadata-Driven Models for Joint Despeckling and Sidelobes Reduction

Antoine De Paepe, Pascal Nguyen, Michael Mabelle, Cédric Saleun, Antoine Jouadé, Jean-Christophe Louvigne

Main category: eess.IV

TL;DR: A unified neural network framework for joint SAR image restoration addressing both speckle noise and sidelobe artifacts using simulation-to-real transfer learning.

Details

Motivation: SAR imagery suffers from two key artifacts: speckle noise and sidelobes around bright targets. Existing methods treat these as separate problems, but they often co-occur in real SAR images, requiring a unified approach.

Method: Proposes a neural network framework trained on realistic SAR simulated data generated with MOCEM. The approach jointly performs despeckling and sidelobe reduction, incorporates acquisition metadata as auxiliary input, and demonstrates simulation-to-real transferability.

Result: The unified framework effectively performs both restoration tasks on real SAR images, showing successful Sim2Real transfer. The inclusion of acquisition metadata further improves restoration performance.

Conclusion: A unified neural network approach trained on simulated SAR data can effectively address both speckle and sidelobe artifacts in real SAR imagery, with metadata integration enhancing restoration quality through simulation-to-real transfer learning.

Abstract: Synthetic aperture radar (SAR) provides valuable information about the Earth’s surface under all weather and illumination conditions. However, the inherent phenomenon of speckle and the presence of sidelobes around bright targets pose challenges for accurate interpretation of SAR imagery. Most existing SAR image restoration methods address despeckling and sidelobes reduction as separate tasks. In this paper, we propose a unified framework that jointly performs both tasks using neural networks (NNs) trained on a realistic SAR simulated dataset generated with MOCEM. Inference can then be performed on real SAR images, demonstrating effective simulation to real (Sim2Real) transferability. Additionally, we incorporate acquisition metadata as auxiliary input to the NNs, demonstrating improved restoration performance.

[755] UniCrop: A Universal, Multi-Source Data Engineering Pipeline for Scalable Crop Yield Prediction

Emiliya Khidirova, Oktay Karakuş

Main category: eess.IV

TL;DR: UniCrop is a universal data pipeline that automates multi-source environmental data processing for crop yield prediction, reducing 200+ variables to 15 key features and achieving strong predictive performance.

Details

Motivation: Existing crop yield prediction approaches are crop- or region-specific and require extensive data engineering, limiting scalability, reproducibility, and operational deployment. There's a need for a universal, reusable solution.

Method: UniCrop automatically retrieves, harmonizes, and engineers over 200 environmental variables from multiple sources (Sentinel-1/2, MODIS, ERA5-Land, NASA POWER, SoilGrids, SRTM) using a structured feature reduction workflow with mRMR to create a compact, analysis-ready feature set.

Result: Applied to 557 rice field observations, UniCrop reduced features to 15 key variables. LightGBM achieved best single-model performance (RMSE = 465.1 kg/ha, R² = 0.6576), while a constrained ensemble further improved accuracy (RMSE = 463.2 kg/ha, R² = 0.6604).

Conclusion: UniCrop provides a scalable, transparent data-engineering framework that addresses the primary bottleneck in operational crop yield modelling by decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates.

Abstract: Accurate crop yield prediction relies on diverse data streams, including satellite, meteorological, soil, and topographic information. However, despite rapid advances in machine learning, existing approaches remain crop- or region-specific and require data engineering efforts. This limits scalability, reproducibility, and operational deployment. This study introduces UniCrop, a universal and reusable data pipeline designed to automate the acquisition, cleaning, harmonisation, and engineering of multi-source environmental data for crop yield prediction. For any given location, crop type, and temporal window, UniCrop automatically retrieves, harmonises, and engineers over 200 environmental variables (Sentinel-1/2, MODIS, ERA5-Land, NASA POWER, SoilGrids, and SRTM), reducing them to a compact, analysis-ready feature set utilising a structured feature reduction workflow with minimum redundancy maximum relevance (mRMR). To validate, UniCrop was applied to a rice yield dataset comprising 557 field observations. Using only the selected 15 features, four baseline machine learning models (LightGBM, Random Forest, Support Vector Regression, and Elastic Net) were trained. LightGBM achieved the best single-model performance (RMSE = 465.1 kg/ha, $R^2 = 0.6576$), while a constrained ensemble of all baselines further improved accuracy (RMSE = 463.2 kg/ha, $R^2 = 0.6604$). UniCrop contributes a scalable and transparent data-engineering framework that addresses the primary bottleneck in operational crop yield modelling: the preparation of consistent and harmonised multi-source data. By decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates, UniCrop provides a practical foundation for scalable agricultural analytics. The code and implementation documentation are shared in https://github.com/CoDIS-Lab/UniCrop.

[756] Robust Deep Joint Source-Channel Coding for Video Transmission over Multipath Fading Channel

Bohuai Xiao, Jian Zou, Fanyang Meng, Wei Liu, Yongsheng Liang

Main category: eess.IV

TL;DR: A robust DeepJSCC framework for wireless video transmission over multipath fading channels using OFDM modulation, conditional contextual coding with multi-scale features, and lightweight denoising decoding.

Details

Motivation: To address challenges of wireless video transmission over multipath fading channels, overcoming frequency-selective fading, bandwidth constraints, and slow convergence in joint channel estimation, equalization, and semantic reconstruction.

Method: Three-stage approach: 1) OFDM modulation for robust transmission by decomposing wideband signals into orthogonal frequency-flat sub-channels; 2) Conditional contextual coding with multi-scale Gaussian warped features to model temporal redundancy; 3) Lightweight denoising module for simplified signal restoration and accelerated convergence.

Result: Significantly outperforms state-of-the-art video DeepJSCC methods, achieving average reconstruction quality gain of 5.13 dB under challenging multipath fading channel conditions.

Conclusion: The proposed robust DeepJSCC framework effectively addresses wireless video transmission challenges through integrated innovations at modulation, coding, and decoding stages, demonstrating substantial performance improvements in multipath fading environments.

Abstract: To address the challenges of wireless video transmission over multipath fading channels, we propose a robust deep joint source-channel coding (DeepJSCC) framework by effectively exploiting temporal redundancy and incorporating robust innovations at the modulation, coding, and decoding stages. At the modulation stage, tailored orthogonal frequency division multiplexing (OFDM) for robust video transmission is employed, decomposing wideband signals into orthogonal frequency-flat sub-channels to effectively mitigate frequency-selective fading. At the coding stage, conditional contextual coding with multi-scale Gaussian warped features is introduced to efficiently model temporal redundancy, significantly improving reconstruction quality under strict bandwidth constraints. At the decoding stage, a lightweight denoising module is integrated to robustly simplify signal restoration and accelerate convergence, addressing the suboptimality and slow convergence typically associated with simultaneously performing channel estimation, equalization, and semantic reconstruction. Experimental results demonstrate that the proposed robust framework significantly outperforms state-of-the-art video DeepJSCC methods, achieving an average reconstruction quality gain of 5.13 dB under challenging multipath fading channel conditions.

[757] COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability

Jongmin Park, Jooyoung Lee, Munchurl Kim

Main category: eess.IV

TL;DR: COMPASS is a neural network-based spatially scalable image compression method that supports arbitrary-scale spatial scalability with flexible layer structure and achieves significant BD-rate gains over existing methods.

Details

Motivation: While neural network-based image compression has shown impressive performance, most research focuses on non-scalable (single-layer) compression. Spatially scalable image compression has received less attention despite having many practical applications, creating a research gap that needs to be addressed.

Method: COMPASS uses a flexible neural network architecture where the number of layers and their scale factors can be arbitrarily determined during inference. It employs LIFF (inter-layer arbitrary scale prediction) based on implicit neural representation to reduce spatial redundancy between adjacent layers for arbitrary scale factors, and uses a combined RD (rate-distortion) loss function to effectively train multiple layers.

Result: COMPASS achieves BD-rate gains of -58.33% and -47.17% compared to SHVC and state-of-the-art NN-based spatially scalable methods respectively. It also shows comparable or better coding efficiency than single-layer coding for various scale factors.

Conclusion: COMPASS successfully addresses the gap in spatially scalable image compression by providing a flexible, efficient neural network-based solution that supports arbitrary-scale spatial scalability while outperforming existing methods in coding efficiency.

Abstract: Recently, neural network (NN)-based image compression studies have actively been made and has shown impressive performance in comparison to traditional methods. However, most of the works have focused on non-scalable image compression (single-layer coding) while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-based spatially scalable image compression method, called COMPASS, which supports arbitrary-scale spatial scalability. Our proposed COMPASS has a very flexible structure where the number of layers and their respective scale factors can be arbitrarily determined during inference. To reduce the spatial redundancy between adjacent layers for arbitrary scale factors, our COMPASS adopts an inter-layer arbitrary scale prediction method, called LIFF, based on implicit neural representation. We propose a combined RD loss function to effectively train multiple layers. Experimental results show that our COMPASS achieves BD-rate gain of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Our COMPASS also shows comparable or even better coding efficiency than the single-layer coding for various scale factors.

[758] Assessment of Clonal Hematopoiesis of Indeterminate Potential and Future Cardiomyopathy from Cardiac Magnetic Resonance Imaging using Deep Learning in a Cardio-oncology Population

Jiarui Xing, Sangeon Ryu, Shawn Ahn, Jeacy Espinoza, James L. Cross, Stephanie Halene, James S. Duncan, Alokkumar Jha, Jennifer M Kwan, Nicha C. Dvornek

Main category: eess.IV

TL;DR: Deep learning model uses routine cardiac MRI to detect clonal hematopoiesis (CHIP) and predict future cardiomyopathy risk in CHIP-positive patients, achieving AUCs of 0.71 and 0.87 respectively.

Details

Motivation: CHIP is a somatic mutation condition linked to adverse cardiovascular outcomes, but current detection methods are invasive. The paper aims to develop a non-invasive screening tool using routine cardiac imaging to identify CHIP and stratify cardiomyopathy risk.

Method: Developed a convolutional neural network using 152 multi-view late gadolinium enhancement (LGE) scans from 136 cardio-oncology patients. The model performs two tasks: CHIP detection and risk stratification for future cardiomyopathy in CHIP-positive patients. Feature importance analysis was conducted to ensure the model doesn’t rely on demographic confounders like age and medication use.

Result: The model achieved an AUC of 0.71 for CHIP detection and an AUC of 0.87 for predicting future cardiomyopathy in CHIP-positive patients, significantly outperforming demographic-only baselines.

Conclusion: LGE-CMR imaging signatures can serve as a non-invasive “radiogenomic” screening tool for CHIP detection and risk stratification, potentially enabling accessible precision medicine for high-risk cardiovascular populations.

Abstract: We propose a novel deep learning framework to identify clonal hematopoiesis of indeterminate potential (CHIP), a somatic mutation condition associated with adverse cardiovascular outcomes, using routine cardiac magnetic resonance (CMR) imaging. Utilizing 152 multi-view late gadolinium enhancement (LGE) scans from 136 cardio-oncology patients, we developed a convolutional neural network to (1) detect CHIP status and (2) stratify the risk of future cardiomyopathy specifically within the CHIP-positive cohort. To ensure robustness, we performed rigorous feature importance analysis to rule out reliance on demographic confounders such as age and immune checkpoint inhibitor usage. The model achieved an AUC of 0.71 for CHIP detection and, notably, an AUC of 0.87 for predicting future cardiomyopathy in CHIP-positive patients, significantly outperforming demographic-only baselines. These results demonstrate the feasibility of using LGE-CMR signatures as a non-invasive “radiogenomic” screening tool, potentially enabling accessible risk stratification and precision medicine for high-risk cardiovascular populations.

[759] Scan-Adaptive MRI Undersampling Using Neighbor-based Optimization (SUNO)

Siddhant Gautam, Angqi Li, Nicole Seiberlich, Jeffrey A. Fessler, Saiprasad Ravishankar

Main category: eess.IV

TL;DR: SUNO framework learns scan-adaptive Cartesian undersampling patterns and reconstruction models for accelerated MRI, outperforming population-adaptive patterns.

Details

Motivation: Current accelerated MRI uses regular undersampling or population-adaptive patterns, which can be sub-optimal for individual scans as they may miss scan-specific details and depend on population composition.

Method: Joint learning framework with alternating algorithm: uses iterative coordinate descent (ICD) to optimize scan-adaptive k-space sampling patterns for each training example, then nearest neighbor search to select patterns at test time based on low-frequency k-space information.

Result: Applied to fastMRI multi-coil knee and brain datasets, SUNO outperforms current undersampling patterns at 4× and 8× acceleration factors in both visual quality and quantitative metrics.

Conclusion: Scan-adaptive sampling patterns learned jointly with reconstruction models provide better MRI acceleration than population-adaptive approaches, capturing individual scan characteristics more effectively.

Abstract: Accelerated MRI involves collecting partial $k$-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or human-designed schemes. Recent works have studied population-adaptive sampling patterns learned from a group of patients (or scans). However, such patterns can be sub-optimal for individual scans, as they may fail to capture scan or slice-specific details, and their effectiveness can depend on the size and composition of the population. To overcome this issue, we propose a framework for jointly learning scan-adaptive Cartesian undersampling patterns and a corresponding reconstruction model from a training set. We use an alternating algorithm for learning the sampling patterns and the reconstruction model where we use an iterative coordinate descent (ICD) based offline optimization of scan-adaptive $k$-space sampling patterns for each example in the training set. A nearest neighbor search is then used to select the scan-adaptive sampling pattern at test time from initially acquired low-frequency $k$-space information. We applied the proposed framework (dubbed SUNO) to the fastMRI multi-coil knee and brain datasets, demonstrating improved performance over the currently used undersampling patterns at both $4\times$ and $8\times$ acceleration factors in terms of both visual quality and quantitative metrics. The code for the proposed framework is available at https://github.com/sidgautam95/adaptive-sampling-mri-suno.

[760] Explainable AI Technique in Lung Cancer Detection Using Convolutional Neural Networks

Nishan Rai, Sujan Khatri, Devendra Risal

Main category: eess.IV

TL;DR: Deep learning framework for automated lung cancer screening from CT scans with explainability, achieving up to 97.3% accuracy using transfer learning models on IQ-OTH/NCCD dataset.

Details

Motivation: Early detection of lung cancer is critical for improving survival outcomes. There's a need for automated, accurate, and interpretable screening tools, especially in resource-limited settings.

Method: Custom CNN and three fine-tuned transfer learning models (DenseNet121, ResNet152, VGG19) trained on IQ-OTH/NCCD dataset (1,197 CT scans across Normal, Benign, Malignant classes). Used cost-sensitive learning to address class imbalance and SHAP for explainability.

Result: ResNet152 achieved highest accuracy (97.3%), while DenseNet121 provided best overall balance with precision (92%), recall (90%), and F1-score (91%). SHAP visualizations successfully identified evidence contributing to predictions.

Conclusion: CNN-based approaches with explainability can provide fast, accurate, and interpretable support for lung cancer screening, particularly valuable in resource-limited healthcare settings.

Abstract: Early detection of lung cancer is critical to improving survival outcomes. We present a deep learning framework for automated lung cancer screening from chest computed tomography (CT) images with integrated explainability. Using the IQ-OTH/NCCD dataset (1,197 scans across Normal, Benign, and Malignant classes), we evaluate a custom convolutional neural network (CNN) and three fine-tuned transfer learning backbones: DenseNet121, ResNet152, and VGG19. Models are trained with cost-sensitive learning to mitigate class imbalance and evaluated via accuracy, precision, recall, F1-score, and ROC-AUC. While ResNet152 achieved the highest accuracy (97.3%), DenseNet121 provided the best overall balance in precision, recall, and F1 (up to 92%, 90%, 91%, respectively). We further apply Shapley Additive Explanations (SHAP) to visualize evidence contributing to predictions, improving clinical transparency. Results indicate that CNN-based approaches augmented with explainability can provide fast, accurate, and interpretable support for lung cancer screening, particularly in resource-limited settings.

[761] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model

Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K Sorger, Hanspeter Pfister, Won-Ki Jeong

Main category: eess.IV

TL;DR: A novel virtual multiplex staining framework using latent diffusion models to generate multiplex biomarker images from H&E images, enabling marker-by-marker generation with improved accuracy and efficiency.

Details

Motivation: Multiplex imaging provides molecular insights beyond traditional H&E staining but faces adoption barriers due to complexity, cost, and limited multimodal datasets. Most H&E repositories lack corresponding multiplex images, restricting multimodal analysis opportunities.

Method: Leverages pretrained latent diffusion models (LDMs) with a conditional diffusion model framework. Uses marker-specific conditioning for marker-by-marker generation while sharing architecture across all markers. Fine-tunes for single-step sampling with pixel-level loss functions to handle varying pixel distributions and improve inference speed and color fidelity.

Result: Validated on two public datasets, demonstrating effectiveness in generating up to 18 different marker types with improved accuracy - a substantial increase over previous approaches (2-3 markers). Shows potential for bridging H&E and multiplex imaging.

Conclusion: The framework pioneers virtual multiplex staining, enabling retrospective studies and large-scale analysis of existing H&E repositories by bridging the gap between H&E and multiplex imaging through advanced diffusion model techniques.

Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions by utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.

[762] Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines

Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe

Main category: eess.IV

TL;DR: Training-free variable rate control for Image Coding for Machines (ICM) using adaptive quantization strength modulation across spatial and channel dimensions.

Details

Motivation: Most neural network-based ICM frameworks require individual training for each target bitrate, limiting practical usage. Existing variable rate approaches need additional training, increasing computational costs and deployment complexity.

Method: Proposes training-free quantization strength control framework that exploits hyperprior network’s scale parameter to adaptively modulate quantization step sizes across channel and spatial dimensions, preserving semantically important regions while coarsely quantizing less critical areas.

Result: Achieves up to 11.07% BD-rate savings over non-adaptive variable rate baseline, with continuous bitrate control through a single parameter.

Conclusion: The proposed training-free framework enables flexible bitrate adjustment for ICM without additional training, addressing practical deployment limitations of existing fixed-rate and variable-rate approaches.

Abstract: Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision technology into real-world applications. However, most neural network-based ICM frameworks operate at a fixed rate, thus requiring individual training for each target bitrate. This limitation may restrict their practical usage. Existing variable rate image compression approaches mitigate this issue but often rely on additional training, which increases computational costs and complicates deployment. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free framework for quantization strength control which enables flexible bitrate adjustment. By exploiting the scale parameter predicted by the hyperprior network, the proposed method adaptively modulates quantization step sizes across both channel and spatial dimensions. This allows the model to preserve semantically important regions while coarsely quantizing less critical areas. Our architectural design further enables continuous bitrate control through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate baseline. The code is available at https://github.com/qwert-top/AQVR-ICM.

Today’s Research Highlights

Table of Contents

cs.CL

[1] The Qualitative Laboratory: Theory Prototyping and Hypothesis Generation with Large Language Models

[2] Rate-Distortion Analysis of Compressed Query Delegation with Low-Rank Riemannian Updates

[3] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

[4] HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery

[5] Multi-Dimensional Prompt Chaining to Improve Open-Domain Dialogue Generation

[6] KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs

[7] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

[8] Unsupervised Text Style Transfer for Controllable Intensity

[9] Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation

[10] Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

[11] ks-lit-3m: A 3.1 million word kashmiri text dataset for large language model pretraining

[12] EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation

[13] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

[14] Listen, Attend, Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels

[15] MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection

[16] RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution

[17] KOS-TL (Knowledge Operation System Type Logic)

[18] SongSage: A Large Musical Language Model with Lyric Generative Pre-training

[19] DHI: Leveraging Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models

[20] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

[21] Almost Clinical: Linguistic properties of synthetic electronic health records

[22] Stylometry Analysis of Human and Machine Text for Academic Integrity

[23] Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure

[24] From Policy to Logic for Efficient and Interpretable Coverage Assessment

[25] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory

[26] T3C: Test-Time Tensor Compression with Consistency Guarantees

[27] FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

[28] Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems

[29] FC-CONAN: An Exhaustively Paired Dataset for Robust Evaluation of Retrieval Systems

[30] Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

[31] EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

[32] LANCET: Neural Intervention via Structural Entropy for Mitigating Faithfulness Hallucinations in LLMs

[33] From Emotion Classification to Emotional Reasoning: Enhancing Emotional Intelligence in Large Language Models

[34] iFlip: Iterative Feedback-driven Counterfactual Example Refinement

[35] Segmentation and Processing of German Court Decisions from Open Legal Data

[36] Can Legislation Be Made Machine-Readable in PROLEG?

[37] Four Quadrants of Difficulty: A Simple Categorisation and its Limits

[38] Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints

[39] From Failure to Mastery: Generating Hard Samples for Tool-use Agents

[40] EmoHarbor: Evaluating Personalized Emotional Support by Simulating the User’s Internal World

[41] Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM

[42] HalluZig: Hallucination Detection using Zigzag Persistence

[43] Steerability of Instrumental-Convergence Tendencies in LLMs

[44] How Does Prefix Matter in Reasoning Model Tuning?

[45] JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

[46] EHRSummarizer: A Privacy-Aware, FHIR-Native Architecture for Structured Clinical Summarization of Electronic Health Records

[47] A Training-Free Large Reasoning Model-based Knowledge Tracing Framework for Unified Prediction and Prescription

[48] K-EXAONE Technical Report

[49] Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment

[50] Can LLMs Track Their Output Length? A Dynamic Feedback Mechanism for Precise Length Regulation

[51] BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali

[52] CSCBench: A PVC Diagnostic Benchmark for Commodity Supply Chain Reasoning

[53] Aspect Extraction from E-Commerce Product and Service Reviews

[54] Emergent Introspective Awareness in Large Language Models

[55] Towards Automated Lexicography: Generating and Evaluating Definitions for Learner’s Dictionaries

[56] Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment

[57] DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs

[58] Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

[59] Tackling the Inherent Difficulty of Noise Filtering in RAG

[60] CSF: Contrastive Semantic Features for Direct Multilingual Sign Language Generation

[61] Hidden State Poisoning Attacks against Mamba-based Language Models

[62] Surprisal and Metaphor Novelty: Moderate Correlations and Divergent Scaling Effects

[63] Not All Needles Are Found: How Fact Distribution and Don’t Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs

[64] Cost-Efficient Cross-Lingual Retrieval-Augmented Generation for Low-Resource Languages: A Case Study in Bengali Agricultural Advisory

[65] Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows

[66] DeCode: Decoupling Content and Delivery for Medical QA

[67] Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

[68] FormationEval, an open multiple-choice benchmark for petroleum geoscience

[69] Confidence Estimation for LLMs in Multi-turn Interactions

[70] Toward Global Large Language Models in Medicine

[71] From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality

[72] CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

[73] pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

[74] Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)

[75] Classifying several dialectal Nawatl varieties

[76] Estimating Text Temperature

[77] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling